I have a website where I want to watch every page for changes and write the new values to MongoDB. I wrote a Python program that uses the multiprocessing module, but it is eating all my resources and making my server inaccessible. What is wrong with it, and does a better solution exist? (I was thinking about Apache Spark Streaming or Kafka Connect to stream every link's updates.)
Update: The problem is that I want to watch 600 web links for changes and update the corresponding values in MongoDB.
My code is below:
import multiprocessing
import pymongo
from pymongo import MongoClient
from bs4 import BeautifulSoup
import sys
import requests
import time
import re

def worker(num, company_id):
    while True:
        url_home = "http://www.example.com/lastinfo?i={}".format(company_id)
        # Retry until the page is reachable and not showing the outage message.
        while True:
            try:
                b = requests.get(url_home, timeout=2.5)
            except:
                time.sleep(2)
            else:
                if "Server" not in b.text and "The service is unavailable." not in b.text:
                    break
                else:
                    time.sleep(2)
        company_document_count = re.findall(r"docCount=(.*),", b.text)[0].split(',')[0]
        print('Worker:', num)
        print("Company ID: " + company_id)
        print("Company Document Count: " + str(company_document_count) + "\n")
        client = MongoClient(host='x', port=x, username="x", password="x")
        db = client['mydb']
        mycollection = db['mycollection']
        last = mycollection.find_one({"company_id": company_id})["info"][0]["document_count"]
        mycollection.update_one(
            {"company_id": company_id, "info.document_count": last},
            {"$set": {"info.$": {"document_count": company_document_count}}})
        client.close()
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        ids = []
        jobs = []
        url = "http://www.example.com/allcompanyIds.aspx"
        while True:
            try:
                r = requests.get(url, timeout=2.5)
            except:
                time.sleep(2)
            else:
                break
        ids = set(re.findall(r"\d{15,20}", r.text))
        for index, i in enumerate(ids):
            p = multiprocessing.Process(target=worker, args=(index, i))
            jobs.append(p)
            p.start()
    except KeyboardInterrupt:
        print('\nExiting by user request.\n')
        sys.exit(0)
I suspect the problem is here:
while True:
    try:
        b = requests.get(url_home, timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)
If requests.get() succeeds, as it usually will, the surrounding loop will effectively send this request over and over without any pause. The way to stop it from eating too many resources (and to stop you from effectively DDoSing whatever URL you're accessing) is to include a sleep, like this:
while True:
    try:
        b = requests.get(url_home, timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)
time.sleep(10)
Of course, depending on how frequently you need this information you might want to vary this sleeping period.
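One way to make that period easy to tune is to pull the loop into a small helper where the interval is a parameter. This is just a sketch; poll, fetch, and the parameter names are mine, not part of the original code:

    import time

    def poll(fetch, interval_seconds=10.0, max_iterations=None):
        """Call fetch() repeatedly, sleeping between calls so the loop
        can never spin at full speed. Tune interval_seconds to trade
        data freshness against the load you put on the site."""
        done = 0
        while max_iterations is None or done < max_iterations:
            fetch()
            done += 1
            time.sleep(interval_seconds)

In the worker above, fetch would wrap the requests.get call and the MongoDB update.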
PS. In my experience, while True should generally be avoided: it often leads to programs getting stuck, and I think it makes the code harder to read.
Thanks for your answer. The while loop is there because sometimes the website is not accessible (the status code is 200 but the page prints "The service is unavailable."), so it sends the request again until it gets a successful response.
But what happens if it works just fine? Try to go through it step by step :) @MobinRanjbar
I removed all the while True loops and replaced them with a retry strategy:

retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504, 408],
    method_whitelist=["HEAD", "GET", "OPTIONS"],
)
adapter = HTTPAdapter(max_retries=retry_strategy)

and CPU/RAM usage decreased to 20% on each core and 900 MB (1000 web links). Thank you, and I will accept your answer.
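For reference, a Retry/HTTPAdapter pair like that only takes effect once it is mounted on a requests.Session. A minimal sketch of the setup (note that newer urllib3 releases renamed method_whitelist to allowed_methods, and the backoff_factor value here is my assumption, not from the comment above):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retry_strategy = Retry(
        total=3,                   # give up after 3 retries
        backoff_factor=2,          # wait between attempts (assumed value; tune to taste)
        status_forcelist=[429, 500, 502, 503, 504, 408],
        allowed_methods=["HEAD", "GET", "OPTIONS"],  # method_whitelist in older urllib3
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    session.mount("http://", adapter)   # use this adapter for all http:// URLs
    session.mount("https://", adapter)  # ...and all https:// URLs
    # session.get(url, timeout=2.5) now retries transient failures automatically

Because the retries (with backoff) happen inside the transport adapter, the hand-rolled while True / time.sleep(2) loops are no longer needed.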