
My Python code is eating all RAM/CPU resources and making the server inaccessible

Published on 2020-11-30 10:30:10

There is a website whose pages I want to watch for changes, updating the new values in MongoDB. I have written a Python program that uses the multiprocessing module, but it is eating all my resources and making my server inaccessible. Tell me what is wrong with it, and whether a better solution exists (I was thinking about Apache Spark Streaming or Kafka Connect to stream every link's updates).

Update: The problem is that I want to watch 600 web links for changes and update the corresponding values in MongoDB.

My code is below:

import multiprocessing
import re
import sys
import time

import requests
from pymongo import MongoClient

def worker(num, company_id):
    while True:
        url_home = "http://www.example.com/lastinfo?i={}".format(company_id)
        # Retry until the page is fetched and does not contain an error banner.
        while True:
            try:
                b = requests.get(url_home,timeout=2.5)
            except:
                time.sleep(2)
            else:
                if "Server" not in b.text and "The service is unavailable." not in b.text:
                    break
                else:
                    time.sleep(2)

        # Pull the document count out of the page body.
        company_document_count = re.findall(r"docCount=(.*),", b.text)[0].split(',')[0]
        print('Worker:', num)
        print("Company ID: " + company_id)
        print("Company Document Count: " + str(company_document_count) + "\n")
        # Open a MongoDB connection, update the stored count for this company,
        # and close the connection again, once per polling iteration.
        client = MongoClient(host='x', port=x, username="x", password="x")
        db = client['mydb']
        mycollection = db['mycollection']
        last = mycollection.find_one({"company_id": company_id})["info"][0]["document_count"]
        mycollection.update_one({"company_id": company_id, "info.document_count": last}, {"$set": {"info.$": {"document_count": company_document_count}}})
        client.close()
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        ids = []
        jobs = []
        url = "http://www.example.com/allcompanyIds.aspx"
        # Retry until the full list of company IDs is fetched.
        while True:
            try:
                r = requests.get(url,timeout=2.5)
            except:
                time.sleep(2)
            else:
                break
        ids = set(re.findall(r"\d{15,20}", r.text))

        # Spawn one long-lived process per company ID.
        for index, i in enumerate(ids):
            p = multiprocessing.Process(target=worker, args=(index, i))
            jobs.append(p)
            p.start()
    except KeyboardInterrupt:
        print('\nExiting by user request.\n')
        sys.exit(0)
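Since the question also asks whether a better solution exists, here is a minimal sketch of one alternative, not from the original thread: a bounded thread pool that sweeps all 600 IDs with a fixed number of workers and one shared MongoClient, instead of one process and one connection per company. The endpoint, credentials, and regex are placeholders carried over from the question; POLL_INTERVAL, MAX_WORKERS, and poll_once are illustrative names, and the update is simplified to write the first info element directly.

import re
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from pymongo import MongoClient

POLL_INTERVAL = 10  # seconds between full sweeps; tune to taste
MAX_WORKERS = 20    # bounded concurrency instead of 600 processes

# One client shared by all threads; pymongo's MongoClient is thread-safe.
client = MongoClient(host='x', port=27017, username="x", password="x")
collection = client['mydb']['mycollection']

def poll_once(company_id):
    """Fetch one company's page and update its stored document count."""
    url = "http://www.example.com/lastinfo?i={}".format(company_id)
    try:
        resp = requests.get(url, timeout=2.5)
        resp.raise_for_status()
    except requests.RequestException:
        return  # skip this round; the next sweep will retry
    counts = re.findall(r"docCount=(.*),", resp.text)
    if not counts:
        return
    collection.update_one({"company_id": company_id},
                          {"$set": {"info.0.document_count": counts[0]}})

def main(ids):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        while True:
            # Poll every company once, at most MAX_WORKERS at a time...
            list(pool.map(poll_once, ids))
            # ...then wait before the next full sweep.
            time.sleep(POLL_INTERVAL)

Polling is I/O-bound, so threads avoid the per-process memory cost while the pool size caps how many requests and database writes happen at once.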
Questioner: Mobin Ranjbar
Answered by Nathan on 2020-11-30 18:36:46

I suspect the problem is here:

while True:
    try:
        b = requests.get(url_home,timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)

If requests.get() works, as it should, this will effectively send the request infinitely many times without pause. The way to stop it from eating too many resources is by including a sleep (this will also stop you from effectively DDoSing whatever URL you're accessing), like this:

while True:
    try:
        b = requests.get(url_home,timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)
    time.sleep(10)

Of course, depending on how frequently you need this information, you might want to vary this sleep period.
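One hedged sketch of varying it, not from the original answer: poll quickly while the page is actually changing and back off geometrically while it is static. MIN_SLEEP, MAX_SLEEP, and next_sleep are arbitrary illustrative choices:

MIN_SLEEP = 10   # fastest poll rate, in seconds
MAX_SLEEP = 300  # slowest poll rate

def next_sleep(current, changed):
    """Poll fast while the page is changing; back off while it is static."""
    if changed:
        return MIN_SLEEP                # reset to the fast rate on a change
    return min(current * 2, MAX_SLEEP)  # otherwise double, up to a ceiling

A worker would call next_sleep() after each fetch and sleep for the returned number of seconds.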

PS: In my experience, using while True should generally be avoided; it often leads to programs getting stuck, and I think it makes the code more difficult to read.
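To illustrate that last point, the same retry loop can be written with a bounded attempt count instead of while True; the cap of 5 attempts here is an arbitrary choice:

import time
import requests

MAX_ATTEMPTS = 5  # arbitrary cap; tune as needed

def fetch(url):
    """Try the request a bounded number of times, then give up explicitly."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            b = requests.get(url, timeout=2.5)
        except requests.RequestException:
            time.sleep(2)
            continue
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            return b
        time.sleep(2)
    return None  # the caller decides what to do after repeated failure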