I have a website where I want to watch every page for changes and write the new values to MongoDB. I wrote a Python program that uses the multiprocessing module, but it is eating all my resources and making my server inaccessible. What is wrong with it, and does a better solution exist? (I was thinking about Apache Spark Streaming or Kafka Connect to stream every link's updates.)
Update: The problem is that I want to watch 600 web links for changes and update the corresponding values in MongoDB.
My code is below:
import multiprocessing
import pymongo
from pymongo import MongoClient
from bs4 import BeautifulSoup
import sys
import requests
import time
import re

def worker(num, company_id):
    while True:
        url_home = "http://www.example.com/lastinfo?i={}".format(company_id)
        # Retry until the page is reachable and not showing the outage message.
        while True:
            try:
                b = requests.get(url_home, timeout=2.5)
            except:
                time.sleep(2)
            else:
                if "Server" not in b.text and "The service is unavailable." not in b.text:
                    break
                else:
                    time.sleep(2)
        company_document_count = re.findall(r"docCount=(.*),", b.text)[0].split(',')[0]
        print('Worker:', num)
        print("Company ID: " + company_id)
        print("Company Document Count: " + str(company_document_count) + "\n")
        client = MongoClient(host='x', port=x, username="x", password="x")
        db = client['mydb']
        mycollection = db['mycollection']
        last = mycollection.find_one({"company_id": company_id})["info"][0]["document_count"]
        mycollection.update_one(
            {"company_id": company_id, "info.document_count": last},
            {"$set": {"info.$": {"document_count": company_document_count}}})
        client.close()
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        ids = []
        jobs = []
        url = "http://www.example.com/allcompanyIds.aspx"
        while True:
            try:
                r = requests.get(url, timeout=2.5)
            except:
                time.sleep(2)
            else:
                break
        ids = set(re.findall(r"\d{15,20}", r.text))
        for index, i in enumerate(ids):
            p = multiprocessing.Process(target=worker, args=(index, i))
            jobs.append(p)
            p.start()
    except KeyboardInterrupt:
        print('\nExiting by user request.\n')
        sys.exit(0)
I suspect the problem is here:
while True:
    try:
        b = requests.get(url_home, timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)
If requests.get() succeeds, as it usually will, the surrounding loop will effectively send this request over and over without any pause. The way to stop it from eating too many resources (and to stop you from effectively DDoSing whatever URL you're accessing) is to include a sleep, like this:
while True:
    try:
        b = requests.get(url_home, timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)
time.sleep(10)
Of course, depending on how frequently you need this information you might want to vary this sleeping period.
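One way to make that period easy to tune is to pull the loop into a small helper where the interval is a parameter. This is just a sketch; poll, fetch, and the parameter names are mine, not part of the original code:

    import time

    def poll(fetch, interval_seconds=10.0, max_iterations=None):
        """Call fetch() repeatedly, sleeping between calls so the loop
        can never spin at full speed. Tune interval_seconds to trade
        data freshness against the load you put on the site."""
        done = 0
        while max_iterations is None or done < max_iterations:
            fetch()
            done += 1
            time.sleep(interval_seconds)

In the worker above, fetch would wrap the requests.get call and the MongoDB update.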
PS. In my experience, while True should generally be avoided: it often leads to programs getting stuck, and I think it makes the code harder to read.
Thanks for your answer. The while loop is there because sometimes the website is not accessible (the status code is 200 but the page prints "The service is unavailable."), so it sends the request again until it gets a successful response.
But what happens if it works just fine? Try to go through it step by step :) @MobinRanjbar
I removed all the while True loops and replaced them with a retry strategy:

retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504, 408],
    method_whitelist=["HEAD", "GET", "OPTIONS"],
)
adapter = HTTPAdapter(max_retries=retry_strategy)

and CPU/RAM usage decreased to 20% on each core and 900 MB (1000 web links). Thank you, and I will accept your answer.
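For reference, a Retry/HTTPAdapter pair like that only takes effect once it is mounted on a requests.Session. A minimal sketch of the setup (note that newer urllib3 releases renamed method_whitelist to allowed_methods, and the backoff_factor value here is my assumption, not from the comment above):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retry_strategy = Retry(
        total=3,                   # give up after 3 retries
        backoff_factor=2,          # wait between attempts (assumed value; tune to taste)
        status_forcelist=[429, 500, 502, 503, 504, 408],
        allowed_methods=["HEAD", "GET", "OPTIONS"],  # method_whitelist in older urllib3
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    session.mount("http://", adapter)   # use this adapter for all http:// URLs
    session.mount("https://", adapter)  # ...and all https:// URLs
    # session.get(url, timeout=2.5) now retries transient failures automatically

Because the retries (with backoff) happen inside the transport adapter, the hand-rolled while True / time.sleep(2) loops are no longer needed.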