Tags: aws-lambda, python, scrapy, twisted

Scrapy run 2 spiders with outputs to 2 different files using one process (AWS Lambda)

Published on 2020-03-29 12:48:30

I'm trying to run Scrapy on an AWS Lambda function, and everything is almost working, except that I need to run 2 spiders in the one Lambda function. The main catch is that the 2 spiders need to output to 2 different JSON files.

The docs look like they've got a very close solution:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()

The problem is that if I pass my settings into the CrawlerProcess, as I currently do:

CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})

then both spiders write to the same file, fx_today_data.json.

I've tried creating 2 CrawlerProcess instances, but that gives me the ReactorNotRestartable error. I tried to solve it using this thread, with no success.
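For reference, a minimal sketch of why the two-process attempt fails, reusing the MySpider1/MySpider2 placeholders from the docs example above (file names are illustrative):

from scrapy.crawler import CrawlerProcess

# First crawl works: start() runs the Twisted reactor and blocks
# until the spider finishes, then stops the reactor.
process_one = CrawlerProcess({'FEED_FORMAT': 'json', 'FEED_URI': '/tmp/file_one.json'})
process_one.crawl(MySpider1)
process_one.start()

# Second crawl fails: Twisted reactors cannot be restarted, so this
# raises twisted.internet.error.ReactorNotRestartable.
process_two = CrawlerProcess({'FEED_FORMAT': 'json', 'FEED_URI': '/tmp/file_two.json'})
process_two.crawl(MySpider2)
process_two.start()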

I've also tried running the scrapy code like so:

subprocess.call(["scrapy", "runspider", "./spiders/fx_today_data.py", "-o", "/tmp/fx_today_data.json"])

But this results in the usual 'scrapy' command not found error, because I don't have a virtualenv set up in the Lambda function (I don't know if it's worth setting one up just for this?).
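As an aside, if the subprocess route were still worth pursuing, one possible workaround (assuming the scrapy package is importable in the Lambda runtime even though its console script isn't on PATH) is to invoke it through the interpreter as a module:

import subprocess
import sys

# Run Scrapy as a module via the current interpreter, so the
# 'scrapy' console script doesn't need to be on PATH.
subprocess.call([
    sys.executable, "-m", "scrapy", "runspider",
    "./spiders/fx_today_data.py",
    "-o", "/tmp/fx_today_data.json",
])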

Does anyone know how to run 2 Scrapy Spiders (they don't have to run at the same time) in one process and have them output to separate files?

Questioner: Jamie
Viewed: 110

Answer (Jamie, 2020-02-01 00:18)

Thanks to Corentin and this guide, I was able to get it working.

By giving each spider its own custom_settings class attribute, I could run both off the one CrawlerProcess and not have to worry, as each spider had its own file output.
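For reference, a minimal sketch of that setup (the fx_today name comes from the question; the second spider, the URLs, and the parse logic are placeholders I've assumed; FEED_FORMAT/FEED_URI are the feed settings in use at the time):

import scrapy
from scrapy.crawler import CrawlerProcess

class FxTodaySpider(scrapy.Spider):
    name = 'fx_today'
    start_urls = ['https://example.com/today']  # placeholder
    # Per-spider settings override the process-wide ones,
    # so each spider writes to its own feed file.
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/fx_today_data.json',
    }

    def parse(self, response):
        yield {'url': response.url}  # placeholder parse logic

class FxHistorySpider(scrapy.Spider):
    name = 'fx_history'  # hypothetical second spider
    start_urls = ['https://example.com/history']  # placeholder
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/fx_history_data.json',
    }

    def parse(self, response):
        yield {'url': response.url}  # placeholder parse logic

# One CrawlerProcess, no global feed settings needed:
process = CrawlerProcess()
process.crawl(FxTodaySpider)
process.crawl(FxHistorySpider)
process.start()  # runs both spiders, each writing its own file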

The final code looks a lot like the docs example I provided in the question.

I also ended up having to use from multiprocessing.context import Process, and to wrap the process termination in a try block (since the process may not even have been assigned yet!), in order to make sure I avoided the ReactorNotRestartable error.
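A sketch of how that can fit together in the Lambda handler, assuming the spiders from the sketch above; the handler name and the module-level p bookkeeping are my assumptions, not the exact final code:

from multiprocessing.context import Process

from scrapy.crawler import CrawlerProcess

def run_spiders():
    # The Twisted reactor lives and dies inside this child process,
    # so the parent never hits ReactorNotRestartable, even on a warm
    # Lambda container that handles several invocations.
    process = CrawlerProcess()
    process.crawl(FxTodaySpider)
    process.crawl(FxHistorySpider)
    process.start()

def handler(event, context):  # hypothetical Lambda entry point
    global p
    try:
        # Terminate any process left over from a previous (warm)
        # invocation; on a cold start p has never been assigned, so
        # this raises NameError and the except block swallows it.
        p.terminate()
    except NameError:
        pass
    p = Process(target=run_spiders)
    p.start()
    p.join()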