I'm trying to run Scrapy in an AWS Lambda function and everything is almost working, except that I need to run two spiders in the one Lambda function. The main catch is that the two spiders need to output to two different JSON files.
The docs look like they've got a very close solution:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
Except that if I pass my settings into the CrawlerProcess, as I currently do:
CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})
then both spiders would output to the one file, fx_today_data.json.
I've tried creating two CrawlerProcess instances, but that gives me the ReactorNotRestartable error, which I've tried solving using this thread, but with no success.
I've also tried running the scrapy code like so:
subprocess.call(["scrapy", "runspider", "./spiders/fx_today_data.py", "-o", "/tmp/fx_today_data.json"])
But this results in the usual 'scrapy' command not found error, because I don't have a virtualenv set up in the Lambda function (and I don't know if it's worth setting one up just for this?).
Does anyone know how to run two Scrapy spiders (they don't have to run at the same time) in one process, and have them output to separate files?
Thanks to Corentin and this guide, I was able to get it working.
By giving each spider its own custom_settings class attribute, I could run them both off the one CrawlerProcess and not have to worry, as they each had their own file output.
The final code looks a lot like the docs example I provided in the question.
I also ended up having to use from multiprocessing.context import Process, and a try block to terminate the process (which may not even have been assigned yet!), in order to make sure I avoided the ReactorNotRestartable error.
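Roughly, the pattern looks like this — a sketch, where the helper name and the None guard for the not-yet-assigned process are my own; the target would be whatever function builds the CrawlerProcess and calls start():

```python
from multiprocessing.context import Process


def run_in_fresh_process(target, *args):
    """Run `target` (e.g. a function that creates a CrawlerProcess
    and calls start()) in a child process, so each invocation gets a
    fresh Twisted reactor instead of hitting ReactorNotRestartable."""
    proc = None
    try:
        proc = Process(target=target, args=args)
        proc.start()
        proc.join()
        return proc.exitcode
    finally:
        # Guard the cleanup: if Process(...) itself raised, `proc`
        # was never assigned, so only terminate a process that
        # exists and is still alive.
        if proc is not None and proc.is_alive():
            proc.terminate()
```

Each crawl then runs in its own short-lived child process, so the reactor is created and torn down cleanly every time the Lambda handler fires.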