Note: this post is translated from stackoverflow.com. Original: python - Scrapy run 2 spiders with outputs to 2 different files using one process (AWS Lambda)

aws-lambda python scrapy twisted

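python - Running 2 Scrapy spiders in one process with output to 2 different files (AWS Lambda)

Posted on 2020-03-29 13:16:37

I am trying to run Scrapy in an AWS Lambda function, and almost everything works, except that I need to run 2 spiders in the 1 function. The main problem is that I need the 2 spiders to output to 2 different JSON files.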

The docs appear to have a solution that comes very close:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()

The catch is that if I pass my settings to CrawlerProcess, as I do now:

CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})

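then both spiders write to the same file, fx_today_data.json.

I tried creating 2 CrawlerProcesses, but that gives me a ReactorNotRestartable error. I tried to solve it using this thread, without success.

I have also tried running the Scrapy code like this: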

import subprocess

subprocess.call(["scrapy", "runspider", "./spiders/fx_today_data.py", "-o", "/tmp/fx_today_data.json"])

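But this fails with the usual "scrapy" command not found error, because I have not set up a virtualenv in the Lambda function (and I don't know whether it is worth setting one up just for this).

Does anyone know how to run 2 Scrapy spiders in one process (they don't have to run at the same time) and have them output to separate files?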

Asked by Jamie

Jamie 2020-02-01 00:18

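Thanks to Corentin and this guide, I was able to get it working.

By giving each spider its own custom_settings class attribute, I could run both from a single CrawlerProcess without worrying, because each spider has its own file output.

The final code looks a lot like the docs example I included in the question.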

To make sure I avoided the ReactorNotRestartable error, I also had to use from multiprocessing.context import Process, together with a try block that terminates the process (even before it has been assigned!).
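The idea can be sketched roughly as follows. This is an assumption about the shape of the fix, not the author's code. The crawl runs in a child process, so every invocation gets a fresh Twisted reactor, and the child is terminated if it is still alive afterwards. The spiders module, the run_in_child helper, and the handler's return value are hypothetical. The sketch uses the "fork" start method, which assumes a Linux runtime.

```python
import multiprocessing


def crawl_both():
    # Imported inside the child so the reactor is created there,
    # not in the long-lived parent process.
    from scrapy.crawler import CrawlerProcess
    from spiders import MySpider1, MySpider2  # hypothetical module

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()


def run_in_child(target, timeout=None):
    """Run target in a separate process and return its exit code."""
    ctx = multiprocessing.get_context("fork")  # assumes Linux
    child = ctx.Process(target=target)
    child.start()
    try:
        child.join(timeout)
    finally:
        # Never leave a crawl running after the handler returns.
        if child.is_alive():
            child.terminate()
            child.join()
    return child.exitcode


def lambda_handler(event, context):
    # Each invocation crawls in a new process, so the reactor
    # is never restarted within an existing one.
    return {"exit_code": run_in_child(crawl_both)}
```

An exit code of 0 means the child finished normally. A negative value means it was killed by a signal, for example after the timeout expired.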
