Scrapy: How do I run a spider two or more times from another Python script?

March 19, 2023 · 273 views

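Scrapy version: 1.0.5

I have been searching for a long time, but most of the workarounds I found do not work with the current Scrapy version.

My spider is defined in jingdong_spider.py, and the interface for running it (which I put together from the Scrapy documentation) is as follows: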

# interface
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def search(keyword):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(JingdongSpider, keyword)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished

Then, in temp.py, I call search(keyword) from above to run the spider.

The problem is this: calling search(keyword) once works fine, but calling it twice fails. For example,

in temp.py:

search('iphone')
search('ipad2')

it reports:

Traceback (most recent call last):
  File "C:/Users/jiahao/Desktop/code/bbt_climb_plus/temp.py", line 7, in <module>
    search('ipad2')
  File "C:\Users\jiahao\Desktop\code\bbt_climb_plus\bbt_climb_plus\spiders\jingdong_spider.py", line 194, in search
    reactor.run() # the script will block here until the crawling is finished
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

The first search(keyword) call succeeds, but the second one fails.

Can you help?

Best answer

In your code example, every call to search() starts the Twisted reactor. That cannot work, because there is only one reactor per process and it cannot be started twice.

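There are two ways to solve this, both described in the Scrapy documentation. Either keep using CrawlerRunner but move reactor.run() out of search() so that it is called only once, or use CrawlerProcess and simply call crawler_process.start(). The second approach is easier; your code would look like this: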

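from scrapy.crawler import CrawlerProcess
from dirbot.spiders.dmoz import DmozSpider

def search(runner, keyword):
    # Only schedule the crawl; the reactor is not started here.
    return runner.crawl(DmozSpider, keyword)

runner = CrawlerProcess()
search(runner, "alfa")
search(runner, "beta")
# Start the reactor once; this blocks until all scheduled crawls finish.
runner.start()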