Scrapy-9. Common Questions

August 3, 2023. Source: 王南北丶

Article URL: https://www.jianshu.com/p/779c793cabee

CrawlerProcess

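Sometimes a Scrapy spider needs to be started from Python code rather than from the command line, or several spiders need to run at the same time. Scrapy's CrawlerProcess handles both cases.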

With CrawlerProcess, the spider is launched by running the script itself, so the scrapy crawl command is no longer needed.

The following example runs a single spider:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

# Create a CrawlerProcess; settings can be passed in as a Settings object or, as here, a plain dict
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Schedule the spider class on the CrawlerProcess
process.crawl(MySpider)
# Start the CrawlerProcess and begin crawling;
# this call blocks here until the spider has finished
process.start()

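Scrapy also offers a convenient way to run a spider that is defined in another file of the project: CrawlerProcess can look the spider up by name.

This makes it easy to keep the spider definitions and the code that runs them in two separate modules.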

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Pass the result of get_project_settings() when creating the CrawlerProcess
process = CrawlerProcess(get_project_settings())

# crawl() can then be given the spider's name directly; 'followall' here is the name of a spider in the project
process.crawl('followall', domain='scrapinghub.com')
process.start()

The following example runs multiple spiders in the same process:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Definition of the first spider
    ...

class MySpider2(scrapy.Spider):
    # Definition of the second spider
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
