Normally we run Scrapy from the command line with "scrapy crawl xxxxx", but the framework already supports launching a crawl from a script, so a single .py file can be run as "python3 xxx.py". See the official documentation.
- Run a single spider (you can make the changes directly in the spider you wrote)
Code from the official docs:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Explanation
In essence, you either move everything inside the class of your core spider file into this script, or add everything outside the class (the CrawlerProcess part) to your MySpider.py, and then run "python3 MySpider.py" in that directory. Note that items.py must also be importable from the current directory, otherwise you will get an error.
PS: Follow the official docs for the latest Scrapy version, otherwise you will hit twisted.internet.error.ReactorNotRestartable.
One problem I have not solved yet: how to get the data into a JSON file. The pipeline does not take effect when Scrapy is run from a script; I will come back and fill this in once I find the cause (a possible workaround is sketched right below).
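In the meantime, a hedged workaround: recent Scrapy versions (2.1+) accept a FEEDS setting in the dict passed to CrawlerProcess, which exports items to JSON without going through a pipeline at all. A minimal self-contained sketch; the spider name, start URL, and output filename here are all placeholders, not part of the original project:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com']  # placeholder site

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

process = CrawlerProcess({
    # FEEDS needs Scrapy 2.1+; older versions use FEED_FORMAT / FEED_URI instead
    'FEEDS': {'items.json': {'format': 'json', 'encoding': 'utf8'}},
})
process.crawl(MySpider)
process.start()  # items.json is written in the current working directory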
- If you are inside a Scrapy project, there are additional helpers you can use to import these components within the project. You can simply pass your spider's name to CrawlerProcess, and use get_project_settings to get a Settings instance populated from your project settings.
In other words, get_project_settings makes your settings.py take effect in the script, which is very useful for the data-processing side. The code below goes into a new script created in the project root, next to scrapy.cfg.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
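Note that the extra keyword arguments to process.crawl (domain='scrapinghub.com' above) are forwarded to the spider's constructor. A sketch of how a spider named 'followall' might pick them up; the class name, URL scheme, and parse body are my assumptions, not the project's actual code:

import scrapy

class FollowAllSpider(scrapy.Spider):
    name = 'followall'

    def __init__(self, domain=None, **kwargs):
        super().__init__(**kwargs)
        # keyword arguments given to process.crawl() arrive here
        self.start_urls = [f'http://{domain}']

    def parse(self, response):
        self.logger.info('visited %s', response.url)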
- Another Scrapy utility gives finer control over the crawling process: scrapy.crawler.CrawlerRunner. Again, you can make the changes directly in your own spider file.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
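The payoff of CrawlerRunner is that you drive the reactor yourself, for example to chain several crawls in sequence. A sketch following the pattern in the official docs; the two spider classes are throwaway placeholders:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class Spider1(scrapy.Spider):
    name = 'spider1'  # placeholder spiders; substitute your own
    start_urls = ['http://example.com']

class Spider2(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://example.org']

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits until the previous crawl has finished
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until both crawls finish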
===================================================
After fiddling with this for a day or two, I can finally get output through the pipeline.
The directory layout is as follows:
Scraper/
    scrapy.cfg
    ScrapyScript.py
    Scraper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            my_spider.py
All you need is a new script in the directory containing the config file (ScrapyScript.py above):
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from XXX.spiders.XXX import plaSpider  # import path of the spider file you wrote

# Point Scrapy at the project settings module before reading the settings,
# so that the project's items, pipelines, etc. are picked up
os.environ['SCRAPY_SETTINGS_MODULE'] = 'wcc_pla.settings'
settings = get_project_settings()

crawler = CrawlerProcess(settings)
crawler.crawl(plaSpider)
crawler.start()  # blocks here until the crawl finishes; CrawlerProcess runs the reactor itself
The output JSON file will then end up in the same directory as ScrapyScript.py (the pipeline opens its file with a relative path, which resolves against the directory you run the script from).
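For reference, a minimal sketch of the kind of pipeline that produces that file, modeled on the JsonWriterPipeline example in the Scrapy docs; the filename is an assumption, and the class must be enabled under ITEM_PIPELINES in settings.py:

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # relative path: the file is created in the current working directory
        self.file = open('items.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line (JSON Lines)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item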