Scrapy_note03

June 11, 2019 · Source: thinkando

Introduction

This chapter shows how to download files with a Scrapy crawler.

Project requirement:
- Download the source files of all the examples on http://matplotlib.org to the local machine

01 Page analysis

011 Analyzing the links

$ scrapy shell http://matplotlib.org/examples/index.html
>>> view(response)

Inspecting the page shows that every link sits inside an <li class="toctree-l2"> element under <div class="toctree-wrapper compound">.
Extracting with that restriction yields 506 links:

>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
>>> links = le.extract_links(response)
>>>[link.url for link in links]
>>> len(links)
506

012 Analyzing an example page

>>> fetch('http://matplotlib.org/examples/animation/animate_decay.html')
>>> view(response)


Extract the download URL of the source file:

>>> href=response.css('a.reference.external::attr(href)').extract_first()
>>> href
'animate_decay.py'
>>> response.urljoin(href)
'https://matplotlib.org/examples/animation/animate_decay.py'
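The urljoin call above turns the relative href into an absolute URL by resolving it against the URL of the page it was found on. A minimal sketch of the same resolution with the standard library's urllib.parse.urljoin, assuming the page has no <base> tag that would change the base URL and that the page was served over https after a redirect (page_url and resolved are illustrative names):

```python
from urllib.parse import urljoin

# URL of the example page (assumed to be the https address after redirection)
page_url = 'https://matplotlib.org/examples/animation/animate_decay.html'

# The relative href extracted from the page in the shell session
href = 'animate_decay.py'

# Resolve the relative link against the page URL
resolved = urljoin(page_url, href)
print(resolved)
```

Only the last path segment of the page URL is replaced, which is why the .py file ends up in the same directory as the .html page.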

02 Implementation

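There are four steps:
1 Create a Scrapy project and use the scrapy genspider command to create the Spider
2 Enable FilesPipeline in the settings file and specify the download directory
3 Implement ExampleItem
4 Implement ExamplesSpider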

1 Create a Scrapy project and use the scrapy genspider command to create the Spider

$ scrapy startproject matplotlib_examples
$ cd matplotlib_examples
$ scrapy genspider examples matplotlib.org

2 Enable FilesPipeline in the settings file (settings.py) and specify the download directory

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline':1,
}
FILES_STORE = 'examples_src'

3 Implement ExampleItem

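FilesPipeline reads the URLs to download from the file_urls field and writes the download results to the files field, so the item needs both. Add the following code to items.py: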

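import scrapy

class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs to download, filled in by the spider
    files = scrapy.Field()      # download results, filled in by FilesPipeline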

4 Implement ExamplesSpider (examples.py)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import ExampleItem

class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['matplotlib.org']
    # Starting point of the crawl
    start_urls = ['http://matplotlib.org/examples/index.html']

    # Extract the link to each example page and submit a Request for it
    def parse(self, response):
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound', deny=r'/index\.html$')
        links = le.extract_links(response)
        print(len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    # Parse an example page and extract the source download URL
    def parse_example(self, response):
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example
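The deny argument in the spider keeps the per-category index.html pages out of the crawl, so only example pages are requested. As far as I know, LinkExtractor tests each deny pattern against the absolute URL with a regular-expression search; the sketch below reproduces that filtering with the re module alone. The candidate URLs are hypothetical and chosen for illustration, and the pattern is written with the dot escaped.

```python
import re

# Same idea as the spider's deny argument: reject URLs ending in /index.html
deny_pattern = re.compile(r'/index\.html$')

# Hypothetical candidate URLs, for illustration only
candidates = [
    'https://matplotlib.org/examples/animation/index.html',
    'https://matplotlib.org/examples/animation/animate_decay.html',
    'https://matplotlib.org/examples/api/index.html',
    'https://matplotlib.org/examples/api/power_norm_demo.html',
]

# Keep only the URLs that the deny pattern does not match
kept = [url for url in candidates if not deny_pattern.search(url)]
for url in kept:
    print(url)
```

Without the backslash, the dot would match any character, so escaping it makes the pattern match a literal ".html" suffix only.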

03 Results

$ scrapy crawl examples -o examples.json

Information about the downloaded files:

$ cat examples.json

{"file_urls": ["https://matplotlib.org/examples/animation/animate_decay.py"], "files": [{"url": "https://matplotlib.org/examples/animation/animate_decay.py", "path": "full/769c3346594cdc7614da607216022177d1834c84.py", "checksum": "444b19b47ac56a3680e1a9f801fd612d"}]},
{"file_urls": ["https://matplotlib.org/mpl_examples/api/power_norm_demo.py"], "files": [{"url": "https://matplotlib.org/mpl_examples/api/power_norm_demo.py", "path": "full/db82afab30511b0044a0669a090b72ee2a4aa245.py", "checksum": "e88adcdedb8f1dbfa6a77f80aea9d1d6"}]},

$ ls examples_src/full


These hash-based file names are unreadable, so we need some code that saves the files under meaningful names.
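The 40-character names look like SHA-1 digests. My understanding is that the default FilesPipeline names each file after the SHA-1 hash of its URL, keeps the extension, and stores it under full/; the exact rule may differ between Scrapy versions, so the function below is an approximation for illustration and not Scrapy's own code (default_style_path is a hypothetical name).

```python
import hashlib
import os

def default_style_path(url):
    """Approximate the default naming: full/<SHA-1 of the URL><extension>."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    extension = os.path.splitext(url)[1]
    return 'full/' + digest + extension

path = default_style_path('https://matplotlib.org/examples/animation/animate_decay.py')
print(path)
```

A name built this way is unique per URL but carries no hint of the original file name or category, which is the problem addressed next.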

Implement the following code in pipelines.py:

from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
from os.path import basename,dirname,join

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        path = urlparse(request.url).path
        return join(basename(dirname(path)),basename(path))
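To see what the overridden file_path returns, the same path logic can be run outside Scrapy. The sketch below is a standalone function (category_path is a hypothetical name) that uses posixpath in place of os.path so the printed result is the same on every operating system; the URLs are the two from the examples.json output.

```python
from posixpath import basename, dirname, join
from urllib.parse import urlparse

def category_path(url):
    """Return '<parent directory>/<file name>' for a download URL."""
    path = urlparse(url).path
    return join(basename(dirname(path)), basename(path))

first = category_path('https://matplotlib.org/examples/animation/animate_decay.py')
second = category_path('https://matplotlib.org/mpl_examples/api/power_norm_demo.py')
print(first)
print(second)
```

Only the last directory of the URL path is kept, so files from /examples/... and /mpl_examples/... land in the same category folder.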

Update the settings file to use MyFilesPipeline in place of FilesPipeline:

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline':1,
    'matplotlib_examples.pipelines.MyFilesPipeline':1,
}
FILES_STORE = 'examples_src'

Delete the previously downloaded files and run the crawler again:

$ rm -r examples_src/full
$ rm examples.json 
$ scrapy crawl examples -o examples.json
$ tree examples_src


In the end, 507 files are downloaded into 26 directories, grouped by category.

    Original author: thinkando
    Original URL: https://www.jianshu.com/p/222e673e20a9