Using Item and ItemPipeline

March 16, 2023 · 319 views

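In the previous post, the spider yielded each result as a plain dict. A dict has no declared structure, so nothing guarantees that every place that yields a result returns the same set of fields. Scrapy therefore provides the Item class for declaring the structure of scraped data. Item exposes a dict-like interface, so it is easy to use.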

Item

Every custom data structure involves two classes:

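scrapy.Item: the base class;
scrapy.Field: declares which fields the custom data contains and nothing more; it is only a container for field metadata and has no behavior of its own.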

For example, continuing the example from the previous post, the data structure for scraping http://quotes.toscrape.com/ can be defined as follows:

import scrapy

class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()
    tags = scrapy.Field()

This style of declaration is very similar to Django models.
By the way, scrapy.Field() accepts a serializer argument, which Item Exporters use when exporting data; this is covered later.

Common Item usage

Item provides a dict-like API, so most of its usage is the same as dict.

class BookItem(scrapy.Item):
    name = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()

# Create Item objects
>>> book = BookItem(name = 'Scrapy book',author = 'Tom', price = 10)
>>> book2 = BookItem({'name' : 'Python book', 'author' : 'John'}) # create an Item from a dict
# Access values by key
>>> book['name']
'Scrapy book'
>>> 'name' in book
True
>>> book2 = BookItem(name = 'Python book', author = 'John')
>>> 'price' in book2        # has price been assigned a value?
False
>>> 'price' in book2.fields     # is price a declared field?
True
# Assign a value
>>> book2['price'] = 12
>>> book2
{'author': 'John', 'name': 'Python book', 'price': 12}
# Access the keys that have been assigned a value
>>> book2.keys()
dict_keys(['name', 'author', 'price'])
>>> book2.items()
ItemsView({'author': 'John', 'name': 'Python book', 'price': 12})
>>> book1_copy = book.copy()
>>> book1_copy
{'author': 'Tom', 'name': 'Scrapy book', 'price': 10}

Using Item in a spider

The code from the previous post, updated to use the Item, looks like this:

# -*- coding: utf-8 -*-
import scrapy
from ..items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    # def start_requests(self):
    #     url = "http://quotes.toscrape.com/"
    #     yield scrapy.Request(url, callback = self.parse)

    def parse(self, response):
        quote_selector_list = response.css('body > div > div:nth-child(2) > div.col-md-8 div.quote')

        for quote_selector in quote_selector_list:
            quote = quote_selector.css('span.text::text').extract_first()
            author = quote_selector.css('span small.author::text').extract_first()
            tags = quote_selector.css('div.tags a.tag::text').extract()

            yield QuoteItem({'quote':quote, 'author':author, 'tags':tags})

        next_page_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)

            yield scrapy.Request(next_page_url, callback = self.parse)

ItemPipeline

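Once a spider has scraped an item, the item is sent to the ItemPipeline. In Scrapy, ItemPipelines are the components that process data: each one receives the item as an argument and performs some processing on it.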

Typical uses of an ItemPipeline:

cleaning dirty data;
validating the data;
removing duplicates;
saving items to a database, i.e. persistent storage.

How to implement an ItemPipeline

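ItemPipelines are defined in pipelines.py. An ItemPipeline does not need to inherit from any particular base class; it only needs to implement the following method:
process_item(self, item, spider): required. This method is called for every item the spider yields. If it returns a dict or Item, the returned data is passed on to the next pipeline (if there is one); if it raises a DropItem exception, the item is neither processed further nor exported. Raising DropItem is the usual way to discard invalid data or filter out unwanted items;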

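The other methods are optional:
open_spider(self, spider): called when the spider is opened (before crawling starts). It is typically used for initialization that must happen before crawling, such as opening a database connection;
close_spider(self, spider): called when the spider is closed (after crawling finishes). It is typically used for cleanup after crawling, such as closing a database connection;
from_crawler(cls, crawler): a class method that returns an ItemPipeline object. If it is defined, Scrapy creates the ItemPipeline object through it. It usually reads the project configuration from crawler.settings and builds the object accordingly.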
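To make the process_item contract concrete, here is a minimal sketch of a pipeline that cleans, validates, and de-duplicates items. The class name CleanAndDedupPipeline and the sample items are made up for this illustration. DropItem is imported from scrapy.exceptions; the fallback class is only a stand-in so the sketch can also run where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in for illustration only, used when Scrapy is not installed.
    class DropItem(Exception):
        pass


class CleanAndDedupPipeline:
    """Hypothetical pipeline: strips whitespace, validates, drops duplicates."""

    def __init__(self):
        self.seen_quotes = set()

    def process_item(self, item, spider):
        quote = (item.get('quote') or '').strip()
        if not quote:
            # Validation: an item without quote text is useless.
            raise DropItem('missing quote text')
        if quote in self.seen_quotes:
            # De-duplication: the same quote was already processed.
            raise DropItem('duplicate quote')
        self.seen_quotes.add(quote)
        item['quote'] = quote  # cleaning: store the stripped text
        return item            # hand the item to the next pipeline


# Drive the pipeline by hand, roughly the way Scrapy would call it per item.
pipeline = CleanAndDedupPipeline()
samples = [
    {'quote': '  Hello  ', 'author': 'A'},
    {'quote': 'Hello', 'author': 'A'},
    {'quote': '', 'author': 'B'},
]
kept = []
for raw in samples:
    try:
        kept.append(pipeline.process_item(raw, spider=None))
    except DropItem:
        pass  # Scrapy would log the drop and stop processing this item
```

Only the first sample survives: the second is a duplicate once whitespace is stripped, and the third has no quote text.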

Below we implement an ItemPipeline that saves quotes to a local file, to see how a custom ItemPipeline is put together:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import NotConfigured
import json
import scrapy

class SaveFilePipeline(object):

    def __init__(self, file_name = None):
        if file_name is None:
            raise NotConfigured
        self.file_name = file_name
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.file_name, 'w')

    def close_spider(self, spider):
        self.fp.close()

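    def process_item(self, item, spider):
        # Write one JSON object per line
        json_item = json.dumps(dict(item))
        self.fp.write(json_item + "\n")
        # Return the item so that later pipelines and exporters still receive it
        return item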

    @classmethod
    def from_crawler(cls, crawler):
        file_name = crawler.settings.get('FILE_NAME')
        # file_name = scrapy.conf.settings['FILE_NAME']  # legacy alternative from older Scrapy versions; prefer crawler.settings
        return cls(file_name)

Enabling the ItemPipeline

Add the following to settings.py:

ITEM_PIPELINES = {
    'newproject.pipelines.SaveFilePipeline': 300,
}
FILE_NAME = 'save_result.json'

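Here ITEM_PIPELINES is a dict: each key is the import path of an ItemPipeline class to enable, and each value is its priority. ItemPipelines are called in priority order; the smaller the value, the higher the priority, so that pipeline runs earlier.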
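The ordering rule can be sketched in plain Python. This is not Scrapy's internal code, only an illustration of "lowest value first"; the ValidatePipeline entry is a hypothetical second pipeline added to show the effect.

```python
# Hypothetical settings with two pipelines and different priority values.
ITEM_PIPELINES = {
    'newproject.pipelines.SaveFilePipeline': 300,
    'newproject.pipelines.ValidatePipeline': 100,
}

# Sort by priority value, lowest first: this is the order items pass through.
call_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
```

With these values an item is validated first and written to the file only if validation lets it through, which is why storage pipelines usually get the larger number.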

Summary

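This post covered how to declare the structure of scraped data and how to use an ItemPipeline to save that data, along with how ItemPipelines work. The next post looks at the built-in ItemPipelines, FilesPipeline and ImagesPipeline.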