Scrapy Deduplication

I. Built-in

1. Module

from scrapy.dupefilters import RFPDupeFilter

2. RFPDupeFilter methods

a. request_seen

Core: every time the spider yields a Request object, request_seen is executed once.

Purpose: deduplication; the same URL is visited only once.

Implementation: turn the URL into a fixed-length, unique fingerprint. If the fingerprint is already in the set, return True to indicate the URL has been visited; otherwise add it to the set.

1) request_fingerprint

Purpose: turn a request (URL) into a fixed-length, unique value. A plain md5 of the raw URL string would treat the two URLs below as different, whereas request_fingerprint canonicalizes the URL (including query-parameter order) so they produce the same fingerprint.

Note: request_fingerprint() only accepts a Request object.

from scrapy.utils.request import request_fingerprint
from scrapy.http import Request

# two URLs that differ only in query-parameter order
url1 = 'https://test.com/?a=1&b=2'
url2 = 'https://test.com/?b=2&a=1'
request1 = Request(url=url1)
request2 = Request(url=url2)

# request_fingerprint() only accepts Request objects
rfp1 = request_fingerprint(request=request1)
rfp2 = request_fingerprint(request=request2)
print(rfp1)
print(rfp2)

if rfp1 == rfp2:
    print('same URL')
else:
    print('different URLs')
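To see why a plain hash of the raw URL string is not enough, here is a minimal contrast sketch using hashlib.md5 on the two URL strings; because the query parameters appear in a different order, the digests differ, while request_fingerprint treats the two requests as the same:

import hashlib

url1 = 'https://test.com/?a=1&b=2'
url2 = 'https://test.com/?b=2&a=1'

# md5 of the raw strings: parameter order changes the digest,
# so two semantically identical URLs hash to different values
md5_1 = hashlib.md5(url1.encode('utf-8')).hexdigest()
md5_2 = hashlib.md5(url2.encode('utf-8')).hexdigest()
print(md5_1 == md5_2)  # False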

2) request_seen

def request_seen(self, request):
    # request_fingerprint turns the request (URL) into a unique, fixed-length value
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True        # True means this request has already been seen
    self.fingerprints.add(fp)
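For context, this is roughly how Scrapy's scheduler consults the dupefilter before queueing a request (a simplified sketch of Scheduler.enqueue_request, not the exact source):

def enqueue_request(self, request):
    # drop the request if it is not exempt (dont_filter=False)
    # and the dupefilter has already seen it
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    # ...otherwise push the request onto the queue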

b. open

A method inherited from the parent class BaseDupeFilter; executed when the spider starts.

def open(self):
    # called when the spider starts
    pass

c. close

Executed when the spider finishes.

def close(self, reason):
    # called when the spider is closed
    pass
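If the fingerprints should survive restarts, open and close are convenient hooks for loading and saving them. The sketch below is a custom illustration (the class name and the file name seen_fp.txt are assumptions), not the built-in RFPDupeFilter behavior:

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class PersistentDupeFilter(BaseDupeFilter):
    # illustrative only: persists fingerprints to a local text file
    def __init__(self, path='seen_fp.txt'):
        self.path = path
        self.fingerprints = set()

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)

    def open(self):
        # load fingerprints saved by a previous run, if any
        try:
            with open(self.path) as f:
                self.fingerprints.update(line.strip() for line in f)
        except FileNotFoundError:
            pass

    def close(self, reason):
        # persist the fingerprints collected during this run
        with open(self.path, 'w') as f:
            f.writelines(fp + '\n' for fp in self.fingerprints)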

d. log

Logs duplicate (filtered) requests.

def log(self, request, spider):
    # log that a request has been filtered
    pass
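To make the filter report what it drops, log can write to the spider's logger; a minimal illustrative implementation:

def log(self, request, spider):
    # emit a debug message for each filtered duplicate request
    spider.logger.debug('Filtered duplicate request: %s', request)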

e. from_settings

Principle and purpose: the same as from_crawler in pipelines; it builds an instance of the filter class from the project settings.

@classmethod
def from_settings(cls, settings):
    return cls()
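from_settings can also read values from settings.py before constructing the filter. For example (DUPEFILTER_DEBUG is an existing Scrapy setting; passing it to __init__ assumes your filter's constructor accepts a debug argument):

@classmethod
def from_settings(cls, settings):
    # read a boolean option from the project settings and hand it to the constructor
    debug = settings.getbool('DUPEFILTER_DEBUG')
    return cls(debug=debug)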

II. Custom


1. Configuration file (settings.py)

# built-in default
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_CLASS = 'toscrapy.dupefilters.MyDupeFilter'

2. Custom dedupe filter class (inherits from BaseDupeFilter)

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class MyDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.visited_fp = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # if the fingerprint of this request is already in the set, return True
        # (already visited); otherwise record the fingerprint and let the request through
        fp = request_fingerprint(request)
        if fp in self.visited_fp:
            return True
        self.visited_fp.add(fp)

    def open(self):  # can return deferred
        print('spider opened')

    def close(self, reason):  # can return a deferred
        print('spider closed')

    def log(self, request, spider):  # log that a request has been filtered
        pass
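A quick way to sanity-check the class outside of a full crawl is to call request_seen directly with two equivalent requests; the second call should return True:

from scrapy.http import Request

df = MyDupeFilter()
r1 = Request('https://test.com/?a=1&b=2')
r2 = Request('https://test.com/?b=2&a=1')
print(df.request_seen(r1))  # None: first time seen, not filtered
print(df.request_seen(r2))  # True: same fingerprint, filtered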

3. Prerequisite

When yielding Request objects, leave dont_filter at its default:

yield scrapy.Request(url=_next, callback=self.parse, dont_filter=False)

dont_filter must not be True (the default is False); with dont_filter=True the request bypasses the dedupe filter entirely.
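For a fuller picture, here is a minimal spider sketch (the spider name and the link-extraction selector are assumptions) where requests keep the default dont_filter=False, so repeated links are dropped by the configured dupefilter:

import scrapy


class DemoSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'demo'
    start_urls = ['https://test.com/']

    def parse(self, response):
        for _next in response.css('a::attr(href)').getall():
            # default dont_filter=False: duplicate URLs are filtered out
            # by the configured dupefilter (MyDupeFilter above)
            yield scrapy.Request(url=response.urljoin(_next), callback=self.parse)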
