Using scrapy rules

  • The general logic of a crawler is: given a start page, fetch it, extract all the other links the page contains, put those links into a queue, then visit the queued pages one by one until a boundary condition ends the crawl (a generic sketch of this loop follows the snippet below). To handle the list-page + detail-page pattern, the link extractor's logic has to be constrained. Fortunately Scrapy already provides for this; the key is knowing the interface exists and applying it flexibly:
# Note: the original post used SgmlLinkExtractor, which was deprecated in
# Scrapy 1.0 and later removed; LinkExtractor is its replacement.
rules = (
    # List-page pagination links: followed, but no callback
    Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',),
                       restrict_xpaths=("//div[@class='left']",))),
    # Detail pages: each response is parsed by parse_item
    Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',),
                       restrict_xpaths=("//div[@class='left']",)),
         callback='parse_item'),
)
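
For reference, the queue-based loop described in the first bullet looks roughly like this (a minimal sketch; fetch, extract_links and should_stop are placeholder helpers, not Scrapy APIs):

from collections import deque

def crawl(start_url, fetch, extract_links, should_stop):
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        page = fetch(url)                    # download the page
        for link in extract_links(page):     # collect its outgoing links
            if link not in seen and not should_stop(link):
                seen.add(link)
                queue.append(link)           # visit later, breadth-first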

Explanation:

  • Parameter meanings
  • Rule defines a rule for extracting links. The two rules above match the paginated list pages and the detail pages respectively; the key point is using restrict_xpaths so that the links to crawl next are extracted only from a specific part of the page.
  • The rules attribute of CrawlSpider extracts URLs straight from the response objects returned for the start URLs, automatically creates new requests for them, and hands the responses of those requests to the rule's callback for parsing.
  • What follow does
    First: this is my rule for crawling Douban's new books: rules = (Rule(LinkExtractor(allow=(r'^https://book.douban.com/subject/[0-9]*/',)), callback='parse_item', follow=False),). Under this rule, only the links on the start page (start_urls) that match the pattern are crawled. If I change follow to True, the spider keeps looking for matching URLs inside the pages it crawls, and so on recursively, until the whole site has been crawled.
  • CrawlSpider overrides the parse method, so every response returned by the automatically created requests is parsed by parse. Whether or not a rule has a callback, it is handled by the same _parse_response method, which simply checks whether follow and callback are set (see the sketch below). This is also why a CrawlSpider callback must never be named parse.
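
To make that concrete, here is a simplified sketch of what _parse_response does (paraphrased from Scrapy's CrawlSpider source; exact names and signatures vary between versions):

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # 1. If the rule has a callback, run it and yield its items/requests
    if callback:
        for item_or_request in callback(response, **cb_kwargs) or ():
            yield item_or_request
    # 2. If follow is enabled, apply the rule's LinkExtractor to the
    #    response and yield a new Request for every extracted link
    if follow and self._follow_links:
        for request in self._requests_to_follow(response):
            yield request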

Example

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ToscrapeRuleSpider(CrawlSpider):
    name = 'toscrape-rule'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        # FEED_FORMAT/FEED_URI were current when this was written (Scrapy 1.x);
        # newer Scrapy versions replace them with the single FEEDS setting.
        'FEED_FORMAT': 'json',
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'rule1.json'
    }
    # must be a list (or tuple) of Rule objects
    rules = [
        # follow=False (don't follow): extract only the matching URLs from the
        # start page, crawl those pages, and let the callback parse them
        # follow=True (follow links): keep extracting matching URLs from the
        # crawled pages as well, recursively, until the whole site is covered
        Rule(LinkExtractor(allow=(r'/page/',), deny=(r'/tag/',)), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a/text()').extract()
            }
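
Assuming the file sits inside a Scrapy project, the spider is started with "scrapy crawl toscrape-rule". Alternatively it can run as a standalone script via CrawlerProcess (a minimal sketch, with the class above in scope):

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess()         # default settings; custom_settings still applies
    process.crawl(ToscrapeRuleSpider)  # schedule the spider defined above
    process.start()                    # block until the crawl finishes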
  • Result (follow=True): all of the index pages are crawled
2018-07-14 22:36:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:36:41 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/3/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/1/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/4/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/5/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/6/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/7/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/8/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/9/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/10/
2018-07-14 22:36:44 [scrapy.core.engine] INFO: Closing spider (finished)
  • Result (follow=False): only /page/2/ is crawled, because that is the only matching link extracted from the start page
2018-07-14 22:44:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:44:08 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:44:08 [scrapy.core.engine] INFO: Closing spider (finished)


    Original author: seven1010
    Original article: https://www.jianshu.com/p/da992f153bdb
    This article is reposted from the web to share knowledge; if it infringes any rights, please contact the blog owner for removal.