Minimal Scrapy Spider 2: Crawling Multiple Pages

March 15, 2023. Source: Tim_Lee

Environment:
* Python 2.7.12  
* Scrapy 1.2.2
* Mac OS X 10.10.3 Yosemite

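This post continues to crawl the practice site provided in the Scrapy 1.2.2 documentation:

"http://quotes.toscrape.com"

There is no need to worry about the spider being banned for now, which makes the site well suited to beginner practice.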

Goal

Scrape the quote, author, and tags of every quote on every page of the site.

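What's New

response.urljoin(): joins a relative URL with the response's URL to produce an absolute URL.
scrapy.Request(): issues a request. Its callback= parameter names the function that parses the returned response.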

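Step 1: Extend the spider code

Reuse the code from the first minimal spider and add just three lines. The spider's name is also changed to name = 'quotes_2_2' to distinguish it from the first spider.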

        next_page = response.css('li.next a::attr("href")').extract_first()
        next_full_url = response.urljoin(next_page)
        yield scrapy.Request(next_full_url, callback=self.parse)

The complete code is as follows:



    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes_2_2'
        start_urls = [
            'http://quotes.toscrape.com',
        ]
        allowed_domains = [
            'toscrape.com',
        ]
    
        def parse(self, response):
            # Each quote on the page sits in its own <div class="quote">.
            for quote in response.css('div.quote'):
                yield {
                    'quote': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            # Follow the "Next" link and parse that page with this same method.
            next_page = response.css('li.next a::attr("href")').extract_first()
            next_full_url = response.urljoin(next_page)
            yield scrapy.Request(next_full_url, callback=self.parse)

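Analysis

First, find the URL that leads to the next page. Inspecting the page with the developer tools in Chrome, or with Firebug in Firefox, shows that the element behind the "Next →" button contains the next page's URL. The relevant HTML fragment is: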

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>

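Scrapy's CSS selector li.next a locates this element: it means the a tag inside the li tag whose class is next.

Next, extract the URL from this a tag. The URL is the href="/page/2/" attribute, which is extracted with a::attr("href") and assigned to the next_page variable. (Note: text content is extracted with ::text rather than ::attr(); an image's src link would be extracted with img::attr(src).)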

            next_page = response.css('li.next a::attr("href")').extract_first()

The selector returns only a relative URL, /page/2/. scrapy.Request() needs an absolute URL, so use response.urljoin() to convert the relative URL into an absolute one.

            next_full_url = response.urljoin(next_page)
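
response.urljoin() resolves the link against the URL of the current response, following the same rules as the standard library's urljoin. The sketch below uses the standard library function on its own to show how different kinds of links resolve; the base URL and the example links are assumptions chosen for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Assumed URL of the page currently being parsed.
base = 'http://quotes.toscrape.com/page/1/'

# A link starting with "/" is resolved against the site root.
root_relative = urljoin(base, '/page/2/')

# A link without a leading "/" is resolved against the current directory.
dir_relative = urljoin(base, 'page/2/')

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(base, 'http://example.com/x')

print(root_relative)
print(dir_relative)
print(already_absolute)
```

The pager on the practice site uses the first form, so the result is http://quotes.toscrape.com/page/2/.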

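Finally, issue a request for the next page's URL with yield scrapy.Request(). Passing callback=self.parse in the arguments tells Scrapy to call parse() again on the new response and extract the needed content from it.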

            yield scrapy.Request(next_full_url, callback=self.parse)
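
The way parse() keeps calling itself through the callback can be hard to picture. The following is a rough offline simulation, not Scrapy's actual engine: FAKE_SITE, FakeRequest, and crawl are hypothetical stand-ins written only to show how yielded items are collected while yielded requests are queued and handed back to the callback.

```python
from collections import deque

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Hypothetical three-page site: URL -> (quotes on the page, relative next link or None).
FAKE_SITE = {
    'http://example.test/': (['q1', 'q2'], '/page/2/'),
    'http://example.test/page/2/': (['q3', 'q4'], '/page/3/'),
    'http://example.test/page/3/': (['q5'], None),
}


class FakeRequest(object):
    """Stand-in for scrapy.Request: a URL plus the callback that handles it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url):
    """Mimic the spider's parse(): yield items, then a request for the next page."""
    quotes, next_href = FAKE_SITE[url]
    for text in quotes:
        yield {'quote': text}
    # The last page has no next link, so no further request is yielded.
    if next_href is not None:
        yield FakeRequest(urljoin(url, next_href), callback=parse)


def crawl(start_url):
    """Tiny scheduler loop: store dict items, enqueue yielded requests."""
    items = []
    pending = deque([FakeRequest(start_url, callback=parse)])
    while pending:
        request = pending.popleft()
        for result in request.callback(request.url):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                items.append(result)
    return items


all_items = crawl('http://example.test/')
print(len(all_items))
```

Each page contributes its items first and then at most one new request, so the crawl walks the pages in order and stops on the page without a next link.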

Step 2: Run the spider

Run the spider with the scrapy crawl command:

$ scrapy crawl quotes_2_2 -o results_2_2_01.json

The result is 100 quotes in total.
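
One way to confirm the count is to load the exported file and inspect it. This is a small sketch under the assumption that the export is a JSON array of objects with the quote, author, and tags keys used by the spider; load_items and summarize are hypothetical helper names.

```python
import io
import json


def load_items(path):
    """Read the JSON array written by `scrapy crawl ... -o <file>.json`."""
    with io.open(path, encoding='utf-8') as f:
        return json.load(f)


def summarize(items):
    """Return (number of items, sorted list of distinct authors)."""
    authors = sorted(set(item['author'] for item in items))
    return len(items), authors
```

For example, summarize(load_items('results_2_2_01.json')) should report 100 items after a complete crawl. If the file is left over from an earlier run, delete it first: -o appends to an existing file, and the combined output is no longer valid JSON.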

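Improvement

The next page may not exist, so it is safer to add a check and request the next page only when there is one.

This takes just one extra line, an if statement, in the code above.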

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_full_url = response.urljoin(next_page)
            yield scrapy.Request(next_full_url, callback=self.parse)

The request is sent to the server only if next_page is not None, that is, only if a next-page link was found.

    Original author: Tim_Lee
    Original URL: https://www.jianshu.com/p/4f3183358206