Scrapy Getting-Started Example

September 7, 2023. Source: zpxuzhen

Scrapy tutorials:

Official: "Scrapy 1.5 documentation"

Chinese: "Scrapy 0.24.1 documentation"

Environment:

Python 2.7.12
Scrapy 0.24.1
Ubuntu 16.04

Installation steps:

pip install scrapy==0.24.1

pip install service_identity==17.0.0

Creating a project

scrapy startproject tutorial

Our first Spider

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

How to run our spider

scrapy crawl quotes
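
If the crawl succeeds, the project directory should now contain quotes-1.html and quotes-2.html. The names come from the second-to-last segment of each URL. The stand-alone sketch below reproduces that step without Scrapy; the helper name page_filename is only illustrative.

```python
def page_filename(url):
    """Build the output filename the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # ['page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


for url in ['http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/']:
    print(page_filename(url))
```

Note that this relies on the trailing slash: for a URL ending in /page/1 the same index would return 'page' instead of the number.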

A shortcut to the start_requests method

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The two forms above are equivalent; start_urls is simply a shorthand for start_requests.
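
One way to picture the equivalence: when a spider defines only start_urls, the inherited start_requests turns each URL into a request that is handled by parse. The sketch below models that idea in plain Python. MiniSpider and the (url, callback) tuples are simplified stand-ins, not Scrapy's real classes.

```python
class MiniSpider(object):
    """Simplified stand-in for scrapy.Spider (illustration only)."""
    start_urls = []

    def start_requests(self):
        # Default behaviour: one "request" per start URL, handled by parse().
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, response):
        raise NotImplementedError


class ShortcutSpider(MiniSpider):
    # Relies on the inherited start_requests().
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        return 'parsed %s' % response


class ExplicitSpider(MiniSpider):
    # Spells out the same requests by hand.
    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/']
        for url in urls:
            yield (url, self.parse)

    def parse(self, response):
        return 'parsed %s' % response


shortcut_urls = [url for url, _ in ShortcutSpider().start_requests()]
explicit_urls = [url for url, _ in ExplicitSpider().start_requests()]
print(shortcut_urls == explicit_urls)
```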

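To create a Spider, you must subclass scrapy.Spider and define the following three members:
name: identifies the Spider. It must be unique; you cannot give the same name to different Spiders.
start_urls: the list of URLs the Spider starts crawling from. The first pages fetched will therefore come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
parse(): a method of the spider. It is called with the Response object produced once each start URL has been downloaded, passed as its only argument. The method is responsible for parsing the response data, extracting scraped data (generating items), and generating Request objects for further URLs to follow.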

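Introduction to Selectors

A Selector has four basic methods:
xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
css(): takes a CSS expression and returns a selector list of all nodes matching it.
extract(): serializes the matched nodes to unicode strings and returns them as a list.
re(): extracts data using the given regular expression and returns a list of unicode strings.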

Extracting data

The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

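Once the shell has loaded, the response is available in a local variable named response. Entering response.body prints the response body, and response.headers shows the response headers.

You can fetch data with response.selector.xpath() and response.selector.css(), with the shortcuts response.xpath() and response.css(), or even with sel.xpath() and sel.css(); they are all equivalent.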

# Try these css calls and see what they print
response.css('title')
response.css('title').extract()
response.css('title::text')
response.css('title::text').extract()
# response.css('title::text').extract_first()  # extract_first is not available in 0.24.1
response.css('title::text')[0].extract()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')

# Try these xpath calls and see what they print
response.xpath('//title')
response.xpath('//title').extract()
response.xpath('//title/text()')
response.xpath('//title/text()').extract()
# response.xpath('//title/text()').extract_first()  # extract_first is not available in 0.24.1
response.xpath('//title/text()')[0].extract()
response.xpath('//title/text()').re(r'Quotes.*')
response.xpath('//title/text()').re(r'Q\w+')
response.xpath('//title/text()').re(r'(\w+) to (\w+)')

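The two groups differ only in the css or xpath expression; the calls otherwise correspond one to one, and their output is essentially the same apart from a few details.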
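
The three re() calls in each group differ only in the pattern. As a rough model of what they return, the sketch below applies the same patterns with Python's re module. It assumes the title text is 'Quotes to Scrape' and that re() returns a flat list of matches or captured groups; flat_findall is a hypothetical helper, not part of Scrapy.

```python
import re


def flat_findall(pattern, text):
    """Return all matches as one flat list, flattening group tuples."""
    results = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            results.extend(match)   # several groups: one tuple per match
        else:
            results.append(match)   # no group or a single group
    return results


title = 'Quotes to Scrape'  # assumed title text of the page
whole_match = flat_findall(r'Quotes.*', title)
first_word = flat_findall(r'Q\w+', title)
two_groups = flat_findall(r'(\w+) to (\w+)', title)

print(whole_match)
print(first_word)
print(two_groups)
```

A pattern without groups returns the whole match, while a pattern with groups returns only the captured parts.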

Extracting data in our spider

import scrapy
from tutorial.items import QuotesItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
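            # No trailing commas here: a trailing comma would store a tuple.
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            # Keep every tag of the quote, not only the first one.
            item['tags'] = quote.css('div.tags a.tag::text').extract()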
            yield item

The Spider returns the scraped data as Item objects, so QuotesItem also needs to be defined in items.py:

import scrapy

class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Run the spider and check the log output to confirm there are no errors; otherwise go back and fix the code above.

scrapy crawl quotes
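
When checking the log, look at the type of each field value as well as its content. Item fields are assigned with dict-style syntax, and in Python a trailing comma after the right-hand side stores a one-element tuple instead of a string. The sketch below uses a plain dict as a stand-in for the item and a made-up author value.

```python
item = {}

# Trailing comma: the value becomes a one-element tuple.
item['author'] = 'Example Author',
with_comma = item['author']

# No trailing comma: the value is the string itself.
item['author'] = 'Example Author'
without_comma = item['author']

print(repr(with_comma))
print(repr(without_comma))
```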

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

You can also use other formats, like JSON Lines:

scrapy crawl quotes -o quotes.jl
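
The two formats differ in layout: a .json feed is a single JSON array, while a .jl feed holds one JSON object per line, which makes it easier to append to or to process line by line. A minimal sketch with the standard json module, using made-up records rather than real crawl output:

```python
import json

# Hypothetical records standing in for scraped items.
records = [
    {'title': 'first quote', 'author': 'Author A', 'tags': ['x', 'y']},
    {'title': 'second quote', 'author': 'Author B', 'tags': ['z']},
]

# JSON: the whole feed is one array and must be parsed as a unit.
as_json = json.dumps(records)

# JSON Lines: one object per line, so each line parses on its own.
as_jsonlines = '\n'.join(json.dumps(record) for record in records)

print(json.loads(as_json) == records)
print([json.loads(line) for line in as_jsonlines.splitlines()] == records)
```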

Following links

import scrapy
from tutorial.items import QuotesItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
            # No trailing commas here: a trailing comma would store a tuple.
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            # Keep every tag of the quote, not only the first one.
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
        # extract() returns an empty list on the last page, which has no "next" link.
        next_page = response.css('li.next a::attr(href)').extract()
        if next_page:
            next_url = 'http://quotes.toscrape.com' + next_page[0]
            yield scrapy.Request(next_url, callback=self.parse)

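The key step is that the li.next a::attr(href) selector yields '/page/2/'. Passing the resulting URL to scrapy.Request with parse as the callback crawls '/page/2/' in turn, and repeating this on every page is what produces the link-following behaviour.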
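
The follow-the-link loop can be modelled without any network access. In the sketch below, FAKE_SITE is a made-up mapping from each page URL to that page's next-page href (None on the last page), and crawl is a hypothetical helper that mimics how each parse() call schedules the next request.

```python
BASE = 'http://quotes.toscrape.com'

# Made-up three-page site: URL -> href of the "next" link, or None.
FAKE_SITE = {
    BASE + '/page/1/': '/page/2/',
    BASE + '/page/2/': '/page/3/',
    BASE + '/page/3/': None,
}


def crawl(start_url):
    """Visit pages in order by following each page's next link."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_href = FAKE_SITE[url]
        # Stop when there is no next link, otherwise build the next URL.
        url = BASE + next_href if next_href is not None else None
    return visited


pages = crawl(BASE + '/page/1/')
print(pages)
```

Scrapy reaches the same result asynchronously: each new request is queued, and parse() runs again when its response arrives.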

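Newer versions provide response.urljoin, which replaces the manual URL concatenation used here.

The latest versions also provide response.follow, which replaces the two-step combination of response.urljoin and scrapy.Request.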
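
The URL-joining step can be tried without Scrapy, since the standard library's urljoin resolves a link against a base URL in the same spirit as response.urljoin. In this sketch the variable page_url stands in for response.url.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

page_url = 'http://quotes.toscrape.com/page/1/'  # stands in for response.url
next_href = '/page/2/'                           # value of the "next" link

# Manual concatenation only works for links that start at the site root.
manual = 'http://quotes.toscrape.com' + next_href

# urljoin also copes with links that are already absolute.
joined = urljoin(page_url, next_href)
already_absolute = urljoin(page_url, 'http://quotes.toscrape.com/page/3/')

print(manual == joined)
print(already_absolute)
```

Manual concatenation would turn an already-absolute link into a malformed URL, which is the main reason to prefer a join function.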

    Original author: zpxuzhen
    Original URL: https://www.jianshu.com/p/b9d7e7bc3e7b