Minimal Scrapy Spider 1: Scraping a Single Page

April 23, 2023. Source: Tim_Lee

Environment:
* Python 2.7.12
* Scrapy 1.2.2
* Mac OS X 10.10.3 Yosemite

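The Scrapy 1.2.2 documentation provides a website for practice:

http://quotes.toscrape.com

There is no need to worry about the spider being blocked for now, which makes the site suitable for beginner practice.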

Goal

Scrape each quote, its author, and its tags from the site.

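Complete code

Step 1: Create the project

In the directory where you want to keep the project, run this on the command line:

scrapy startproject quotes_2

Here scrapy startproject is the command and quotes_2 is the project name, which you can choose freely.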

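Step 2: Write the spider

To begin, aim for one small goal: scrape only the first page.

The project contains a spiders folder (/quotes_2/quotes_2/spiders/ in this example). Create a new spider file there named quotes_2_1.py with the following content: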

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes_2_1'
    start_urls = [
        'http://quotes.toscrape.com'
    ]
    allowed_domains = [
        'toscrape.com'
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'quote': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                # ::text returns only the tag names, not the <a> markup.
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Code walkthrough

import scrapy
Imports the scrapy package.
The three essentials: name, start_urls, and parse()
```
class QuotesSpider(scrapy.Spider):
    name = 'quotes_2_1'
    start_urls = [
        'http://quotes.toscrape.com'
    ]

    def parse(self, response):
        pass  # parsing logic goes here
```
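A basic spider like this one relies on these three members.
- name: the spider's name, a string. Single and double quotes are equivalent in Python source; double quotes are only required when quoting arguments on the Windows command line.
- start_urls: the URLs to start from. It is a list, so the square brackets are needed even for a single URL.
- parse(): the parsing callback. It processes the response returned by the server and takes the parameters (self, response). For a site-wide crawl, a CrawlSpider with rules can be used instead.
In addition, the optional allowed_domains attribute limits the crawl scope. Set it if you do not want the spider to follow links to external sites.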
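
To make the effect of allowed_domains concrete, here is a minimal sketch of the kind of check it implies: a URL is in scope when its host equals an allowed domain or is a subdomain of one. The helper is_allowed is hypothetical and written only for illustration. It is not Scrapy's own implementation, which performs this filtering internally.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

ALLOWED_DOMAINS = ['toscrape.com']


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Hypothetical helper: True if the URL's host is an allowed domain
    or a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


print(is_allowed('http://quotes.toscrape.com/page/2/'))  # True
print(is_allowed('https://www.goodreads.com/quotes'))    # False
```

Note that 'toscrape.com' also covers the subdomain quotes.toscrape.com, which is why the spider lists the bare domain.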

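The parse() function

      for quote in response.css('div.quote'):

response.css is Scrapy's CSS selector. The expression in the parentheses specifies which elements to locate. Here 'div.quote' matches every div element whose class is "quote".

This works because the page source shows that each quote sits inside a div with the class "quote":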

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
        <span>by <small class="author" itemprop="author">Steve Martin</small>
        <a href="/author/Steve-Martin">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile"> 
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
            <a class="tag" href="/tag/obvious/page/1/">obvious</a>
            
            <a class="tag" href="/tag/simile/page/1/">simile</a>
            
        </div>
</div>

Once every quote has been located, Python's for ... in ... loop iterates over them.

          yield {
              'quote': quote.css('span.text::text').extract_first(),
              'author': quote.css('small.author::text').extract_first(),
              'tags': quote.css('div.tags a.tag::text').extract(),
          }

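For each quote, yield a dict containing the fields you need. It is an ordinary Python dict, so the key-value pairs are separated by commas. Each field still has to be located, but because we are inside the loop, the lookup is done relative to the current quote with quote.css(), and the result is then extracted.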
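
The yield pattern can be tried without Scrapy. The following is a minimal sketch in plain Python, assuming the rows have already been extracted. The function parse_rows and its input data are made up for illustration and only mimic how parse() produces one dict per quote.

```python
def parse_rows(rows):
    """Yield one dict per (quote, author, tags) row, the way parse()
    yields one item per div.quote."""
    for text, author, tags in rows:
        yield {
            'quote': text,
            'author': author,
            'tags': tags,
        }


# Hypothetical input standing in for the elements found on the page.
rows = [
    ('A day without sunshine is like, you know, night.',
     'Steve Martin',
     ['humor', 'obvious', 'simile']),
]

items = list(parse_rows(rows))
print(items[0]['author'])  # Steve Martin
```

Because the function yields instead of returning a list, items are produced one at a time as the caller consumes them.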

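There are two key points:

::text: an extension to CSS selector syntax provided by Scrapy that selects the element's text content. Without it, the extracted string is the whole element, markup included.
.extract(): returns all matches as a list of strings. To get only the first match, use .extract_first() or .extract()[0]. The former is recommended: when nothing matches, .extract_first() returns None, whereas .extract()[0] raises an IndexError.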
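
The difference between the two ways of taking the first match can be reproduced with an ordinary list. This is a minimal sketch, not Scrapy code: first_or_none is a hypothetical helper that imitates the behavior described for .extract_first(), and the empty list stands in for what .extract() returns when a selector matches nothing.

```python
def first_or_none(results, default=None):
    """Hypothetical stand-in for .extract_first(): first item, or a default."""
    return results[0] if results else default


matches = []  # stands in for .extract() on a selector with no matches

print(first_or_none(matches))  # None

try:
    matches[0]  # the .extract()[0] style
except IndexError as exc:
    print('IndexError: %s' % exc)
```

With the first style a missing field simply becomes None in the output, while the second style aborts the callback for that page.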

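Step 3: Run the spider

Open a command line in the project directory (/quotes_2/ in this example).

List the spiders: run scrapy list. If the spider code has no problems, the spider's name is printed. If something is wrong, an error is reported.
Run the spider: scrapy crawl quotes_2_1 -o result_2_1_01.json produces the following result (middle entries omitted):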
```
[
{"quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
......
{"quote": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
```
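Where:
scrapy crawl: the command that runs a spider.
quotes_2_1: the spider's name, as set by name = 'quotes_2_1' in quotes_2_1.py.
-o result_2_1_01.json: writes the output to a JSON file. -o is optional and the file name is arbitrary, but the extension selects the output format, commonly json, jl (JSON Lines), or csv.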
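
To check the exported data, the file can be loaded with Python's standard json module. The sketch below parses an abbreviated one-entry sample in the same shape as the output shown earlier, instead of opening the real file. The jl content at the end is made up to show how that format differs: one JSON object per line, with no enclosing list.

```python
import json

# Abbreviated sample in the shape of result_2_1_01.json (one entry only).
sample = r'''[
{"quote": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]'''

items = json.loads(sample)
for item in items:
    print('%s: %s' % (item['author'], ', '.join(item['tags'])))

# Hypothetical .jl content: one JSON object per line.
jl_sample = '{"author": "Albert Einstein"}\n{"author": "Steve Martin"}\n'
authors = [json.loads(line)['author']
           for line in jl_sample.splitlines() if line.strip()]
print(authors)
```

For the real file, replace json.loads(sample) with json.load() on the opened result_2_1_01.json.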

Original author: Tim_Lee
Original article: https://www.jianshu.com/p/9dabbd15bea4