Scrapy Getting-Started Example
Scrapy tutorial:
In Chinese: "Scrapy 0.24.1 Documentation"
Environment:
- Python 2.7.12
- Scrapy 0.24.1
- Ubuntu 16.04
Installation steps:
pip install scrapy==0.24.1
pip install service_identity==17.0.0
Creating a project
scrapy startproject tutorial
Our first Spider
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that "scrapy crawl" uses to select this spider.
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends with ".../page/<n>/", so element [-2] is the page number.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
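The file name in parse() comes from splitting the response URL on "/". A minimal sketch of just that step, using only plain Python (the helper name page_filename is made up for illustration and is not part of Scrapy):

```python
def page_filename(url):
    # 'http://host/page/1/'.split('/') ends with [..., 'page', '1', ''],
    # so index -2 holds the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))
print(page_filename('http://quotes.toscrape.com/page/2/'))
```

Note that this relies on the trailing slash in the URL; without it, index -2 would be the literal string 'page'.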
How to run our spider
scrapy crawl quotes
A shortcut to the start_requests method
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy builds the initial requests from this list and sends
    # each response to parse().
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
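The two versions above are equivalent; start_urls is simply a shorthand for start_requests.
To create a Spider, you must subclass scrapy.Spider and define the following three attributes:
name: identifies the Spider. It must be unique; you cannot give the same name to different Spiders.
start_urls: the list of URLs the Spider starts crawling from. The first pages downloaded will therefore be among these. Subsequent URLs are extracted from the data retrieved from the initial URLs.
parse(): a method of the spider. It is called with the Response object of each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting the scraped data (as items), and producing Request objects for further URLs to follow.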
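The relationship between start_urls and start_requests can be pictured with a small Scrapy-free model. This is only a sketch under stated assumptions: MiniSpider and MiniQuotesSpider are hypothetical stand-ins, a "request" is modelled as a plain (url, callback) tuple, and the real scrapy.Spider does more than this.

```python
class MiniSpider(object):
    """Hypothetical stand-in for scrapy.Spider (not Scrapy code)."""
    start_urls = []

    def start_requests(self):
        # Behaviour being modelled: one request per start URL,
        # each one routed to self.parse.
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, response):
        raise NotImplementedError


class MiniQuotesSpider(MiniSpider):
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        return 'parsed %s' % response


spider = MiniQuotesSpider()
for url, callback in spider.start_requests():
    # Pass the URL in place of a real response, just to show the wiring.
    print(callback(url))
```

Defining only start_urls is enough here because the inherited start_requests does the looping, which is the same idea as the shortcut described above.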
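Introduction to Selectors
A Selector has four basic methods:
xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
css(): takes a CSS expression and returns a selector list of all nodes matching it.
extract(): serializes the matched nodes to unicode strings and returns them as a list.
re(): applies the given regular expression to the data and returns a list of unicode strings.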
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
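Once the shell has loaded, you have a local response variable holding the response data. Entering response.body prints the response body, and entering response.headers shows the response headers.
You can extract data with response.selector.xpath() and response.selector.css(), with response.xpath() and response.css(), or even with sel.xpath() and sel.css(); they are all equivalent.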
# Try these css calls and see what each one returns
response.css('title')
response.css('title').extract()
response.css('title::text')
response.css('title::text').extract()
# response.css('title::text').extract_first()  # extract_first is not available in 0.24.1
response.css('title::text')[0].extract()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')
# Try these xpath calls and see what each one returns
response.xpath('//title')
response.xpath('//title').extract()
response.xpath('//title/text()')
response.xpath('//title/text()').extract()
# response.xpath('//title/text()').extract_first()  # extract_first is not available in 0.24.1
response.xpath('//title/text()')[0].extract()
response.xpath('//title/text()').re(r'Quotes.*')
response.xpath('//title/text()').re(r'Q\w+')
response.xpath('//title/text()').re(r'(\w+) to (\w+)')
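The css and xpath calls above differ only in the selector expression; everything else corresponds one to one, and the output is essentially the same apart from a few cases.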
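To see what the three re() calls are matching, the patterns can be tried with the standard re module alone. This is a rough approximation, not Scrapy's implementation: it assumes the title text is 'Quotes to Scrape', and flat_re is a hypothetical helper that mimics how re() returns a flat list of strings (captured groups when the pattern has them, the whole match otherwise).

```python
import re

# Assumed text of the <title> element on the page.
title = u'Quotes to Scrape'


def flat_re(pattern, text):
    """Return matches as one flat list of strings."""
    results = []
    for match in re.finditer(pattern, text):
        if match.groups():
            # Pattern has capture groups: keep only the groups.
            results.extend(match.groups())
        else:
            # No groups: keep the whole matched text.
            results.append(match.group(0))
    return results


print(flat_re(r'Quotes.*', title))
print(flat_re(r'Q\w+', title))
print(flat_re(r'(\w+) to (\w+)', title))
```

The last pattern is the interesting one: it has two groups, so the result contains the two captured words rather than the full matched string.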
Extracting data in our spider
import scrapy
from tutorial.items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            item = QuotesItem()
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            # extract() on the whole selector list keeps every tag as a list.
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
The Spider returns the scraped data as Item objects, so you also need to define QuotesItem in items.py:
import scrapy


class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Run the spider and check the log output to confirm there are no errors; if there are, go back and fix the code above:
scrapy crawl quotes
Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
You can also use other formats, like JSON Lines:
scrapy crawl quotes -o quotes.jl
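The practical difference between the two formats is that a .json feed is one JSON array, while a .jl feed holds one JSON object per line and can be read line by line. A small sketch with the standard json module; the items below are made-up placeholders, not real scraped data:

```python
import json

# Placeholder items with the same fields as QuotesItem.
items = [
    {'title': 'placeholder quote 1', 'author': 'Author A', 'tags': ['tag1']},
    {'title': 'placeholder quote 2', 'author': 'Author B', 'tags': ['tag2']},
]

# JSON: the whole feed is a single array.
as_json = json.dumps(items)

# JSON Lines: one independent JSON object per line.
as_json_lines = '\n'.join(json.dumps(item) for item in items)

# Reading back: JSON needs the full document, JSON Lines can be
# parsed one line at a time.
parsed_from_json = json.loads(as_json)
parsed_from_lines = [json.loads(line) for line in as_json_lines.splitlines()]

print(as_json_lines)
```

Because each line stands alone, a JSON Lines file can be appended to or processed as a stream without loading everything into memory.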
Following links
import scrapy
from tutorial.items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            # extract() on the whole selector list keeps every tag as a list.
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

        # The last page has no "next" link, so check the list before indexing.
        next_links = response.css('li.next a::attr(href)').extract()
        if next_links:
            next_page = 'http://quotes.toscrape.com' + next_links[0]
            yield scrapy.Request(next_page, callback=self.parse)
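The key step is that the selector response.css('li.next a::attr(href)') yields '/page/2/'. The spider then passes the full URL to scrapy.Request with parse as the callback again, so '/page/2/' is crawled by the same method, and this repeats page after page. That is how link following is achieved.
Newer versions of Scrapy provide response.urljoin to replace the manual URL concatenation used here,
and later versions also add response.follow, which combines the response.urljoin and scrapy.Request steps into one.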
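On a version without response.urljoin, the standard library's urljoin is one way to avoid hard-coding the site prefix. A minimal sketch; in a spider, page_url would be response.url and href the extracted link, and the '../2/' form is only a hypothetical relative link to show a case where plain string concatenation would give the wrong URL:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7, as used in this tutorial

page_url = 'http://quotes.toscrape.com/page/1/'

# Root-relative link, as extracted from the "next" button.
next_page = urljoin(page_url, '/page/2/')

# Hypothetical path-relative link resolving to the same page.
next_page_relative = urljoin(page_url, '../2/')

print(next_page)
print(next_page_relative)
```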