Today is June 26, 2016, and I'm starting to learn web crawling.
The package I'll use is Scrapy.
Anaconda3 is already installed in a Linux virtual machine, and Scrapy is installed, version 1.1.
I'm following this page as the tutorial: https://doc.scrapy.org/en/1.1/intro/tutorial.html
I've used crawlers before, but only very simple ones. Now I need to crawl weather, earthquake, and emergency-event information, so I'm trying Scrapy for the job.
First, create a project. I name it in pinyin: tianqi.
The command is:
scrapy startproject tianqi
A tianqi directory now exists under my home directory.
Enter it and list its contents:
[root@wangqi tianqi]# ls -l
total 8
-rw-rw-r-- 1 eyeglasses root 256 Jun 26 14:21 scrapy.cfg
drwxrwxr-x 4 eyeglasses root 4096 Jun 26 14:21 tianqi
There is one directory and one file.
The file is scrapy.cfg; judging by the name it's a configuration file, so let's see what's in it.
——————————————————————————————
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html
[settings]
default = tianqi.settings
[deploy]
#url = http://localhost:6800/
project = tianqi
———————————————————————————————————-
Nothing much in there.
Now look inside the tianqi directory:
[root@wangqi tianqi]# ls -l
total 20
-rw-rw-r-- 1 eyeglasses root 0 Jul 14 2016 __init__.py
-rw-r--r-- 1 eyeglasses root 285 Jun 26 14:21 items.py
-rw-r--r-- 1 eyeglasses root 286 Jun 26 14:21 pipelines.py
drwxrwxr-x 2 eyeglasses root 4096 Jun 26 15:13 __pycache__
-rw-r--r-- 1 eyeglasses root 3128 Jun 26 14:21 settings.py
drwxrwxr-x 3 eyeglasses root 4096 Jun 26 15:49 spiders
There are 4 files and 2 directories. The files are __init__.py, items.py, pipelines.py, and settings.py; they all end in .py, so they are Python source files. Let's look at each one.
[root@wangqi tianqi]# vi __init__.py
Opening it shows the file is completely empty. Every Python package contains an __init__.py file (it can define the package's attributes and methods); even when it is empty, its presence is what lets the directory be treated as a package and imported as a module.
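Since __init__.py makes tianqi a package, the project's modules can be imported like any other Python package. A minimal sketch (run from the project root, the directory containing scrapy.cfg; the module names are the ones listed above):
——————————————————————————
# Run from the project root; the empty __init__.py is what makes
# "tianqi" importable as a package.
from tianqi import settings        # loads tianqi/settings.py
from tianqi.items import TianqiItem

print(settings.BOT_NAME)           # -> 'tianqi'
print(TianqiItem())                # an empty item, no fields defined yet
——————————————————————————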
———————————————————————
[root@wangqi tianqi]# vi items.py    # Items are the containers that hold the scraped data
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class TianqiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
——————————————————————————–
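An Item works like a dict whose keys are declared in advance with scrapy.Field(). For the weather use case the class will eventually need some fields; the names below (city, date, weather) are just my own placeholders, not something the tutorial or the generated file defines:
——————————————————————————
import scrapy

class TianqiItem(scrapy.Item):
    # Hypothetical fields for a weather item; rename them to match
    # whatever actually gets scraped from the pages.
    city = scrapy.Field()
    date = scrapy.Field()
    weather = scrapy.Field()

# usage: item = TianqiItem(city='chengdu'); item['weather'] = 'sunny'
——————————————————————————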
Next up is pipelines.py; this is the item pipeline.
————————————————————–
[root@wangqi tianqi]# vi pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class TianqiPipeline(object):
    def process_item(self, item, spider):
        return item
——————————————————————–
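As the comment says, a pipeline only runs when it is listed in the ITEM_PIPELINES setting. A minimal sketch of what process_item could grow into (writing JSON Lines is my own illustration, not part of the generated project):
——————————————————————————
import json

class TianqiPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # every scraped item passes through here
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
——————————————————————————
To enable it, settings.py would need ITEM_PIPELINES = {'tianqi.pipelines.TianqiPipeline': 300} (the number controls the order when several pipelines are chained).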
Now look at settings.py. It holds the project-wide configuration, including the robots.txt rules; the robots protocol defines which parts of a site crawlers are allowed to visit.
[root@wangqi tianqi]# vi settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for tianqi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tianqi'
SPIDER_MODULES = ['tianqi.spiders']
NEWSPIDER_MODULE = 'tianqi.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tianqi (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
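Most of the file is commented-out defaults. For a small weather crawler the usual edits would look something like the following; the concrete values are my own guesses at polite settings, not recommendations from the tutorial:
——————————————————————————
# settings.py - crawl politely
ROBOTSTXT_OBEY = True                 # respect robots.txt (already on by default here)
DOWNLOAD_DELAY = 3                    # wait 3 seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep concurrency low
USER_AGENT = 'tianqi (+http://www.yourdomain.com)'   # identify the bot
——————————————————————————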
Now for the two directories. First, __pycache__:
The first time a Python module runs, the interpreter compiles the *.py source and saves the result in the __pycache__ directory.
On the next run, if the interpreter sees that the *.py file hasn't changed, it skips the compile step and runs the cached *.pyc file from __pycache__ directly.
—————————————————————–
[root@wangqi __pycache__]# ls -l
total 8
-rw-r--r-- 1 eyeglasses root 125 Jun 26 15:13 __init__.cpython-35.pyc
-rw-r--r-- 1 eyeglasses root 240 Jun 26 15:13 settings.cpython-35.pyc
———————————————————————————
Disabling pycache:
One-off: run the script with the -B flag.
Permanent: set the environment variable PYTHONDONTWRITEBYTECODE=1.
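The same switch is visible from inside Python as sys.dont_write_bytecode (it is True when the interpreter was started with -B or with PYTHONDONTWRITEBYTECODE set), and it can also be flipped at runtime. A small sketch:
——————————————————————————
import sys

print(sys.dont_write_bytecode)   # True if started with -B or PYTHONDONTWRITEBYTECODE=1
sys.dont_write_bytecode = True   # stop writing .pyc files from this point on
——————————————————————————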
That leaves the last directory, spiders. This is where the crawler code goes, one module per site you want to crawl; I usually name each spider after the domain, which makes it easy to remember.
[root@wangqi spiders]# ls -l
total 12
-rw-rw-r-- 1 eyeglasses root 161 Jul 14 2016 __init__.py
drwxrwxr-x 2 eyeglasses root 4096 Jun 26 15:21 __pycache__
-rw-r--r-- 1 eyeglasses root 589 Jun 26 15:10 weather_spider.py
There are three entries here. weather_spider.py is one I created myself to crawl the weather site; it contains the crawling code. __init__.py and __pycache__ were explained in detail above.
—————————————————————————
[root@wangqi spiders]# vi weather_spider.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import scrapy

class WeatherSpider(scrapy.Spider):
    name = "weather"

    def start_requests(self):
        urls = [
            'http://sc.weather.com.cn/chengdu/index.shtml',
            'http://sc.weather.com.cn/neijiang/index.shtml',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'weather-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
———————————————————————————————–
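Before running it, one thing worth noting from the tutorial: instead of writing start_requests(), a spider can simply list start_urls and Scrapy will generate the initial requests and send each response to parse() automatically. A sketch of that shorter variant of the same spider:
——————————————————————————
import scrapy

class WeatherSpider(scrapy.Spider):
    name = "weather"
    # Scrapy builds the initial Requests from this list and
    # calls parse() with each response.
    start_urls = [
        'http://sc.weather.com.cn/chengdu/index.shtml',
        'http://sc.weather.com.cn/neijiang/index.shtml',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'weather-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
——————————————————————————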
Time to run it. Enter the command:
[eyeglasses@wangqi spiders]$ scrapy crawl weather
2017-06-27 10:18:31 [scrapy] INFO: Scrapy 1.1.1 started (bot: tianqi)
2017-06-27 10:18:31 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tianqi.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tianqi', 'SPIDER_MODULES': ['tianqi.spiders']}
2017-06-27 10:18:31 [scrapy] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2017-06-27 10:18:31 [scrapy] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-27 10:18:31 [scrapy] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-27 10:18:31 [scrapy] INFO: Enabled item pipelines:[]
2017-06-27 10:18:31 [scrapy] INFO: Spider opened
2017-06-27 10:18:31 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-27 10:18:31 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-27 10:18:31 [scrapy] DEBUG: Redirecting (302) to <...> from <...>
2017-06-27 10:18:31 [scrapy] DEBUG: Crawled (200) <...> (referer: None)
2017-06-27 10:18:32 [scrapy] DEBUG: Crawled (200) <...> (referer: None)
2017-06-27 10:18:32 [weather] DEBUG: Saved file quotes-neijiang.html
2017-06-27 10:18:32 [scrapy] DEBUG: Crawled (200) <...> (referer: None)
2017-06-27 10:18:32 [weather] DEBUG: Saved file quotes-chengdu.html
2017-06-27 10:18:32 [scrapy] INFO: Closing spider (finished)
2017-06-27 10:18:32 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 969,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 34374,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 27, 2, 18, 32, 553229),
 'log_count/DEBUG': 7,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 6, 27, 2, 18, 31, 471145)}
2017-06-27 10:18:32 [scrapy] INFO: Spider closed (finished)
Checking the directory, the HTML files have been generated. Success.