Task
Recursively scrape Jianshu user profiles with Scrapy
Explanation: starting from an initial user URL, we request the page, parse out the users in that user's following and follower lists, extract their URLs, and repeat the process over and over. For each user we collect:
nickname - nickname
users they follow - followed
followers - following
number of articles - articles
total characters written - charlength
likes received - likes
Create the Scrapy project
scrapy startproject JianShu
Generate the spider
Change into the project folder, here the JianShu folder:
cd JianShu
Then generate the spider. Note that the spider name must not be the same as the project name.
scrapy genspider <spider_name> <domain>
scrapy genspider jianshu www.jianshu.com
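After these two commands, the generated project should look roughly like this (the standard layout Scrapy creates; jianshu.py is the spider we just generated):

JianShu/
    scrapy.cfg
    JianShu/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            jianshu.py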
Build the spider's functional modules
Broadly speaking, the Scrapy framework breaks down as follows:
the spider script under spiders/ holds the business logic: it issues the requests and parses the responses;
middleware disguises the requests (fake headers, proxies, and so on);
item wraps the data parsed in the spider into a data container
and hands it to a pipeline, which saves it to CSV, TXT, or a database;
settings.py stores the project's configuration;
main.py is the entry script that starts the crawl.
Faking the request headers
To impersonate a browser more convincingly and avoid being banned,
we swap in a different User-Agent for each request; in Scrapy this is done with a middleware.
A middleware is a hook framework that plugs into Scrapy's processing pipeline; a downloader middleware in particular lets you add code that modifies each request before it is sent (and each response before it reaches the spider).
Step 1
Create a middleware (HeadersDownloaderMiddleware)
middlewares.py
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from JianShu.settings import UserAgentList
import random

class HeadersDownloaderMiddleware(UserAgentMiddleware):
    """
    Attach a randomly chosen fake User-Agent header to every request.
    """
    def process_request(self, request, spider):
        ua = random.choice(UserAgentList)
        if ua:
            request.headers.setdefault('User-Agent', ua)
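If you want to confirm that the random header is actually applied, one option (a small optional addition, using spider.logger, Scrapy's per-spider logger) is to log it inside process_request:

    def process_request(self, request, spider):
        ua = random.choice(UserAgentList)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # optional: show the chosen header when LOG_LEVEL is DEBUG
            spider.logger.debug('Using User-Agent: %s', ua)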
Step 2
In settings.py, add the list of browser User-Agent strings (UserAgentList) and uncomment DOWNLOADER_MIDDLEWARES to activate the middleware.
settings.py
DOWNLOADER_MIDDLEWARES = {
'JianShu.middlewares.HeadersDownloaderMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
UserAgentList = ["Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)"
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
]
The middleware path in settings.py's DOWNLOADER_MIDDLEWARES follows the pattern:
yourproject.middlewares(the file name).MiddlewareClassName
If both the middleware class and its file were named RandomUserAgentMiddleware, the path would be:
yourproject.RandomUserAgentMiddleware.RandomUserAgentMiddleware
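For this project, for example, the middleware file is middlewares.py and the class is HeadersDownloaderMiddleware, so the entry reads:

DOWNLOADER_MIDDLEWARES = {
    'JianShu.middlewares.HeadersDownloaderMiddleware': 400,
}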
Item container: organizing the data
Think of an Item as a container for the scraped data, much like a dict, except it comes with extra functionality and can be passed around freely inside Scrapy.
items.py
from scrapy import Item, Field

class JianshuItem(Item):
    nickname = Field()
    description = Field()
    followed = Field()
    following = Field()
    articles = Field()
    charlength = Field()
    likes = Field()
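As a quick illustration (a toy snippet, not part of the project code, assuming JianshuItem has been imported from JianShu.items): an Item is filled and read exactly like a dict, but assigning to a field that was not declared above raises a KeyError, which catches typos early.

item = JianshuItem()
item['nickname'] = 'demo user'   # dict-style assignment to a declared field
item['articles'] = '42'
print(dict(item))                # {'nickname': 'demo user', 'articles': '42'}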
Pipeline: saving to a CSV file
Once the data has been organized into an item, a pipeline can save it to a CSV file.
pipelines.py
import csv

class CSVPipeline(object):
    def __init__(self):
        # initialize the CSV file and write the header row
        self.csvf = open('data.csv', 'a+', encoding='gbk', newline='')
        self.writer = csv.writer(self.csvf)
        self.writer.writerow(('nickname', 'description', 'followed', 'following', 'articles', 'charlength', 'likes'))
        self.csvf.close()

    def process_item(self, item, spider):
        with open('data.csv', 'a+', encoding='gbk', newline='') as f:
            writer = csv.writer(f)
            writer.writerow((item['nickname'], item['description'], item['followed'], item['following'], item['articles'], item['charlength'], item['likes']))
        return item
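Reopening data.csv for every single item works, but it is a little wasteful. A common variant (a sketch, using Scrapy's standard open_spider/close_spider pipeline hooks; switch the encoding to utf-8 if gbk cannot represent some nicknames) keeps one file handle open for the whole crawl:

import csv

class CSVPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the file and write the header
        self.csvf = open('data.csv', 'a+', encoding='gbk', newline='')
        self.writer = csv.writer(self.csvf)
        self.writer.writerow(('nickname', 'description', 'followed', 'following', 'articles', 'charlength', 'likes'))

    def process_item(self, item, spider):
        self.writer.writerow((item['nickname'], item['description'], item['followed'],
                              item['following'], item['articles'], item['charlength'], item['likes']))
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.csvf.close()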
Then open settings.py and uncomment ITEM_PIPELINES so the item and the pipeline connect cleanly: one organizes the data, the other saves it.
ITEM_PIPELINES = {
    'JianShu.pipelines.CSVPipeline': 300,
}
Writing the spider
The pages are parsed with XPath.
Note that response.xpath() returns Selector objects (strictly, a SelectorList), and Selector objects have an extract() method.
The snippet below extracts a single user's information: following and follower counts, article count, and so on.
nickname = response.xpath("//div[@class='main-top']/div[@class='title']/a/text()").extract()[0]
# returns a list of selector objects, one per li
info_selectors = response.xpath("//div[@class='main-top']/div[@class='info']/ul/li")
followed_url = 'https://www.jianshu.com' + info_selectors[0].xpath("./div/a/@href").extract()[0]
followed = info_selectors[0].xpath("./div/a/p/text()").extract()[0]
following = info_selectors[1].xpath("./div/a/p/text()").extract()[0]
articles = info_selectors[2].xpath("./div/a/p/text()").extract()[0]
charlength = info_selectors[3].xpath("./div/p/text()").extract()[0]
likes = info_selectors[4].xpath("./div/p/text()").extract()[0]
description = re.sub(r'\s', '', ''.join(response.xpath("//div[@class='js-intro']/text()").extract()))
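Note that extract()[0] raises an IndexError whenever a profile page is missing one of these elements. A more forgiving variant (a sketch using SelectorList.extract_first(), which is part of Scrapy's selector API) returns a default instead:

nickname = response.xpath("//div[@class='main-top']/div[@class='title']/a/text()").extract_first(default='')
likes = info_selectors[4].xpath("./div/p/text()").extract_first(default='0')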
Parsing the following list and recursively collecting more Jianshu users:
pages = int(float(followed) / 10)
for page in range(1, pages + 1):
    userlist_url = followed_url + '?page={page}'.format(page=page)
    yield Request(userlist_url, callback=self.parseuserlist, dont_filter=True)

def parseuserlist(self, response):
    url_list = response.xpath("//ul[@class='user-list']/li/div[@class='info']/a/@href").extract()
    url_list = ['https://www.jianshu.com' + url for url in url_list]
    for url in url_list:
        yield Request(url, callback=self.parse, dont_filter=True)
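One caveat: int(float(followed)/10) rounds down, so when the followed count is not an exact multiple of 10 (the page size), the last partial page of the list is skipped. If you want to include it, a ceiling division works (a small sketch; put the import at the top of the file):

import math
pages = math.ceil(float(followed) / 10)   # e.g. 23 followed users -> 3 pages instead of 2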
Putting it all together: jianshu.py
from scrapy import Spider, Request
from JianShu.items import JianshuItem
import re

class JianshuSpider(Spider):
    name = 'jianshu'
    allowed_domains = ['www.jianshu.com']
    start_urls = ['https://www.jianshu.com/u/cf09bc3817a7']

    def start_requests(self):
        start_url = 'https://www.jianshu.com/u/1562c7f16a04'
        yield Request(start_url, callback=self.parse)

    def parse(self, response):
        item = JianshuItem()
        nickname = response.xpath("//div[@class='main-top']/div[@class='title']/a/text()").extract()[0]
        info_selectors = response.xpath("//div[@class='main-top']/div[@class='info']/ul/li")
        followed_url = 'https://www.jianshu.com' + info_selectors[0].xpath("./div/a/@href").extract()[0]
        followed = info_selectors[0].xpath("./div/a/p/text()").extract()[0]
        pages = int(float(followed) / 10)
        for page in range(1, pages + 1):
            userlist_url = followed_url + '?page={page}'.format(page=page)
            yield Request(userlist_url, callback=self.parseuserlist, dont_filter=True)
        following_url = 'https://www.jianshu.com' + info_selectors[1].xpath("./div/a/@href").extract()[0]
        following = info_selectors[1].xpath("./div/a/p/text()").extract()[0]
        print(following_url, following)
        articles_url = 'https://www.jianshu.com' + info_selectors[2].xpath("./div/a/@href").extract()[0]
        articles = info_selectors[2].xpath("./div/a/p/text()").extract()[0]
        charlength = info_selectors[3].xpath("./div/p/text()").extract()[0]
        likes = info_selectors[4].xpath("./div/p/text()").extract()[0]
        description = re.sub(r'\s', '', ''.join(response.xpath("//div[@class='js-intro']/text()").extract()))
        item['nickname'] = nickname
        item['description'] = description
        item['followed'] = followed
        item['following'] = following
        item['articles'] = articles
        item['charlength'] = charlength
        item['likes'] = likes
        yield item

    def parseuserlist(self, response):
        url_list = response.xpath("//ul[@class='user-list']/li/div[@class='info']/a/@href").extract()
        url_list = ['https://www.jianshu.com' + url for url in url_list]
        for url in url_list:
            yield Request(url, callback=self.parse, dont_filter=True)
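Because every request is sent with dont_filter=True and every profile links to more profiles, this crawl never stops on its own and will revisit users it has already seen. If you only want a bounded sample, one simple option (my suggestion, using Scrapy's built-in CloseSpider extension setting) is to cap the run in settings.py:

# stop automatically after roughly this many items have been scraped
CLOSESPIDER_ITEMCOUNT = 1000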
To make debugging easier, create a main.py file in the project root:
main.py
from scrapy.cmdline import execute
import os, sys

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# note: 'jianshu' here is the spider name, not the project name
execute(['scrapy', 'crawl', 'jianshu'])
Run main.py to start crawling.
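Alternatively, start the crawl from the project root on the command line; Scrapy's -o flag can also export the scraped items directly via its feed export:

scrapy crawl jianshu
scrapy crawl jianshu -o users.csv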
Finally, here is the complete settings.py for reference:
settings.py
BOT_NAME = 'JianShu'
SPIDER_MODULES = ['JianShu.spiders']
NEWSPIDER_MODULE = 'JianShu.spiders'
"""
DOWNLOADER_MIDDLEWARES = {
'JianShu.middlewares.HeadersDownloaderMiddleware': None,
}
"""
ITEM_PIPELINES = {
'JianShu.pipelines.CSVPipeline': 300,
}
DOWNLOAD_DELAY = 0.1
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
'JianShu.middlewares.HeadersDownloaderMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
UserAgentList = ["Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)"
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
]