Simulating a Zhihu Login with Scrapy 1.4 + Python 3.6

June 11, 2019. Source: 卩丨ar丶倪儿彡

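I have recently been learning to write web crawlers in Python, using the Scrapy framework. The latest release at the time of writing is Scrapy 1.4, which supports Python 3, but most of the Chinese-language material I could find online covers older Scrapy versions that only support Python 2.7. So I decided to write down what I learned with Scrapy 1.4 for anyone who might find it useful.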

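This article explains how to simulate logging in to Zhihu. It draws on another Jianshu article, "Scrapy模拟登陆知乎" (Simulating a Zhihu Login with Scrapy), which is based on Python 2.7; if any detail here is unclear, the original article is worth reading.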

The whole login process is carried out inside the Spider class; the most basic Spider is used here.

1. In the spiders directory, create a Python file named zhihu_spider.py containing a class that inherits from Spider, and start with the basics:

import csv

from bs4 import BeautifulSoup
from scrapy import FormRequest, Request, Selector, Spider

from ..items import ZhihuItem  # assumes the default Scrapy project layout (items.py)


class ZhihuQuestionSpider(Spider):
    url_base = 'https://www.zhihu.com'  # prefix for building absolute links later
    name = 'login'  # name of the spider
    start_url_base = 'https://www.zhihu.com/collection/27915947?page='  # paginated collection URL
    start_urls = ['https://www.zhihu.com/collection/27915947']  # not actually used here
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36",
        "Referer": "http://www.zhihu.com"
    }

These are all values used later on; the comments explain each one.

2. Override the Spider's start_requests method.

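This is where the crawl starts: it requests the Zhihu home page to set up the login session. cookiejar is the meta key recognized by Scrapy's cookies middleware. The value 1 is just an identifier for the cookie jar, so every request carrying the same value shares one set of cookies, and a single jar is enough here. The fetched page is passed to the post_login method, which performs the login.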

    # Start here ↓
    def start_requests(self):
        return [Request("https://www.zhihu.com/", headers=self.headers, meta={"cookiejar": 1}, callback=self.post_login)]
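To make the role of the cookiejar identifier more concrete, here is a simplified model that uses only the standard library and keeps one independent cookie store per identifier. It is a sketch for illustration, not Scrapy's actual implementation; the names SessionStore, save, and cookies_for are hypothetical.

```python
from collections import defaultdict


class SessionStore:
    """Toy model: one independent cookie dict per cookiejar identifier."""

    def __init__(self):
        # Each new identifier lazily gets its own empty cookie dict.
        self.jars = defaultdict(dict)

    def save(self, jar_id, name, value):
        """Record a cookie received in a response tagged with jar_id."""
        self.jars[jar_id][name] = value

    def cookies_for(self, jar_id):
        """Return a copy of the cookies to send with a request tagged with jar_id."""
        return dict(self.jars[jar_id])


store = SessionStore()
store.save(1, "session", "abc")  # e.g. the login response, tagged with jar 1
store.save(2, "session", "xyz")  # a separate session kept in jar 2

print(store.cookies_for(1))  # cookies of jar 1 only
print(store.cookies_for(2))  # cookies of jar 2 only
print(store.cookies_for(3))  # an unused jar has no cookies
```

Because this spider tags every request with the same identifier, all of its requests share the cookies set during login.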

3. The post_login method performs the login. Remember to fill in your own email address and password.

    # Read the login form on the Zhihu home page and submit it
    def post_login(self, response):
        self.log('preparing login...')
        # The hidden _xsrf field must be sent back together with the credentials
        xsrf = Selector(response).xpath('//div[@data-za-module="SignInForm"]//form//input[@name="_xsrf"]/@value').extract()[0]
        self.log(xsrf)
        return FormRequest("https://www.zhihu.com/login/email", meta={'cookiejar': response.meta['cookiejar']},
                           headers=self.headers,
                           formdata={
                               '_xsrf': xsrf,
                               'password': 'your-password',  # replace with your own password
                               'email': 'your-email',  # replace with your own email address
                               'remember_me': 'true',
                           },
                           callback=self.after_login,
                           )
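The token lookup above can be tried offline. The following sketch uses only the standard library's html.parser instead of Scrapy's Selector, and the sample markup is a hypothetical stand-in for the real login form; the class name XsrfParser is also made up for this example.

```python
from html.parser import HTMLParser


class XsrfParser(HTMLParser):
    """Collect the value of the first <input name="_xsrf"> element."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "_xsrf" and self.token is None:
            self.token = attributes.get("value")


# Hypothetical markup standing in for the login form on the home page.
SAMPLE_HTML = """
<div data-za-module="SignInForm">
  <form>
    <input type="hidden" name="_xsrf" value="0123abcd">
    <input type="text" name="email">
  </form>
</div>
"""

parser = XsrfParser()
parser.feed(SAMPLE_HTML)

if parser.token is None:
    # Fail with a clear message instead of an IndexError further down.
    raise ValueError("no _xsrf field found; the page layout may have changed")
print(parser.token)
```

In the spider itself, extract()[0] raises an IndexError when the field is missing. Scrapy's extract_first() returns None in that case, which makes it easier to detect a changed page layout.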

4. Crawling after a successful login. First, a description of what the crawler does in this example:

The starting page is a collection named 生活文体艺术见闻, which holds more than nine hundred entries.

[Screenshot: the collection page]

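From this page the crawler collects the link to each question, and from each question page it extracts the title, link, view count, follower count, and so on, as circled in red in the screenshot below.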

[Screenshot: a question page with the extracted fields circled in red]

There is one problem: the collection page shows only 10 questions, and the rest are on further pages. How should that be handled?

This link takes care of it:

start_url_base = 'https://www.zhihu.com/collection/27915947?page='  # paginated collection URL

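The number after page= selects the page. Once the questions on the first page have been crawled, changing the number at the end of the URL fetches the next page of questions.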
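As a minimal sketch of this idea, the page URLs can be generated up front. The helper name page_urls is hypothetical, and the page count of 91 is the figure the article reports for this collection, which may change over time.

```python
START_URL_BASE = "https://www.zhihu.com/collection/27915947?page="
LAST_PAGE = 91  # page count reported in the article; it may change over time


def page_urls(base, last_page):
    """Return the URL of every page from 1 to last_page inclusive."""
    return [base + str(page) for page in range(1, last_page + 1)]


urls = page_urls(START_URL_BASE, LAST_PAGE)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Note that range excludes its upper bound, which is why the helper adds 1 to the last page number.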

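5. After logging in, request each page of the collection (the pages that list ten questions each). The collection has 91 pages in total, so the loop runs over pages 1 to 91.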

    # After logging in, build each page URL and request it with the session cookies
    def after_login(self, response):
        # Create the CSV file and write the header row
        with open('items.csv', 'w', newline='') as csvfile:
            writer = csv.writer(csvfile, dialect='excel')
            writer.writerow(['Title', 'URL', 'Followers', 'Views'])
        for page in range(1, 92):
            url = self.start_url_base + str(page)
            yield Request(url, meta={'cookiejar': 1}, headers=self.headers, callback=self.request_question)

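6. Parse each collection page to get the question links, then visit each question page to extract the useful fields. I prefer BeautifulSoup to XPath; a response can be parsed with BeautifulSoup like this: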

soup = BeautifulSoup(response.body, 'lxml')

    # Parse a collection page and follow each question link
    def request_question(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        for urlDiv in soup.find_all('div', class_='zm-item'):
            url = self.url_base + urlDiv.find('a').get('href')
            yield Request(url, meta={'cookiejar': 1}, headers=self.headers, callback=self.parse_question)
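Concatenating url_base and href works as long as every href is a site-relative path. A slightly more defensive variant, sketched here with the standard library's urljoin, also copes with links that are already absolute. The href values and the helper name absolute_url are hypothetical examples.

```python
from urllib.parse import urljoin

URL_BASE = "https://www.zhihu.com"


def absolute_url(href):
    """Resolve href against the site root; absolute links pass through unchanged."""
    return urljoin(URL_BASE, href)


# Hypothetical href values of the kind a collection page might contain.
relative = absolute_url("/question/12345678")
already_absolute = absolute_url("https://www.zhihu.com/question/87654321")

print(relative)
print(already_absolute)
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution against the URL of the current response.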

7. The last step: open each question page, extract the useful fields, and save them to the CSV file.

    # Extract the useful fields from a question page
    def parse_question(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        item = ZhihuItem()
        item['questionTitle'] = soup.find('h1').string
        item['url'] = response.url
        followDiv = soup.find('div', class_='NumberBoard QuestionFollowStatus-counts')
        # The first value is the follower count, the second the view count
        item['follow'] = followDiv.find_all('div', class_='NumberBoard-value')[0].string
        item['page_view'] = followDiv.find_all('div', class_='NumberBoard-value')[1].string
        # Append the question's fields to the CSV file
        with open('items.csv', 'a', newline='') as csvfile:
            writer = csv.writer(csvfile, dialect='excel')
            writer.writerow([item['questionTitle'], item['url'], item['follow'], item['page_view']])
        return item
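The CSV handling follows a simple pattern: create the file once with a header row, then reopen it in append mode for every record. The sketch below isolates that pattern with the standard library only. The function names, the temporary file location, the explicit UTF-8 encoding, and the sample row are assumptions made for this example.

```python
import csv
import os
import tempfile

FIELDS = ["Title", "URL", "Followers", "Views"]


def create_csv(path):
    """Create (or truncate) the file and write the header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, dialect="excel").writerow(FIELDS)


def append_row(path, row):
    """Append one record without touching the existing rows."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, dialect="excel").writerow(row)


def read_rows(path):
    """Read the whole file back as a list of rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "items.csv")
create_csv(path)
append_row(path, ["Example question?", "https://www.zhihu.com/question/1", "12", "345"])
rows = read_rows(path)
print(rows)
```

Since parse_question already returns the item, another option is to let Scrapy's feed export write the file by running scrapy crawl login -o items.csv, which would make the manual CSV code unnecessary.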

8. On the command line, run the crawler with scrapy crawl login. The results look like this:

[Screenshot: the crawl results]

However, this does not solve Zhihu's security verification: when too many pages are crawled too quickly, a captcha has to be entered.
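One thing that may help is slowing the crawl down. The fragment below is a sketch of throttling options for the project's settings.py. The setting names are standard Scrapy settings, but the values are untested guesses, and there is no guarantee that they keep the captcha from appearing.

```python
# settings.py (fragment): slow the crawl down; all values are illustrative guesses

DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the actual delay around DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # send only one request at a time to a domain

AUTOTHROTTLE_ENABLED = True         # adapt the delay to the observed response latency
AUTOTHROTTLE_START_DELAY = 5        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60         # upper bound for the delay in seconds
```

With 91 collection pages and more than nine hundred question pages, a delay of a couple of seconds means the crawl takes on the order of half an hour or more instead of a few minutes.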

(P.S. Can anyone tell me how to insert code on Jianshu? ``` doesn't work at all!!!)

    Original author: 卩丨ar丶倪儿彡
    Original URL: https://www.jianshu.com/p/5fc189fed200