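Step 1: Capture the traffic
Open Fiddler and have it listen on the browser's proxy port
- Start capturing
Douban login home page
- Find the login API
Login request headers
- Login request form submission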
3.1 Form data (no CAPTCHA)
3.2 Form data (with CAPTCHA)
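When the login page presents a CAPTCHA, the submitted form data contains two extra fields:
"captcha-solution": the CAPTCHA answer typed by the user
"captcha-id": the CAPTCHA ID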
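The difference between the two form payloads can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `build_login_form` and its parameters are hypothetical, while the field names are the ones seen in the captured request.

```python
def build_login_form(email, password, captcha_solution=None, captcha_id=None):
    """Return the login form fields as a dict of strings.

    Hypothetical helper: the field names come from the captured request,
    everything else is an assumption made for illustration.
    """
    formdata = {
        "source": "index_nav",
        "redir": "https://www.douban.com",
        "form_email": email,
        "form_password": password,
        "login": "登录",  # value of the submit button, sent as-is
    }
    # The two extra fields are added only when the page served a CAPTCHA.
    if captcha_id is not None:
        formdata["captcha-solution"] = captcha_solution or ""
        formdata["captcha-id"] = captcha_id
    return formdata


plain = build_login_form("user@example.com", "secret")
with_captcha = build_login_form("user@example.com", "secret", "abcd", "id123")
# The only difference between the two payloads is the pair of CAPTCHA fields.
print(sorted(set(with_captcha) - set(plain)))  # ['captcha-id', 'captcha-solution']
```

Building the dict in one place avoids keeping two nearly identical copies of the form in sync.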
- Login succeeds; extract the account information
Visit a page that is only accessible after logging in
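One way to tell whether the login worked is to look for the account name in the navigation link of the returned page, which is what the spider's `//a[@class="bn-more"]/span/text()` XPath does. The sketch below reproduces that check with only the standard library so it can be tried without Scrapy. The class `AccountNameParser`, the function `extract_account_name`, and the sample HTML are all hypothetical, and the parser is slightly looser than the XPath because it accepts any `<span>` inside the link rather than only direct children.

```python
from html.parser import HTMLParser


class AccountNameParser(HTMLParser):
    """Collect the text of the first non-empty <span> inside <a class="bn-more">."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._in_span = False
        self.account_name = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "bn-more":
            self._in_link = True
        elif tag == "span" and self._in_link:
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Keep only the first non-empty text found inside the span.
        if self._in_span and self.account_name is None and data.strip():
            self.account_name = data.strip()


def extract_account_name(html):
    """Return the account name, or None if the page looks logged out."""
    parser = AccountNameParser()
    parser.feed(html)
    return parser.account_name


# Made-up page fragments, used only to exercise the parser.
logged_in = '<a class="bn-more" href="#"><span>demo_user</span><span></span></a>'
logged_out = '<a class="nav-login" href="#">Log in</a>'
print(extract_account_name(logged_in))
print(extract_account_name(logged_out))
```

A `None` result corresponds to the "login failed" branch in the spider.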
Scrapy code
1. spider.py
# -*- coding: utf-8 -*-
import scrapy
import urllib.request
from PIL import Image


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    login_url = "https://accounts.douban.com/login"
    start_urls = [login_url]

    def parse(self, response):
        # Both values are None when the login page shows no CAPTCHA
        img_link = response.xpath('//img[@id="captcha_image"]/@src').extract_first()
        captcha_id = response.xpath('//input[@name="captcha-id"]/@value').extract_first()
        if img_link is None:
            print("No CAPTCHA on the login page...")
            formdata = {
                "source": "index_nav",
                "redir": "https://www.douban.com",
                "form_email": "smartliu_it@163.com",
                "form_password": "xxxxxxxxxx",  # password
                "login": "登录",
            }
        else:
            print("CAPTCHA required on the login page...")
            # Where to store the CAPTCHA image
            img_path = "/home/python/Desktop/douBan-豆瓣登录/douBan/spiders/douban.jpg"
            # First argument is the URL, second is the destination path
            urllib.request.urlretrieve(img_link, img_path)
            try:
                # Open the CAPTCHA image automatically
                im = Image.open(img_path)
                im.show()
            except Exception:
                print("Failed to open the image...")
            captcha_solution = input("Enter the CAPTCHA: ")
            formdata = {
                "source": "index_nav",
                "redir": "https://www.douban.com",
                "form_email": "smartliu_it@163.com",
                "form_password": "xxxxxxxxxx",  # password
                "login": "登录",
                "captcha-solution": captcha_solution,
                "captcha-id": captcha_id,
            }
        print("Logging in...")
        return scrapy.FormRequest(self.login_url, formdata=formdata, callback=self.login_after)

    def login_after(self, response):
        # The account name appears in the navigation bar only after a successful login
        r = response.xpath('//a[@class="bn-more"]/span/text()').extract_first()
        if r is None:
            print("Login failed!")
        else:
            print("Login succeeded! Current account: %s" % r)
2. settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36'
# ROBOTSTXT_OBEY = True
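The same configuration can be written more explicitly. The fragment below is a sketch rather than the author's file. `ROBOTSTXT_OBEY = False` spells out what commenting the line out appears to achieve, since Scrapy's documented default is `False` and it is the project template that sets it to `True`. `COOKIES_ENABLED` and `COOKIES_DEBUG` are standard Scrapy settings added here as an assumption, on the reasoning that the login only persists through the session cookie.

```python
# settings.py (explicit variant, for illustration only)

# Same browser-like user agent as in the original settings
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36'
)

# State the intent explicitly instead of relying on the commented-out line
ROBOTSTXT_OBEY = False

# Assumption: keep cookies on (the Scrapy default) so the session survives the login
COOKIES_ENABLED = True

# Assumption: log Cookie / Set-Cookie headers while debugging the login flow
COOKIES_DEBUG = True
```

With `COOKIES_DEBUG` enabled, the crawl log shows the cookies sent and received, which makes it easier to confirm that the session cookie is set after the form is posted.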