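For a login flow built on the requests library, search Baidu for "python模拟登陆新版知乎"; selenium-based versions can be found the same way.
Preparation
- Get the URL that the login data is POSTed to
- Get the cookie required for login
- Get the login parameters
1. Getting the login POST URL is simple: open the browser's developer tools, enter a wrong account and password on the login page, and the Request URL of the sign_in request is the target address.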
https://www.zhihu.com/api/v3/oauth/sign_in
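2. After entering the wrong account and password, a captcha?lang=en request also appears next to sign_in; get its URL the same way. Requesting this URL is what sets the capsion_ticket cookie that login requires.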
https://www.zhihu.com/api/v3/oauth/captcha?lang=en
3. Login parameters: go back to the sign_in request. Chrome's developer tools show the following:
Request URL:https://www.zhihu.com/api/v3/oauth/sign_in
Request Method:POST
Status Code:401
Remote Address:211.159.244.190:443
Referrer Policy:no-referrer-when-downgrade
access-control-allow-credentials:true
access-control-allow-headers:
access-control-allow-methods:GET,PATCH,PUT,POST,DELETE,OPTIONS
access-control-allow-origin:https://www.zhihu.com
content-encoding:gzip
content-length:115
content-type:application/json; charset=utf-8
date:Sun, 25 Mar 2018 16:00:11 GMT
server:nginx
status:401
vary:Accept-Encoding
www-authenticate:Bearer realm="zhihu"
x-backend-server:zhihu-web.zapi-account.477c51df---10.64.194.2:31015[10.64.194.2:31015]
x-req-id:65962D55AB7C78A
x-req-ssl:proto=TLSv1.2,sni=www.zhihu.com,cipher=ECDHE-RSA-AES256-GCM-SHA384
x-za-experiment:default:None,ge3:ge3_9,ge2:ge2_1,SE_I:c,nwebQAGrowth:experiment,is_office:false,nweb_growth_people:default,info:0,is_show_unicom_free_entry:unicom_free_entry_off,biu:0,app_store_rate_dialog:close,android_profile_panel:panel_b,live_store:ls_a2_b1_c1_f2,nweb_search:nweb_search_heifetz,new_live_feed_mediacard:new,hybrid_zhmore_video:yes,new_mobile_column_appheader:new_header,enable_tts_play:post,rt:y,growth_search:s2,qrcode_login:qrcode,qaweb_related_readings_content_control:close,rows:1,biua:0,android_pass_through_push:all,new_mobile_app_header:true,enable_vote_down_reason_menu:enable,u_re:0,android_db_recommend_action:open,zcm-lighting:zcm,android_db_feed_hash_tag_style:button,mobile_feed_guide:button,is_new_noti_panel:no,wechat_share_modal:wechat_share_modal_show,nweb_search_suggest:default,growth_banner:default
x-za-response-id:7cd5b62dda1da2b9142991f2f07bee0c
:authority:www.zhihu.com
:method:POST
:path:/api/v3/oauth/sign_in
:scheme:https
accept:application/json, text/plain, */*
accept-encoding:gzip, deflate, br
accept-language:zh-CN,zh;q=0.9,en;q=0.8
authorization:oauth c3cef7c66a1843f8b3a9e6a1e3160e20
content-length:1229
content-type:multipart/form-data; boundary=----WebKitFormBoundaryNEOiAmJV7kWT8DkJ
cookie:__DAYU_PP=vIaiqBY2aV7uRIjBNZIJ56ea6e9077b8; _xsrf=7675120f-2bfa-4b1f-a708-41ed0a4cbdf4; q_c1=fc333f7760fe4a38be92609bf166d219|1521907926000|1521907926000; _zap=b531b644-4f98-435b-a879-500803fefaf1; __utma=155987696.1937086906.1521908507.1521908507.1521908507.1; __utmc=155987696; __utmz=155987696.1521908507.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); d_c0="ALCujRVJVg2PTqqMZytO05gVzbHV-etkt7g=|1521908820"; capsion_ticket="2|1:0|10:1521993599|14:capsion_ticket|44:ZTJmNmQ0YzBlYjNmNDY0Y2IyMzBmMmVjZDgyZDhhNmI=|96a1703f8665b4746725006ac4aacba50ad3d8e0c71048011700b8de14e5ec66"
origin:https://www.zhihu.com
referer:https://www.zhihu.com/signup?next=%2F
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36
x-udid:ALCujRVJVg2PTqqMZytO05gVzbHV-etkt7g=
x-xsrftoken:7675120f-2bfa-4b1f-a708-41ed0a4cbdf4
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="client_id"
c3cef7c66a1843f8b3a9e6a1e3160e20
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="grant_type"
password
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="timestamp"
1521993604811
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="source"
com.zhihu.web
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="signature"
de35ff024d7152d39503fa099180afc2720db403
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="username"
+8613250079979
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="password"
admin123
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="captcha"
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="lang"
en
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="ref_source"
homepage
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="utm_source"
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ--
That is a lot of output, but only a few parts of it matter:
1. authorization in the Request Headers
2. All of the parameters under Request Payload:
Converted to key-value pairs, this is:
{
    'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
    'grant_type': 'password',
    ......
}
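The conversion from the captured payload to key-value pairs can also be done mechanically. The following is a minimal sketch, not part of the original code: parse_form_dump is a hypothetical helper, and it assumes the flattened layout that DevTools shows here (a boundary line, a Content-Disposition line, then at most one value line). The sample is a trimmed excerpt of the captured payload.

```python
def parse_form_dump(dump: str) -> dict:
    """Turn a copied multipart/form-data dump into a plain dict.

    Hypothetical helper: assumes each field is a Content-Disposition line
    followed by at most one value line, with boundary lines in between.
    """
    fields = {}
    name = None
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("Content-Disposition:"):
            # Take the text between name=" and the closing quote.
            name = line.split('name="', 1)[1].split('"', 1)[0]
            fields[name] = ""  # stays empty if no value line follows
        elif line.startswith("------"):
            name = None  # boundary line: the current field is finished
        elif name is not None and line:
            fields[name] = line
    return fields


# Trimmed excerpt of the captured request payload.
sample = """\
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="client_id"
c3cef7c66a1843f8b3a9e6a1e3160e20
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="grant_type"
password
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="captcha"
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ--
"""

post_data = parse_form_dump(sample)
print(post_data)
```

Fields with no value line, such as captcha, come out as empty strings, which matches how they are sent in the captured request.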
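Of these, signature is the hardest to track down. In Chrome, press Ctrl+Shift+F to search all loaded sources for 'signature'; it turns up in a JS file at
https://static.zhihu.com/heifetz/main.app.327d25e7f280cfb582a1.js
Open that file in an IDE and search for signature. The generation logic is as follows: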
function r(e, t) {
var n = Date.now(), r = new a.a("SHA-1", "TEXT");
return r.setHMACKey("d1b964811afb40118a12068ff74a12f4", "TEXT"), r.update(e), r.update(i.a), r.update("com.zhihu.web"), r.update(String(n)), s({
clientId: i.a, grantType: e, timestamp: n, source: "com.zhihu.web",
signature: r.getHMAC("HEX")
}, t)
}
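The result is an HMAC-SHA1 digest computed over grantType, clientId, "com.zhihu.web" and timestamp, in that order, with 'd1b964811afb40118a12068ff74a12f4' as the key.
The corresponding Python code:
import hashlib
import hmac
import time

client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
grant_type = 'password'
timestamp = str(round(time.time() * 1000))  # milliseconds, like Date.now()
source = 'com.zhihu.web'

def get_signature():
    # The update order must match the JavaScript above:
    # grant_type, client_id, source, timestamp.
    hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, hashlib.sha1)
    hm.update(grant_type.encode())
    hm.update(client_id.encode())
    hm.update(source.encode())
    hm.update(timestamp.encode())
    return hm.hexdigest()
Note that the order must be grant_type, then client_id, then source, then timestamp; any other order produces a different digest.
The other fields are fixed; generate timestamp as in the code above.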
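Why the order matters can be seen from how HMAC treats successive update() calls: they are equivalent to hashing the concatenation of the pieces. The sketch below is illustrative only; sign, sign_incremental and sign_swapped are hypothetical names, and the key, client_id and timestamp are the values from the captured request and the JavaScript function r. Whether the result reproduces the captured signature is something to check by running it, not something claimed here.

```python
import hashlib
import hmac

# Values taken from the captured request and the JavaScript function r.
HMAC_KEY = b"d1b964811afb40118a12068ff74a12f4"
CLIENT_ID = "c3cef7c66a1843f8b3a9e6a1e3160e20"
SOURCE = "com.zhihu.web"


def sign(grant_type: str, timestamp: str) -> str:
    """One-shot HMAC-SHA1 over grant_type + client_id + source + timestamp."""
    message = grant_type + CLIENT_ID + SOURCE + timestamp
    return hmac.new(HMAC_KEY, message.encode(), hashlib.sha1).hexdigest()


def sign_incremental(grant_type: str, timestamp: str) -> str:
    """Same digest, built with successive update() calls as in the JavaScript."""
    hm = hmac.new(HMAC_KEY, None, hashlib.sha1)
    for part in (grant_type, CLIENT_ID, SOURCE, timestamp):
        hm.update(part.encode())
    return hm.hexdigest()


def sign_swapped(grant_type: str, timestamp: str) -> str:
    """Deliberately wrong order (timestamp before source), for comparison."""
    message = grant_type + CLIENT_ID + timestamp + SOURCE
    return hmac.new(HMAC_KEY, message.encode(), hashlib.sha1).hexdigest()


captured_timestamp = "1521993604811"
captured_signature = "de35ff024d7152d39503fa099180afc2720db403"

digest = sign("password", captured_timestamp)
print(digest == sign_incremental("password", captured_timestamp))  # same message
print(digest == sign_swapped("password", captured_timestamp))      # different message
print(digest == captured_signature)  # check against the captured request
```

Replaying the captured timestamp like this is a convenient way to validate a reimplementation before using a fresh timestamp.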
scrapy:
import hashlib
import hmac
import json
import time

import scrapy


class ZhihuLoginSpider(scrapy.Spider):
    name = 'zhihu_login'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    # Declare the form fields
    client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
    grant_type = 'password'
    timestamp = str(round(time.time() * 1000))
    source = 'com.zhihu.web'
    captcha = ""
    lang = 'en'
    ref_source = "homepage"
    utm_source = ""

    # Note: the headers must include authorization
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36",
        "authorization": f"oauth {client_id}"
    }

    # Compute the signature
    def get_signature(self):
        hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, hashlib.sha1)
        hm.update(self.grant_type.encode())
        hm.update(self.client_id.encode())
        hm.update(self.source.encode())
        hm.update(self.timestamp.encode())
        return hm.hexdigest()

    def parse(self, response):
        pass

    # Override start_requests so the spider first GETs
    # https://www.zhihu.com/api/v3/oauth/captcha?lang=en to obtain the
    # capsion_ticket cookie; login fails without it.
    def start_requests(self):
        return [scrapy.Request(url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                               callback=self.call_data, headers=self.header)]

    # The actual simulated login: POST the form fields
    def call_data(self, response):
        # show_captcha says whether a captcha must be filled in:
        # true means it is required, false means it is not.
        print(json.loads(response.text)["show_captcha"])
        post_data = {
            'client_id': self.client_id,
            'grant_type': self.grant_type,
            'timestamp': self.timestamp,
            'source': self.source,
            'captcha': self.captcha,
            'signature': self.get_signature(),
            'username': 'your own username',
            'password': 'your own password',
            'lang': self.lang,
            'ref_source': self.ref_source,
            'utm_source': self.utm_source
        }
        return scrapy.FormRequest(url='https://www.zhihu.com/api/v3/oauth/sign_in',
                                  formdata=post_data, headers=self.header,
                                  callback=self.login_callback)

    # After a successful login, request the Zhihu home page directly;
    # when logged in, it returns the corresponding data.
    def login_callback(self, response):
        return scrapy.Request(url='http://www.zhihu.com', headers=self.header,
                              callback=self.login_callback1)

    # The logged-in data arrives in this response
    def login_callback1(self, response):
        pass
There is no captcha handling here; readers are welcome to add it themselves.
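A first step toward that captcha handling is deciding whether a captcha is required at all, based on the show_captcha field that the captcha endpoint returns. The following is only a sketch: needs_captcha is a hypothetical helper, and treating a missing field as "no captcha" is an assumption rather than documented behaviour.

```python
import json


def needs_captcha(body: str) -> bool:
    """Return True when the captcha endpoint's JSON asks for a captcha.

    Hypothetical helper: `body` is the response text of the
    captcha?lang=en request made before login.
    """
    data = json.loads(body)
    # Assumption: a missing show_captcha field means no captcha is needed.
    return bool(data.get("show_captcha", False))


print(needs_captcha('{"show_captcha": false}'))
print(needs_captcha('{"show_captcha": true}'))
```

In the spider, call_data could call needs_captcha(response.text) and only proceed with the login POST when it returns False, leaving the captcha-solving branch to be filled in.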