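Logging in to the new Zhihu with Scrapy

For a requests-based login, search for "simulating login to the new Zhihu with Python"; Selenium versions exist as well.

Preparation

  • Find the URL that the login data is POSTed to
  • Obtain the cookie required for login
  • Collect the login parameters

1. Finding the login POST URL is easy: open the browser's developer tools, enter a wrong username and password on the login page, and the Request URL of the sign_in request is the target address.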

https://www.zhihu.com/api/v3/oauth/sign_in

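2. With the same wrong credentials, a captcha?lang=en request appears alongside sign_in; get its URL the same way. A GET to this URL sets the capsion_ticket cookie that the login needs.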

https://www.zhihu.com/api/v3/oauth/captcha?lang=en

3. For the login parameters, go back to the sign_in request. Chrome's developer tools show the following:

Request URL:https://www.zhihu.com/api/v3/oauth/sign_in
Request Method:POST
Status Code:401 
Remote Address:211.159.244.190:443
Referrer Policy:no-referrer-when-downgrade
access-control-allow-credentials:true
access-control-allow-headers:
access-control-allow-methods:GET,PATCH,PUT,POST,DELETE,OPTIONS
access-control-allow-origin:https://www.zhihu.com
content-encoding:gzip
content-length:115
content-type:application/json; charset=utf-8
date:Sun, 25 Mar 2018 16:00:11 GMT
server:nginx
status:401
vary:Accept-Encoding
www-authenticate:Bearer realm="zhihu"
x-backend-server:zhihu-web.zapi-account.477c51df---10.64.194.2:31015[10.64.194.2:31015]
x-req-id:65962D55AB7C78A
x-req-ssl:proto=TLSv1.2,sni=www.zhihu.com,cipher=ECDHE-RSA-AES256-GCM-SHA384
x-za-experiment:default:None,ge3:ge3_9,ge2:ge2_1,SE_I:c,nwebQAGrowth:experiment,is_office:false,nweb_growth_people:default,info:0,is_show_unicom_free_entry:unicom_free_entry_off,biu:0,app_store_rate_dialog:close,android_profile_panel:panel_b,live_store:ls_a2_b1_c1_f2,nweb_search:nweb_search_heifetz,new_live_feed_mediacard:new,hybrid_zhmore_video:yes,new_mobile_column_appheader:new_header,enable_tts_play:post,rt:y,growth_search:s2,qrcode_login:qrcode,qaweb_related_readings_content_control:close,rows:1,biua:0,android_pass_through_push:all,new_mobile_app_header:true,enable_vote_down_reason_menu:enable,u_re:0,android_db_recommend_action:open,zcm-lighting:zcm,android_db_feed_hash_tag_style:button,mobile_feed_guide:button,is_new_noti_panel:no,wechat_share_modal:wechat_share_modal_show,nweb_search_suggest:default,growth_banner:default
x-za-response-id:7cd5b62dda1da2b9142991f2f07bee0c
:authority:www.zhihu.com
:method:POST
:path:/api/v3/oauth/sign_in
:scheme:https
accept:application/json, text/plain, */*
accept-encoding:gzip, deflate, br
accept-language:zh-CN,zh;q=0.9,en;q=0.8
authorization:oauth c3cef7c66a1843f8b3a9e6a1e3160e20
content-length:1229
content-type:multipart/form-data; boundary=----WebKitFormBoundaryNEOiAmJV7kWT8DkJ
cookie:__DAYU_PP=vIaiqBY2aV7uRIjBNZIJ56ea6e9077b8; _xsrf=7675120f-2bfa-4b1f-a708-41ed0a4cbdf4; q_c1=fc333f7760fe4a38be92609bf166d219|1521907926000|1521907926000; _zap=b531b644-4f98-435b-a879-500803fefaf1; __utma=155987696.1937086906.1521908507.1521908507.1521908507.1; __utmc=155987696; __utmz=155987696.1521908507.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); d_c0="ALCujRVJVg2PTqqMZytO05gVzbHV-etkt7g=|1521908820"; capsion_ticket="2|1:0|10:1521993599|14:capsion_ticket|44:ZTJmNmQ0YzBlYjNmNDY0Y2IyMzBmMmVjZDgyZDhhNmI=|96a1703f8665b4746725006ac4aacba50ad3d8e0c71048011700b8de14e5ec66"
origin:https://www.zhihu.com
referer:https://www.zhihu.com/signup?next=%2F
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36
x-udid:ALCujRVJVg2PTqqMZytO05gVzbHV-etkt7g=
x-xsrftoken:7675120f-2bfa-4b1f-a708-41ed0a4cbdf4
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="client_id"

c3cef7c66a1843f8b3a9e6a1e3160e20
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="grant_type"

password
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="timestamp"

1521993604811
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="source"

com.zhihu.web
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="signature"

de35ff024d7152d39503fa099180afc2720db403
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="username"

+8613250079979
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="password"

admin123
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="captcha"


------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="lang"

en
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="ref_source"

homepage
------WebKitFormBoundaryNEOiAmJV7kWT8DkJ
Content-Disposition: form-data; name="utm_source"


------WebKitFormBoundaryNEOiAmJV7kWT8DkJ--

That is a lot of output, but only a few parts matter:

  1. The authorization value in the Request Headers
  2. All of the fields under Request Payload, that is, the form-data fields from client_id through utm_source at the end of the dump above

Converted to key-value pairs, they become:

{
      'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
      'grant_type': 'password',
      ......
}
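
As a rough illustration of that conversion, the sketch below parses a multipart/form-data body like the one in the dump into a Python dict. The function name parse_form_data and the shortened sample body (three of the fields only) are made up for this example; it is a simple text split, not a full multipart parser.

```python
import re

def parse_form_data(body, boundary):
    """Hypothetical helper: map each form-data field name to its value."""
    body = body.replace("\r\n", "\n")  # accept both CRLF and LF line endings
    delimiter = "--" + boundary        # each part starts with "--" + boundary
    fields = {}
    for part in body.split(delimiter):
        part = part.strip("\n")
        if not part or part == "--":   # skip the preamble and the closing "--"
            continue
        # The part headers and the value are separated by one blank line.
        header, _, value = part.partition("\n\n")
        match = re.search(r'name="([^"]*)"', header)
        if match:
            fields[match.group(1)] = value.strip("\n")
    return fields

# Shortened sample copied from the dump (boundary taken from content-type).
BOUNDARY = "----WebKitFormBoundaryNEOiAmJV7kWT8DkJ"
SAMPLE = (
    "------WebKitFormBoundaryNEOiAmJV7kWT8DkJ\n"
    'Content-Disposition: form-data; name="client_id"\n'
    "\n"
    "c3cef7c66a1843f8b3a9e6a1e3160e20\n"
    "------WebKitFormBoundaryNEOiAmJV7kWT8DkJ\n"
    'Content-Disposition: form-data; name="grant_type"\n'
    "\n"
    "password\n"
    "------WebKitFormBoundaryNEOiAmJV7kWT8DkJ\n"
    'Content-Disposition: form-data; name="captcha"\n'
    "\n"
    "\n"
    "------WebKitFormBoundaryNEOiAmJV7kWT8DkJ--\n"
)

print(parse_form_data(SAMPLE, BOUNDARY))
```

Empty fields such as captcha come out as empty strings, which is also how the spider sends them.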

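The signature field is the hardest to track down. In Chrome, press Ctrl+Shift+F to search all loaded sources for 'signature'. The match is in a JS file named
https://static.zhihu.com/heifetz/main.app.327d25e7f280cfb582a1.js; open that file in an IDE and search for signature to find the logic that generates it: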


    function r(e, t) {
        var n = Date.now(), r = new a.a("SHA-1", "TEXT");
        return r.setHMACKey("d1b964811afb40118a12068ff74a12f4", "TEXT"), r.update(e), r.update(i.a), r.update("com.zhihu.web"), r.update(String(n)), s({
            clientId: i.a, grantType: e, timestamp: n, source: "com.zhihu.web",
            signature: r.getHMAC("HEX")
        }, t)
    }

The result is a hex-encoded HMAC-SHA1 digest, computed with the key 'd1b964811afb40118a12068ff74a12f4' over grantType, clientId, "com.zhihu.web" and the timestamp, in that order.

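The corresponding Python code:

import hashlib
import hmac
import time

client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
grant_type = 'password'
timestamp = str(round(time.time() * 1000))  # milliseconds, like Date.now() in the JS
source = 'com.zhihu.web'

def get_signature():
    # Same key and same update order as the JS function above
    hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, hashlib.sha1)
    hm.update(grant_type.encode())
    hm.update(client_id.encode())
    hm.update(source.encode())
    hm.update(timestamp.encode())
    return hm.hexdigest()

Note that the order must be grant_type, then client_id, then source, then timestamp, matching the JS; any other order produces a different digest.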

The other fields are fixed values; generate timestamp as in the code above.
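
To make the ordering point concrete, here is a small sketch of the same HMAC-SHA1 computation as a standalone function. The function name sign and its parameters are hypothetical; the key and the update order follow the JS function r quoted earlier. Successive update() calls are equivalent to hashing the concatenated bytes, which is why swapping two fields changes the digest.

```python
import hashlib
import hmac
import time

KEY = b'd1b964811afb40118a12068ff74a12f4'  # HMAC key found in the JS

def sign(grant_type, client_id, source, timestamp):
    """Hypothetical helper: HMAC-SHA1 over the four fields, in the order given."""
    hm = hmac.new(KEY, None, hashlib.sha1)
    for field in (grant_type, client_id, source, timestamp):
        hm.update(field.encode())
    return hm.hexdigest()

client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
source = 'com.zhihu.web'
timestamp = str(round(time.time() * 1000))  # milliseconds, like Date.now()

signature = sign('password', client_id, source, timestamp)

# Chained update() calls hash the concatenation of the fields.
joined = ('password' + client_id + source + timestamp).encode()
one_shot = hmac.new(KEY, joined, hashlib.sha1).hexdigest()

# Passing timestamp before source feeds different bytes to the HMAC.
swapped = sign('password', client_id, timestamp, source)

print(signature == one_shot)
print(signature == swapped)
```

The timestamp sent in the form must be the same string that went into the signature, so it makes sense to compute both together.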

Scrapy:

import hashlib
import hmac
import json
import time

import scrapy


class ZhihuLoginSpider(scrapy.Spider):
    name = 'zhihu_login'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    # Fields sent with the login request
    client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
    grant_type = 'password'
    timestamp = str(round(time.time() * 1000))
    source = 'com.zhihu.web'
    captcha = ""
    lang = 'en'
    ref_source = "homepage"
    utm_source = ""
    # Note: the headers must include authorization
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36",
        "authorization": f"oauth {client_id}"
    }

    # Compute the signature for the login request
    def get_signature(self):
        hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, hashlib.sha1)
        hm.update(self.grant_type.encode())
        hm.update(self.client_id.encode())
        hm.update(self.source.encode())
        hm.update(self.timestamp.encode())
        return hm.hexdigest()

    def parse(self, response):
        pass

    # Override start_requests so the spider first GETs
    # https://www.zhihu.com/api/v3/oauth/captcha?lang=en to obtain the
    # capsion_ticket cookie; login fails without it
    def start_requests(self):
        return [scrapy.Request(url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en', callback=self.call_data,
                               headers=self.header)]

    # The actual login: POST the fields
    def call_data(self, response):
        # show_captcha tells whether a captcha is required: true means yes, false means no
        print(json.loads(response.text)["show_captcha"])
        post_data = {
            'client_id': self.client_id,
            'grant_type': self.grant_type,
            'timestamp': self.timestamp,
            'source': self.source,
            'captcha': self.captcha,
            'signature': self.get_signature(),
            'username': 'your username',
            'password': 'your password',
            'lang': self.lang,
            'ref_source': self.ref_source,
            'utm_source': self.utm_source
        }
        return scrapy.FormRequest(url='https://www.zhihu.com/api/v3/oauth/sign_in', formdata=post_data,
                                  headers=self.header, callback=self.login_callback)

    # After a successful login, request the Zhihu home page; when logged in, it returns the corresponding data
    def login_callback(self, response):
        return scrapy.Request(url='http://www.zhihu.com', headers=self.header, callback=self.login_callback1)

    # The home page response arrives here
    def login_callback1(self, response):
        pass

There is no captcha handling here; readers can add it themselves.
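
As a possible starting point, the sketch below only decides whether a captcha is needed, based on the show_captcha field that the spider prints. The helper name needs_captcha and the sample response bodies are illustrative assumptions; fetching and solving the captcha image is not covered.

```python
import json

def needs_captcha(body):
    """Hypothetical helper: True if the captcha endpoint's JSON asks for a captcha."""
    data = json.loads(body)
    # Assumption for this sketch: a missing field is treated as "no captcha".
    return bool(data.get("show_captcha", False))

# Made-up sample bodies shaped like the response the spider reads.
print(needs_captcha('{"show_captcha": false}'))
print(needs_captcha('{"show_captcha": true}'))
```

In the spider, call_data could call this with response.text and only send the login POST directly when it returns False.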

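    Original author: ChanZeeBm
    Original URL: https://www.jianshu.com/p/bf1cbabbd383
    This article is reposted from the web for the purpose of sharing knowledge; if it infringes your rights, please contact the blogger to have it removed.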