Simulating Login to the New Zhihu with Scrapy

July 31, 2023

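I'm writing this because Zhihu's login has been redesigned, and the new login differs quite a bit from the old one. The new login POST request drops some fields and adds others, and the new fields, such as signature, get their values from an algorithm, which makes them harder to handle. So here is a record of my own login process: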

Step 1

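Because login is required, override Scrapy's entry function, start_requests. Its job is to check whether the current login requires a captcha and to obtain the cookie. (The old Zhihu login visited the login home page directly to get the _xsrf field. The new version no longer needs that field, so there is no need to request the login page URL.)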

    def start_requests(self):
        yield scrapy.Request('https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                       headers=self.headers, callback=self.is_need_capture)
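
The callback for this request has to decide whether a captcha is required. Here is a minimal sketch, assuming the endpoint returns a JSON body with a boolean `show_captcha` field. That field name is an assumption and is not stated in this post, and `needs_captcha` is a hypothetical helper:

```python
import json


def needs_captcha(body):
    """Return True if the captcha endpoint says a captcha is required.

    Assumption: the response body is JSON such as {"show_captcha": true}.
    """
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    data = json.loads(body)
    # Treat a missing field as "no captcha needed".
    return bool(data.get("show_captcha", False))
```

Inside `is_need_capture` this could be called as `needs_captcha(response.body)`.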

Step 2

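Request the captcha image and download it locally. (Note: some captchas ask you to click the upside-down characters. For those you only need to work out the approximate coordinate range of each character and send those coordinates along.)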

    def is_need_capture(self, response):
        yield scrapy.Request('https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000),
                             headers=self.headers, callback=self.capture, meta={"resp": response})
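
Saving the image could look roughly like the sketch below. `save_captcha` and the file name are hypothetical, and the placeholder bytes stand in for `response.body`, which the `capture` callback would pass in the spider:

```python
import os
import tempfile


def save_captcha(image_bytes, path):
    """Write the raw captcha image bytes to `path` and return the path."""
    with open(path, "wb") as f:
        f.write(image_bytes)
    return path


# Demo with placeholder bytes; a real callback would pass response.body.
demo_path = os.path.join(tempfile.gettempdir(), "captcha_demo.gif")
save_captcha(b"GIF89a", demo_path)
```

Once the file is on disk you can open it, read the characters, and type them in.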

Step 3

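In the browser's debugging mode, log in with a wrong password to capture the POST parameters that the login request carries, shown below. signature is generated by an algorithm, timestamp is a timestamp, and so on. Finally, send the login request with scrapy.FormRequest. Every request in this process must carry the headers, otherwise it fails with an error. The source code link is at the bottom.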

        post_data = {
            "client_id": clientId,
            "username": "your_phone_number",  # placeholder: your own account
            "password": "your_password",      # placeholder: your own password
            "grant_type": grantType,
            "source": source,
            "timestamp": timestamp,
            "signature": self.get_signature(grantType, clientId, source, timestamp),  # compute the signature
            "lang": "cn",
            "ref_source": "homepage",
            "captcha": self.get_captcha(need_cap),  # get the image captcha
            "utm_source": ""
        }

        return [scrapy.FormRequest(
            url="https://www.zhihu.com/api/v3/oauth/sign_in",
            formdata=post_data,
            headers=self.headers,
            callback=self.check_login
        )]
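
The post does not show how the signature and timestamp are computed. The sketch below shows one plausible shape, assuming an HMAC-SHA1 digest over the four fields in the order they are passed to `get_signature`. The hash choice, the field order, and the `secret` parameter are all assumptions for illustration, so check them against the linked source before relying on them:

```python
import hashlib
import hmac
import time


def make_timestamp():
    """Current time in milliseconds, returned as a string of digits."""
    return str(int(time.time() * 1000))


def get_signature(grant_type, client_id, source, timestamp, secret):
    """HMAC-SHA1 hex digest over the four fields, fed in this order.

    Assumption: HMAC-SHA1 and the field order are illustrative only;
    `secret` is a placeholder for the key the site actually uses.
    """
    mac = hmac.new(secret.encode("utf-8"), digestmod=hashlib.sha1)
    for part in (grant_type, client_id, source, timestamp):
        mac.update(part.encode("utf-8"))
    return mac.hexdigest()
```

Unlike the method call in the form data, this helper takes the key as an explicit fifth argument so that it stays self-contained. Returning the timestamp as a string keeps every form value a string.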

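At this point, request a page that requires login, such as the personal home page, and read the account's information from it. That completes the login.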

    def check_login(self, response):
        # Verify that the login succeeded
        yield scrapy.Request('https://www.zhihu.com/inbox', headers=self.headers)

Source code

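Finally, thanks to weldon2010 on GitHub, who implemented a simulated Zhihu login with the requests library (https://github.com/weldon2010/Python/blob/master/login_zhihu.py). This article adapts that code to the Scrapy framework, because I needed to crawl Zhihu content. I hope it helps!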