web-crawler – Scraping a website that requires authentication

August 3, 2019 · 248 views

How can I write a simple script (in cURL / python / ruby / bash / perl / java) that logs in to okcupid and records how many messages I receive each day?

The output would look like this:

1/21/2011    1 messages
1/22/2011    0 messages
1/23/2011    2 messages
1/24/2011    1 messages

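The main problem is that I have never written a web crawler before. I don't know how to log in programmatically to a site like okcupid. How do I stay authenticated while loading different pages? And so on.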

Once I have access to the raw HTML, I can handle the rest with regular expressions, maps, and so on.

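Best answer: Here is a solution using cURL that downloads the first page of the inbox. A complete solution would repeat the last step for each page of messages. $USERNAME and $PASSWORD need to be filled in with your own credentials.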

#!/bin/sh

## Initialize the cookie-jar
curl --cookie-jar cjar --output /dev/null https://www.okcupid.com/login

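## Log in and save the resulting HTML as loginResult.html (for debugging purposes).
## Double quotes let the shell expand $USERNAME and $PASSWORD; --data-urlencode
## escapes any special characters in their values.
curl --cookie cjar --cookie-jar cjar \
  --data 'dest=/?' \
  --data-urlencode "username=$USERNAME" \
  --data-urlencode "password=$PASSWORD" \
  --location \
  --output loginResult.html \
  https://www.okcupid.com/login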

## Download the inbox and save it as inbox.html
curl --cookie cjar \
  --output inbox.html \
  https://www.okcupid.com/messages
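
The answer notes that a complete solution would repeat the download step for every page of messages. A minimal sketch of that loop follows. It assumes the inbox is paged through an offset query parameter, here called `low`, and that the number of pages and the page size are known in advance. The parameter name, the helper names `page_url` and `fetch_pages`, and the `inbox_N.html` file names are all hypothetical and would need to be checked against the real site.

```shell
#!/bin/sh

# page_url BASE PAGE_INDEX PAGE_SIZE
# Build the URL of one inbox page. The "low" offset parameter is a
# hypothetical placeholder, not a confirmed okcupid parameter.
page_url() {
  printf '%s?low=%s\n' "$1" "$(( $2 * $3 + 1 ))"
}

# fetch_pages BASE PAGE_COUNT PAGE_SIZE
# Download each page with the cookie jar written by the login step,
# saving the results as inbox_0.html, inbox_1.html, and so on.
fetch_pages() {
  i=0
  while [ "$i" -lt "$2" ]; do
    curl --silent --cookie cjar \
      --output "inbox_$i.html" \
      "$(page_url "$1" "$i" "$3")"
    i=$(( i + 1 ))
  done
}
```

After logging in, a call such as `fetch_pages https://www.okcupid.com/messages 3 30` would request three pages of 30 messages each. Because every request reuses the same `cjar` file, the session stays authenticated across pages.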

The cookie-jar login technique is explained in a video tutorial about cURL.
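
To get from the saved HTML to the log format shown in the question, something like the following sketch could work. It assumes each message in the page carries a recognizable marker. The string `class="message"` is a hypothetical stand-in, so inspect inbox.html to find the real one. The helper names `count_messages` and `log_line` are also illustrative. Note that this counts the messages present on a saved page; restricting the count to messages received on a given day would require parsing their timestamps, which is left out here.

```shell
#!/bin/sh

# count_messages FILE
# Count occurrences of a per-message marker in a saved HTML file.
# 'class="message"' is a hypothetical marker; replace it with whatever
# actually identifies one message in inbox.html.
count_messages() {
  grep -o 'class="message"' "$1" | wc -l | tr -d ' '
}

# log_line COUNT [DATE]
# Print one log entry. DATE defaults to today in MM/DD/YYYY form, which
# is zero-padded, unlike the sample output in the question.
log_line() {
  printf '%s    %s messages\n' "${2:-$(date +%m/%d/%Y)}" "$1"
}
```

Run once a day after the download step, `log_line "$(count_messages inbox.html)" >> messages.log` would append one line per day to a log file.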