Automating the Crawl of Taobao Orders
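The target is the Taobao member login page. My earlier crawlers all logged in either through a framework or by fetching the cookies from the login page and injecting them into later requests. Taobao's anti-crawling measures make the cookies hard to reproduce, though: many of them are computed by JavaScript, so I had to read through the page scripts, which in the end only gave me a headache.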
First attempt
(1) Login
A GET request to the login page through Jsoup succeeds and returns cookies:
/**
 * Initialize the Taobao login page
 */
Response firstLoginInitResp = Jsoup.connect("https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Ftrade.taobao.com%2Ftrade%2Fitemlist%2Flist_export_order.htm")
        .header("Host", "login.taobao.com")
        .header("Connection", "keep-alive")
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
        .header("Upgrade-Insecure-Requests", "1")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36")
        .header("Accept-Encoding", "gzip, deflate, sdch")
        .header("Accept-Language", "zh-CN,zh;q=0.8")
        .execute();
Map<String, String> firstLoginInitCookies = firstLoginInitResp.cookies();
System.out.println("code: "+firstLoginInitResp.statusCode()+", msg: "+firstLoginInitResp.statusMessage()+", cookies returned by the first login request: "+firstLoginInitCookies.toString());
_tb_token_=e71873665bdae
t=7770a28456dfcad8106b11406e3bc765
cookie2=17c4314a2a5b448f59aa038202b96019
v=0
After the response comes back successfully, JavaScript dynamically adds two more cookies:
l=
isg=
Finally, I injected the cookies again and sent a form body to the login page (so that the JavaScript would set its cookies once more):
// secondLoginInitMap (the form fields) and secondLoginInitData (the encoded body) are built elsewhere; their construction is not shown here.
Response secondLoginInitResp = Jsoup.connect("https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Ftrade.taobao.com%2Ftrade%2Fitemlist%2Flist_export_order.htm%3Fpage_no%3D1")
        .header("Host", "login.taobao.com")
        .header("Connection", "keep-alive")
        .header("Content-Length", secondLoginInitData.length()+"")
        .header("Cache-Control", "max-age=0")
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
        .header("Origin", "https://login.taobao.com")
        .header("Upgrade-Insecure-Requests", "1")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36")
        .header("Content-Type", "application/x-www-form-urlencoded")
        .referrer("https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Ftrade.taobao.com%2Ftrade%2Fitemlist%2Flist_export_order.htm%3Fpage_no%3D1")
        .header("Accept-Encoding", "gzip, deflate")
        .header("Accept-Language", "zh-CN,zh;q=0.8")
        .cookies(firstLoginInitCookies)
        .data(secondLoginInitMap)
        // Jsoup defaults to GET, which would put the data in the query string; POST sends it as the request body
        .method(org.jsoup.Connection.Method.POST)
        .execute();
Map<String, String> secondLoginInitCookies = secondLoginInitResp.cookies();
System.out.println("code: "+secondLoginInitResp.statusCode()+", msg: "+secondLoginInitResp.statusMessage()+", cookies returned by the second login request: "+secondLoginInitCookies.toString());
The returned cookies were empty. I'll skip the long story; I had to switch to another approach.
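For reference, the second request above uses `secondLoginInitMap` and `secondLoginInitData` without showing how they are built. The following is only a minimal sketch of one way such a form body could be assembled with the standard library. The class name `LoginFormSketch`, the helper names, and the field names `TPL_username` and `TPL_password` (guessed from the element ids used later in this post) are all assumptions, not the real login form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LoginFormSketch {

    /** URL-encodes one form component as UTF-8. */
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    /** Encodes form fields as application/x-www-form-urlencoded (key=value&key=value). */
    public static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(field.getKey())).append('=').append(enc(field.getValue()));
        }
        return body.toString();
    }

    /** Content-Length is a byte count, so measure the encoded bytes rather than the character count. */
    public static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Hypothetical field names and values; the real form fields are not listed in this post.
        Map<String, String> secondLoginInitMap = new LinkedHashMap<>();
        secondLoginInitMap.put("TPL_username", "demo user");
        secondLoginInitMap.put("TPL_password", "p@ss&word");

        String secondLoginInitData = encodeForm(secondLoginInitMap);
        System.out.println(secondLoginInitData);
        System.out.println(contentLength(secondLoginInitData));
    }
}
```

Passing the map to Jsoup's `data(Map)` makes Jsoup do this encoding itself, so the hand-built string mainly matters for computing or inspecting the body.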
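Second attempt
This time I use the Selenium automation framework to log in automatically, then take the resulting cookies, inject them into the requests, and finish the crawl.
Because the automated login drives a real browser, make sure the Firefox, Chrome, or IE version matches the Selenium version (I used Firefox 24 and Selenium 2.40).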
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import common.DateUtil;
import common.FileUtil;
import common.Log;
/**
 * Taobao crawler
 * @author Alex
 * @date 2017-03-22
 */
public class TaobaoCrawler extends Log{
    public String login(String username, String password){
        try {
            // Create the Firefox driver (Chrome and IE need a separate driver executable and a browser plugin, and their versions must match Selenium's, which is more troublesome; look up the version compatibility)
            WebDriver webDriver = new FirefoxDriver();
            logger.info("Firefox browser started successfully...");
            webDriver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            webDriver.get("https://login.taobao.com/member/login.jhtml?redirectURL=http://trade.taobao.com/trade/itemlist/list_export_order.htm?page_no=1");
            WebElement passLoginEle= webDriver.findElement(By.xpath("//a[@class='forget-pwd J_Quick2Static' and @target='_blank' and @href='']")); // the "password login" link
            logger.info("Password login link displayed: "+passLoginEle.isDisplayed());
            passLoginEle.click(); // reveal the username/password form (simulate a click so the hidden view becomes visible)
            webDriver.findElement(By.id("TPL_username_1")).clear();
            webDriver.findElement(By.id("TPL_username_1")).sendKeys(username); // type the username
            webDriver.findElement(By.id("TPL_password_1")).clear();
            webDriver.findElement(By.id("TPL_password_1")).sendKeys(password); // type the password
            webDriver.findElement(By.id("J_SubmitStatic")).click(); // click the login button
            webDriver.switchTo().defaultContent();
            try {
                while (true) { // keep polling; once the current URL is no longer the login page URL, the browser has been redirected
                    Thread.sleep(500L);
                    if (!webDriver.getCurrentUrl().startsWith("https://login.taobao.com/member/login.jhtml")) {
                        break;
                    }
                }
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            // Collect the cookies. Leaving the loop above is treated as a successful login; that check is not strict and could be tightened
            StringBuffer cookieStr = new StringBuffer();
            Set<Cookie> cookies = webDriver.manage().getCookies();
            for (Cookie cookie : cookies) {
                cookieStr.append(cookie.getName() + "=" + cookie.getValue() + "; ");
            }
            logger.info("Account "+username+" logged in successfully");
            return cookieStr.toString();
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
            logger.info("Account "+username+" failed to log in, possibly blocked by a captcha");
            logger.error(e.getMessage());
            return null;
        }
    }
    public String getOrderUrl(String cookie){
        try {
            Response orderResp = Jsoup.connect("https://trade.taobao.com/trade/itemlist/list_export_order.htm?page_no=1")
                    .header("Host", "trade.taobao.com") // must match the host being requested
                    .header("Connection", "keep-alive")
                    .header("Cache-Control", "max-age=0")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                    .header("Origin", "https://login.taobao.com")
                    .header("Upgrade-Insecure-Requests", "1")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .header("Accept-Encoding", "gzip, deflate")
                    .header("Accept-Language", "zh-CN,zh;q=0.8")
                    // Send the whole "name=value; ..." string as the Cookie header; cookie("Cookie", cookie) would create a single cookie literally named "Cookie"
                    .header("Cookie", cookie)
                    .execute();
            logger.info("Order page response code: "+orderResp.statusCode()+", msg: "+orderResp.statusMessage());
            Document doc = orderResp.parse();
            Element orderEle = doc.getElementsByAttributeValue("title", "下载订单报表").get(0); // first element whose title is "下载订单报表" (download order report)
            String orderUrl = orderEle.attr("href");
            logger.info("Order download URL: "+orderUrl);
            return orderUrl;
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
            logger.error(e.getMessage());
            return null;
        }
    }
    public static void main(String[] args){
        TaobaoCrawler crawler = new TaobaoCrawler();
        Map<String, String> map = FileUtil.propToMap();
        String cookie = crawler.login(map.get("username"), map.get("password"));
        String orderUrl = crawler.getOrderUrl(cookie);
    }
}
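The `login` method above returns the browser cookies as one `name=value; ` string, while Jsoup's `cookies(Map)` (used in the first attempt) expects a map. As a rough sketch only, the string could be split back into a map like this. The class name `CookieHeaderSketch` and the method `parseCookieHeader` are made up for illustration and are not part of the crawler.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieHeaderSketch {

    /** Splits a "name=value; name2=value2; " string into an ordered name-to-value map. */
    public static Map<String, String> parseCookieHeader(String cookieHeader) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (cookieHeader == null) {
            return cookies; // login() returns null when the login fails
        }
        for (String pair : cookieHeader.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty segments and entries without a name
            }
            // Split on the first '=' only, because a cookie value may itself contain '='
            cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Sample values taken from the first-attempt output earlier in this post; a real session carries more cookies
        String cookieHeader = "_tb_token_=e71873665bdae; t=7770a28456dfcad8106b11406e3bc765; v=0; ";
        Map<String, String> cookies = parseCookieHeader(cookieHeader);
        System.out.println(cookies);
    }
}
```

The resulting map can then be handed to `Jsoup.connect(url).cookies(cookies)`, which lets Jsoup build the Cookie header itself.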
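Ordinary captchas can be handled, but when the site verifies the user with a drag-the-slider check, the problem becomes very hard to solve. I hope you can find time to try it and share your suggestions.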
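One practical consequence: if a slider check blocks the login, the URL-polling loop in `login` never sees a redirect and spins forever. A bounded wait is one possible mitigation. The sketch below assumes Java 8 or later for the lambda syntax, and the names `BoundedWaitSketch` and `waitUntil` are hypothetical helpers, not Selenium APIs.

```java
import java.util.function.BooleanSupplier;

public class BoundedWaitSketch {

    /**
     * Polls the condition until it is true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the browser has left the login page": becomes true after about one second
        long start = System.currentTimeMillis();
        boolean leftLoginPage = waitUntil(() -> System.currentTimeMillis() - start > 1000, 5000, 100);
        System.out.println("left login page: " + leftLoginPage);

        // Stand-in for a login stuck behind a captcha: the condition never becomes true
        boolean stuck = waitUntil(() -> false, 300, 100);
        System.out.println("stuck login finished: " + stuck);
    }
}
```

In `login`, the condition would be the existing check that `webDriver.getCurrentUrl()` no longer starts with the login page URL, and a `false` result could be reported as a login that was probably intercepted by a captcha.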
That's all for now. I'm not much of a writer, so please bear with me.