Scraping TripAdvisor attraction reviews (Part 1)

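This was my first attempt at finding the real URL behind an asynchronously loaded page. I spent a whole afternoon on it, and it really was hard. But I have this habit: the harder something is to find, the more I want to find it.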


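I have finally found the real URL I was after, and I could cry.
Take Huangshan as an example: searching for Huangshan brings up its attraction page with the list of reviews.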


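What is asynchronous loading? When I switch the review language, the URL in the address bar does not change. That is the giveaway: the new reviews are fetched by a separate request in the background.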


Once I understood what capturing requests means and how to do it, a long hunt for the right request began. I will not go into the details.

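The first thing I found is that adding browser information (request headers) to the request for the start page is enough to get the English version of the page back and parse it.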
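To make the header trick concrete, here is a minimal sketch that only builds the request and does not send it. It uses the standard library's `urllib.request` instead of `requests` so it runs without extra packages. The function name `build_page_request` is made up for illustration, and sending only the `TALanguage=en` cookie (taken from the full cookie in the code at the end of this post) is an assumption about what selects the English page; I have not verified that this cookie alone is sufficient.

```python
from urllib.request import Request

# Start page for Huangshan, the same URL used as Referer later in this post.
START_URL = (
    "http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-Reviews-"
    "Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html"
)


def build_page_request(url):
    """Build (but do not send) a GET request that looks like a browser.

    Hypothetical helper: the header values are copied from the post's own
    headers dict; the minimal cookie is an assumption, not a verified fact.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
        ),
        "Cookie": "TALanguage=en",  # assumption: this cookie picks the language
    }
    return Request(url, headers=headers)


req = build_page_request(START_URL)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it; with `requests`, the same dict can be passed as `headers=` to `requests.get`.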

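But there is a "More" link here, and it is yet another asynchronous load! So the search goes on.
Click "clear" in the developer tools to empty the list of captured requests.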


After clicking "More" several times, a new request showed up in the list.
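The same hunt can be done offline, as a rough sketch: the Network panel can export the captured requests as a HAR file, which is JSON with the request URLs under `log.entries[].request.url`. The helper name `find_requests` and the tiny `sample_har` dict below are made up for illustration; the second URL is shortened from the one used at the end of this post.

```python
from urllib.parse import urlparse


def find_requests(har, keyword):
    """Return the URLs of captured requests whose path contains `keyword`.

    `har` is a parsed HAR document (a dict). The match is case-insensitive
    and looks only at the URL path, not at the query string.
    """
    hits = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if keyword.lower() in urlparse(url).path.lower():
            hits.append(url)
    return hits


# Hand-written stand-in for a HAR export; a real one has many more fields.
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": (
                "http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-"
                "Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html")}},
            {"request": {"method": "POST", "url": (
                "http://www.tripadvisor.cn/ExpandedUserReviews-g303685-d550738"
                "?target=410115359&context=1&expand=1")}},
        ]
    }
}

for url in find_requests(sample_har, "expanded"):
    print(url)
```

For a real export, load the file with `json.load` and pass the resulting dict in place of `sample_har`.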

The lesson here: names matter. The request's name already tells us it is an expansion. Sure enough, after opening the URL I had found, the full text of the reviews finally appeared.
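Looking at the URL that turned up, its parts can be put back together with a small helper. This is only a sketch based on the single URL in this post: the names `expanded_reviews_url`, `geo_id` and `detail_id` are my own labels for the `g…` and `d…` numbers, and the rule that `target` equals the first review ID is an assumption read off that one example. Where the review IDs come from is not covered here.

```python
from urllib.parse import urlencode

BASE = "http://www.tripadvisor.cn"


def expanded_reviews_url(geo_id, detail_id, review_ids):
    """Rebuild an ExpandedUserReviews URL from its visible parts.

    Hypothetical helper; parameter meanings are guesses from one observed URL.
    """
    ids = [str(i) for i in review_ids]
    if not ids:
        raise ValueError("need at least one review id")
    query = urlencode(
        [
            ("target", ids[0]),          # assumption: first id is the target
            ("context", "1"),
            ("reviews", ",".join(ids)),
            ("servlet", "Attraction_Review"),
            ("expand", "1"),
        ],
        safe=",",  # keep the commas between review ids unescaped
    )
    return f"{BASE}/ExpandedUserReviews-g{geo_id}-d{detail_id}?{query}"


# The ten review ids that appear in the URL used at the end of this post.
REVIEW_IDS = [
    410115359, 409344604, 407255372, 401140048, 400179383,
    398229741, 396111020, 395334568, 394200191, 393782571,
]
print(expanded_reviews_url(303685, 550738, REVIEW_IDS))
```

With these inputs the helper reproduces the URL used in the parsing code at the end of this post, which at least confirms that the pieces fit together.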

Is that the end of it?

Of course not. Where does that long string of numbers come from? That is for the next post. To be continued…


As usual, here is the standalone parsing code:


import requests
from lxml import etree

# Real URL behind the "More" link, found in the developer tools' Network panel
url = 'http://www.tripadvisor.cn/ExpandedUserReviews-g303685-d550738?target=410115359&context=1&reviews=410115359,409344604,407255372,401140048,400179383,398229741,396111020,395334568,394200191,393782571&servlet=Attraction_Review&expand=1'
headers = {'Accept': '*/*',
           'Accept-Encoding': 'gzip, deflate, sdch',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           'Connection': 'keep-alive',
           # Cookie copied from my browser session; replace it with your own
           'Cookie': 'ServerPool=X; TATravelInfo=V2*A.2*MG.-1*HP.2*FL.3*RVL.550738_100*RS.1; TASSK=enc%3AAGMMZ%2Bwe98u9po0Y%2FIY8pNbyuAGi9fbnqnNLKXa4%2BK5cWP0RMuCHTRZhu0uFf1yydRIPPAQ%2FpF7EdW0NLOpBZZId19ek1a9GHWZKvnuTIJ0QcXx1ULQXtiMx%2F%2BHhNCUrIg%3D%3D; TAUnique=%1%enc%3AjrXWw0qqncCEQMzfl5keG315t9yL8iOg6jLwcPiP6q8%3D; _jzqckmp=1; bdshare_firstime=1491815789350; __gads=ID=e5060e1a6b1ed08f:T=1491815796:S=ALNI_MbFkpxx2-zq7ubsIoe4wvdJnbQWoA; TALanguage=en; TAReturnTo=%1%%2FAttraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html; TASession=%1%V2ID.DA0C735ECBB05FFBD2F31EA11943410C*SQ.15*LP.%2FAttraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui%5C.html*LS.Attraction_Review*GR.70*TCPAR.53*TBR.19*EXEX.62*ABTR.65*PHTB.78*FS.82*CPU.26*HS.popularity*ES.popularity*AS.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.en*FA.1*DF.0*MS.-1*RMS.-1*FLO.550738*TRA.false*LD.550738; CM=%1%HanaPersist%2C%2C-1%7CPremiumMobSess%2C%2C-1%7Ct4b-pc%2C%2C-1%7CHanaSession%2C%2C-1%7CRCPers%2C%2C-1%7CWShadeSeen%2C%2C-1%7CFtrPers%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C%2C-1%7CPremiumSURPers%2C%2C-1%7CPremiumMCSess%2C%2C-1%7Csesscoestorem%2C%2C-1%7CCpmPopunder_1%2C1%2C1491902222%7CCCSess%2C%2C-1%7CCpmPopunder_2%2C1%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7C%24%2C%2C-1%7CPremiumORSess%2C%2C-1%7Ct4b-sc%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7Cperscoestorem%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CSaveFtrPers%2C%2C-1%7CTheForkRRSess%2C%2C-1%7Cpers_rev%2C%2C-1%7CMetaFtrSess%2C%2C-1%7CRBAPers%2C%2C-1%7CWAR_RESTAURANT_FOOTER_PERSISTANT%2C%2C-1%7CFtrSess%2C%2C-1%7CHomeAPers%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CRCSess%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7Cbookstickcook%2C%2C-1%7Csh%2C%2C-1%7CLastPopunderId%2C137-1859-null%2C-1%7Cpssamex%2C%2C-1%7CTheForkMCCS'
                     'ess%2C%2C-1%7C2016sticksess%2C%2C-1%7CCCPers%2C%2C-1%7CWAR_RESTAURANT_FOOTER_SESSION%2C%2C-1%7Cb2bmcsess%2C%2C-1%7C2016stickpers%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CSaveFtrSess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CRBASess%2C%2C-1%7Cbookstickpers%2C%2C-1%7Cperssticker%2C%2C-1%7CMetaFtrPers%2C%2C-1%7C; TAUD=LA-1491815815299-1*LG-14277644-2.1.F.*LD-14277645-.....; roybatty=TNI1625!AP9YRq1oHIHfPtXcJCINRrDe7hLPCe8L8uurjbOYo996M1NrdEF3UC8F2w%2BA%2FvgIK20Ptfm2qFK2Y7gBNq3fPyswrYVGd%2BwBp%2FhQTse54C7MDQU3%2FCl9pe%2FrrYw8WiSNYgQ6pewgJ',
           'Host': 'www.tripadvisor.cn',
           'Referer': 'http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
           }
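# Request the expanded reviews and parse the returned HTML fragment
html = requests.post(url, headers=headers).content
selector = etree.HTML(html)
# Each review body sits in a <div class="entry">
infos = selector.xpath('//div[@class="entry"]')
print(len(infos))
for info in infos:
    # Join every text node of the <p> so multi-line reviews are kept whole,
    # and skip entries without text instead of failing with IndexError
    comment = '\n'.join(t.strip() for t in info.xpath('p/text()') if t.strip())
    if comment:
        print(comment)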
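To try the extraction step without hitting the site, the following sketch runs the same idea on a hand-written HTML fragment. It swaps lxml and XPath for the standard library's `html.parser`, so it is not the code above, only an offline stand-in. The class name `EntryTextParser` and the `SAMPLE` markup are made up for illustration; the real response is assumed to use the same `div class="entry"` containers that the XPath targets.

```python
from html.parser import HTMLParser


class EntryTextParser(HTMLParser):
    """Collect the text of every <p> found inside a <div class="entry">.

    Unlike the XPath 'p/text()', this also accepts <p> elements nested
    deeper than direct children of the div.
    """

    def __init__(self):
        super().__init__()
        self.comments = []    # one string per review paragraph
        self._div_depth = 0   # > 0 while inside a div.entry
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self._div_depth or "entry" in classes:
                self._div_depth += 1
        elif tag == "p" and self._div_depth:
            self._in_p = True
            self._buffer = []
        elif tag == "br" and self._in_p:
            self._buffer.append(" ")  # treat a line break as a space

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._buffer).split())
            if text:
                self.comments.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


# Made-up fragment shaped like the markup the XPath above expects.
SAMPLE = """
<div class="entry"><p>Amazing sunrise.<br/>Go early.</p></div>
<div class="other"><p>not a review</p></div>
<div class="entry"><p>  Crowded but worth it. </p></div>
"""

parser = EntryTextParser()
parser.feed(SAMPLE)
for comment in parser.comments:
    print(comment)
```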