[TOC]
1. Web scraping with the XML package
If a webpage contains many easily readable tables, the XML package makes scraping them straightforward.
The page should preferably be in English: the XML package garbles Chinese text (mojibake).
An example:
library(XML)
# Load the website and parse the tables it contains
u <- "url"
tbls <- readHTMLTable(u)
# A website may contain a large number of tables.
# Identify the table we need by inspecting the row count of each table.
sapply(tbls, nrow)
# Read the first table of the website "u"
pop <- readHTMLTable(u, which = 1)
# Export the data to local disk
write.csv(pop, file = "FilePath")
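Since readHTMLTable() returns a named list, the row counts from sapply() can also be used programmatically to pick out the table of interest instead of hard-coding which=1. A minimal self-contained sketch, parsing an inline HTML string rather than a live URL so it runs offline (the two-table document below is an illustrative assumption, not the original site):

```r
library(XML)

# A small inline HTML document with two tables (stands in for a live page)
html <- '<html><body>
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table><tr><th>a</th><th>b</th></tr>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr></table>
</body></html>'

# readHTMLTable() accepts raw HTML text as well as URLs
tbls <- readHTMLTable(html, header = TRUE)

# Pick the table with the most data rows
biggest <- which.max(sapply(tbls, nrow))
pop <- tbls[[biggest]]
nrow(pop)  # 2
```

The same which.max() idiom works on the tbls list returned from a real URL above.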
[1]http://blog.sina.com.cn/s/blog_ebf594400102v3am.html
[2]http://stackoverflow.com/questions/23584514/error-xml-content-does-not-seem-to-be-xml-r-3-1-0
2. The XML package's shortcoming with https
If you see Error: XML Content does not seem to be XML | R 3.1.0, the cause is that the XML package does not support scraping https pages [3].
Solution [3, 4]:
library(RCurl)
library(XML)
curlVersion()$features
curlVersion()$protocols
## These should list ssl and https.
## Verified on Windows 8.1 at least; it may differ on other OSes.
temp <- getURL("https://websiteurl",
               ssl.verifyPeer = FALSE)
# temp is a character string, so parse it as text
DFX <- xmlTreeParse(temp, useInternalNodes = TRUE, asText = TRUE)
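Once the page source has been fetched over https with getURL(), the table-extraction workflow from section 1 applies to the downloaded text as well. A hedged sketch of the combined approach (the URL is a placeholder, and note that ssl.verifyPeer = FALSE disables certificate verification, so it is a workaround rather than a recommended default):

```r
library(RCurl)
library(XML)

# Fetch the raw HTML over https with RCurl, working around the
# XML package's lack of https support
u <- "https://websiteurl"                  # placeholder URL
temp <- getURL(u, ssl.verifyPeer = FALSE)  # cert checks disabled: workaround only

# Parse the downloaded text, then reuse the table workflow from section 1
doc  <- htmlParse(temp, asText = TRUE)
tbls <- readHTMLTable(doc)
sapply(tbls, nrow)
```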
[3] http://stackoverflow.com/questions/23584514/error-xml-content-does-not-seem-to-be-xml-r-3-1-0
[4] http://www.omegahat.net/RCurl/installed/RCurl/html/getURL.html