[TOC]
1. Web scraping with the XML package
If a webpage contains many easily readable tables, the XML package makes scraping them straightforward.
The page should preferably be in English: the XML package garbles Chinese text (mojibake).
An example:
library(XML)
# Load the website and parse the tables it contains
u <- "url"
tbls <- readHTMLTable(u)
# A website may contain a large number of tables.
# Identify the table we need by inspecting the row count of each table.
sapply(tbls, nrow)
# Read the first table of the website "u"
pop <- readHTMLTable(u, which = 1)
# Export the data to local disk
write.csv(pop, file = "FilePath")
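Since readHTMLTable() returns a named list, the row counts from sapply() can also be used programmatically to pick out the table of interest instead of hard-coding which=1. A minimal self-contained sketch, parsing an inline HTML string rather than a live URL so it runs offline (the two-table document below is an illustrative assumption, not the original site):

```r
library(XML)

# A small inline HTML document with two tables (stands in for a live page)
html <- '<html><body>
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table><tr><th>a</th><th>b</th></tr>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr></table>
</body></html>'

# readHTMLTable() accepts raw HTML text as well as URLs
tbls <- readHTMLTable(html, header = TRUE)

# Pick the table with the most data rows
biggest <- which.max(sapply(tbls, nrow))
pop <- tbls[[biggest]]
nrow(pop)  # 2
```

The same which.max() idiom works on the tbls list returned from a real URL above.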
[1]http://blog.sina.com.cn/s/blog_ebf594400102v3am.html
[2]http://stackoverflow.com/questions/23584514/error-xml-content-does-not-seem-to-be-xml-r-3-1-0
2. The XML package's shortcoming with https
If you see Error: XML Content does not seem to be XML | R 3.1.0, the cause is that the XML package does not support scraping https pages [3].
Solution [3, 4]:
library(RCurl)
library(XML)
curlVersion()$features
curlVersion()$protocols
## These should list ssl and https.
## Verified on Windows 8.1 at least; it may differ on other OSes.
temp <- getURL("https://websiteurl",
               ssl.verifyPeer = FALSE)
# temp is a character string, so parse it as text
DFX <- xmlTreeParse(temp, useInternalNodes = TRUE, asText = TRUE)
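Once the page source has been fetched over https with getURL(), the table-extraction workflow from section 1 applies to the downloaded text as well. A hedged sketch of the combined approach (the URL is a placeholder, and note that ssl.verifyPeer = FALSE disables certificate verification, so it is a workaround rather than a recommended default):

```r
library(RCurl)
library(XML)

# Fetch the raw HTML over https with RCurl, working around the
# XML package's lack of https support
u <- "https://websiteurl"                  # placeholder URL
temp <- getURL(u, ssl.verifyPeer = FALSE)  # cert checks disabled: workaround only

# Parse the downloaded text, then reuse the table workflow from section 1
doc  <- htmlParse(temp, asText = TRUE)
tbls <- readHTMLTable(doc)
sapply(tbls, nrow)
```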
[3] http://stackoverflow.com/questions/23584514/error-xml-content-does-not-seem-to-be-xml-r-3-1-0
[4] http://www.omegahat.net/RCurl/installed/RCurl/html/getURL.html