R XML – Can't remove internal C nodes from memory

March 19, 2023

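I have to parse ~2000 XML documents, extract certain nodes from each one, add them to a single document, and save it. I am using internal C nodes so that I can use XPath. The problem is that as I loop over the documents I can't remove the internal C objects from memory, and I end up with >4 GB of used memory. I know the problem is not the loaded tree (I ran the loop just loading and removing a hash tree for each document), but the filtered nodes or the root node.

Here is the code I am using. What am I missing that would let me clear the memory at the end of each iteration?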

library(XML)

# all.docs is a character vector of paths to the XML files.
xmlDoc <- xmlHashTree()
rootNode <- newXMLNode("root")

for (i in seq_along(all.docs)){

  # Read in the doc, filter out nodes, remove temp doc
  temp.xml <- xmlParse(all.docs[i])
  filteredNodes <- newXMLNode(all.docs[i],
                   xpathApply(temp.xml, "//my.node[@my.attr='my.value']"))
  free(temp.xml)
  rm(temp.xml)

  # Add filtered nodes to root node and get rid of them.
  addChildren(rootNode, filteredNodes)
  removeNodes(filteredNodes, free = TRUE)
  rm(filteredNodes)

}
# Add root node to doc and save that new log.
xmlDoc <- newXMLDoc(node = rootNode)
saveXML(xmlDoc, file = "MergedDocs.xml")

Thanks for your help.

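Best answer

So, as far as I could find, there is no way to do this with 'XML' without memory leaks and a lot of processing time.

Fortunately, 'xml2' can now handle creating documents and nodes. For completeness, here is a solution using 'xml2'. If anyone knows a way to do it with 'XML', please chime in.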

library(xml2)
library(magrittr)  # provides %>%

xmlDoc <- xml_new_document() %>% xml_add_child("root")

for (i in seq_along(all.docs)){
 # Read in the log.
 rawXML <- read_xml(all.docs[i])

 # Filter relevant nodes and cast them to a list of children.
 tempNodes   <- xml_find_all(rawXML, "//my.node[@my.attr='my.value']")
 theChildren <- xml_children(tempNodes)

 # Get rid of the temp doc.
 rm(rawXML)

 # Add the filtered nodes to the log under a node named after the file name
 xmlDoc %>%
   xml_add_child(all.docs[i]) %>%
   xml_add_child(theChildren[[1]]) %>%
   invisible()

 # Remove the temp objects
 rm(tempNodes); rm(theChildren)
}
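
The loop above stops before anything is written to disk. For anyone who wants to try the idea end to end, here is a minimal self-contained sketch of the same 'xml2' approach, not the exact code from the answer. The temp directory, the two sample files, the `doc` container element, and its `source` attribute are all made up for illustration. It also differs from the loop above in two ways: it copies every matched node whole instead of only `theChildren[[1]]`, and it stores the file name in an attribute, on the assumption that a file path is usually not a valid XML element name. It does not measure memory use.

```r
library(xml2)

# Hypothetical input: two small XML files written to a temp directory.
tmp_dir <- tempfile("xml-merge-")
dir.create(tmp_dir)
doc_paths <- file.path(tmp_dir, c("a.xml", "b.xml"))
writeLines(
  paste0('<log><my.node my.attr="my.value"><item>1</item></my.node>',
         '<my.node my.attr="other"><item>2</item></my.node></log>'),
  doc_paths[1]
)
writeLines(
  '<log><my.node my.attr="my.value"><item>3</item></my.node></log>',
  doc_paths[2]
)

# New document with a single <root> element.
merged <- xml_new_root("root")

for (path in doc_paths) {
  raw_xml <- read_xml(path)
  matches <- xml_find_all(raw_xml, "//my.node[@my.attr='my.value']")

  # One container per source file; the file name is kept as an attribute.
  container <- xml_add_child(merged, "doc", source = basename(path))

  # Copy each matched node into the container (.copy is TRUE by default
  # when the value is an existing node).
  for (j in seq_along(matches)) {
    xml_add_child(container, matches[[j]])
  }

  # Drop the references to the parsed source document.
  rm(raw_xml, matches)
}

# Write the merged document to disk.
out_path <- file.path(tmp_dir, "MergedDocs.xml")
write_xml(merged, out_path)
```

Reading `out_path` back with `read_xml()` should show one `doc` element per input file, each holding only the `my.node` elements whose `my.attr` equals `my.value`.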