How to convert a parse tree generated by coreNLP into the data.tree R package format

February 4, 2024, 270 views

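I want to convert a parse tree generated by the R package coreNLP into the format used by the data.tree R package. The parse tree is generated with the following code: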

options(java.parameters = "-Xmx2g")
library(NLP)
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos,lemma,parse")
## Some text.
s <- c("A rare black squirrel has become a regular visitor to a suburban garden.")
s <- as.String(s)


anno <- annotateString(s)
parse_tree <- getParse(anno)
parse_tree

The output parse tree is as follows:
> parse_tree
[1] "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"

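I found the post "Visualize Parse Tree Structure", which converts a parse tree generated by the openNLP package into a tree format. However, that parse tree differs from the one coreNLP generates, and the solution there does not produce the data.tree format I want.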

Edit
By adding the following two lines, we can use the function provided in the post "Visualize Parse Tree Structure":

# this step modifies coreNLP parse tree to mimic openNLP parse tree
parse_tree <- gsub("[\r\n]", "", parse_tree)
parse_tree <- gsub("ROOT", "TOP", parse_tree)

library(igraph)
library(NLP)

# parse2graph() is the function defined in that post; s is the sentence from above
parse2graph(parse_tree,  # plus optional graphing parameters
            title = sprintf("'%s'", s), margin=-0.05,
            vertex.color=NA, vertex.frame.color=NA,
            vertex.label.font=2, vertex.label.cex=1.5, asp=0.5,
            edge.width=1.5, edge.color='black', edge.arrow.size=0)
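
The two gsub() calls above can be wrapped in a small helper. The following is a minimal sketch in base R, not part of the original post: the name normalize_parse and the sample sentence are made up for illustration. It anchors the replacement to the leading "(ROOT" label, on the assumption that coreNLP output always starts with it, because a plain gsub("ROOT", "TOP", ...) would also rewrite a token such as "ROOTS" into "TOPS".

```r
# Hypothetical helper: make a coreNLP bracketed parse look like an openNLP one.
normalize_parse <- function(ptext) {
  ptext <- gsub("[\r\n]", "", ptext)   # drop line breaks; indentation spaces remain
  ptext <- trimws(ptext)               # drop leading/trailing whitespace
  sub("^\\(ROOT", "(TOP", ptext)       # rename only the leading ROOT label
}

# Made-up sample in the same layout as the coreNLP output shown earlier
raw <- "(ROOT\r\n  (S\r\n    (NP (DT The) (NNS ROOTS))\r\n    (VP (VBP grow))))\r\n\r\n"
clean <- normalize_parse(raw)
print(clean)
```

The extra spaces left over from the indentation are kept, as in the original two-line version.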

But what I want is to convert the parse tree into the data.tree structure provided by the data.tree package.

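Best answer

Once you have the edge list, converting it to data.tree is trivial. Replace only the last part of the parse2graph function, and move the styling out of the function: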

parse2tree <- function(ptext) {
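  stopifnot(require(NLP) && require(data.tree))  # Tree_parse() is from NLP, FromDataFrameNetwork() from data.tree

  ## Replace every token with a unique version so repeated labels (NP, DT, ...) stay distinct nodes
  ms <- gregexpr("[^() ]+", ptext)                                      # every run of characters without spaces or brackets
  words <- regmatches(ptext, ms)[[1]]                                   # all labels and words, in order
  regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words))))  # append a running id to each token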

  ## Going to construct an edgelist and pass that to data.tree
  ## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
  edgelist <- matrix('', nrow=length(words)-2, ncol=2)

  ## Function to fill in edgelist in place
  edgemaker <- (function() {
    i <- 0                                       # row counter
    g <- function(node) {                        # the recursive function
      if (inherits(node, "Tree")) {            # only recurse subtrees
        if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
          for (child in node$children) {
            childval <- if(inherits(child, "Tree")) child$value else child
            i <<- i+1
            edgelist[i,1:2] <<- c(val, childval)
          }
        }
        invisible(lapply(node$children, g))
      }
    }
  })()

  ## Create the edgelist from the parse tree
  edgemaker(Tree_parse(ptext))
  tree <- FromDataFrameNetwork(as.data.frame(edgelist, stringsAsFactors = FALSE))
  return (tree)
}
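
The renaming step at the top of parse2tree matters because the edge list identifies nodes by name only, so two NP nodes with the same label would be indistinguishable. Here is a small base-R sketch of that step on its own, using a made-up sentence; the variable name labelled is an illustration, not from the answer.

```r
# Made-up bracketed parse with repeated labels (NP, DT, NN) and a repeated word ("a")
ptext <- "(TOP (S (NP (DT a) (NN dog)) (VP (VBZ sees) (NP (DT a) (NN cat)))))"

ms <- gregexpr("[^() ]+", ptext)        # positions of every token (labels and words)
words <- regmatches(ptext, ms)[[1]]     # the tokens, in order of appearance

labelled <- ptext
regmatches(labelled, ms) <- list(paste0(words, seq_along(words)))  # append a running id

new_words <- regmatches(labelled, gregexpr("[^() ]+", labelled))[[1]]

print(anyDuplicated(words) > 0)       # TRUE: the raw labels repeat
print(anyDuplicated(new_words) == 0)  # TRUE: every token is now unique
print(labelled)
```

A consequence of this design is that the node names in the resulting tree carry the numeric suffix (for example NP3 rather than NP).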


parse_tree <- "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"
parse_tree <- gsub("[\r\n]", "", parse_tree)
parse_tree <- gsub("ROOT", "TOP", parse_tree)

library(data.tree)

tree <- parse2tree(parse_tree)
tree
SetNodeStyle(tree, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow")
plot(tree)  # plotting a data.tree Node requires the DiagrammeR package
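
To see the final conversion step in isolation, the sketch below builds a data.tree from a small hand-written edge list, without coreNLP or NLP. The edge list and its column names (from, to) are invented for illustration; as in the answer's code, the parent is in the first column and the child in the second. It assumes the data.tree package is installed.

```r
library(data.tree)

# Hand-written edge list: one row per parent -> child link, with unique node names
edges <- data.frame(
  from = c("S1",  "S1",  "NP2", "NP2", "VP7"),
  to   = c("NP2", "VP7", "DT3", "NN5", "VBZ8"),
  stringsAsFactors = FALSE
)

tree <- FromDataFrameNetwork(edges)  # the root is the node that never appears as a child
print(tree)

# A few built-in node fields for a quick sanity check
print(tree$name)        # name of the root node
print(tree$totalCount)  # number of nodes in the tree
print(tree$leafCount)   # number of leaf nodes
```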