This should be an easy one for the XPath experts! 🙂
Document structure:
<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>
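Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]]. That is, when two consecutive tokens both have an entityType of PROPER_NOUN, I want to extract the words from those two tokens.
I've gotten as far as: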
"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"
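...which finds the second of two consecutive PROPER_NOUN tokens, but I can't figure out how to get it to emit the first token along with it.
Some notes:
- I don't mind doing higher-level processing of the NodeSet (e.g. in Ruby / Nokogiri) if that simplifies the problem.
- If there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I'd like to emit [A, B], [B, C].
Update
Here's my solution using higher-level Ruby functions. But I'm tired of those XPath bullies kicking sand in my face, and I want to know how REAL XPath programmers do it!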
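To make the overlapping-pairs requirement concrete, here is a minimal plain-Ruby sketch. It assumes the tokens have already been reduced to hypothetical [word, entity_type] arrays (that reduction and the name adjacent_proper_nouns are illustrative, not part of the original code), and it uses a sliding window of size two via Enumerable#each_cons:

```ruby
# Hypothetical input: tokens already flattened to [word, entity_type] arrays.
tokens = [
  ['A', 'PROPER_NOUN'],
  ['B', 'PROPER_NOUN'],
  ['C', 'PROPER_NOUN'],
  ['admires', 'VERB'],
  ['D', 'PROPER_NOUN']
]

# Slide a window of two over the tokens and keep the windows in which
# both entries are proper nouns, returning just the words.
def adjacent_proper_nouns(tokens)
  tokens.each_cons(2)
        .select { |pair| pair.all? { |_word, type| type == 'PROPER_NOUN' } }
        .map { |pair| pair.map(&:first) }
end

p adjacent_proper_nouns(tokens)
```

For the run A, B, C this yields [A, B] and [B, C], while the isolated D produces nothing because its only neighbour is a verb.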
def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence|
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      # name stays nil unless this token is a proper noun
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      # emit a pair only when the previous token was a proper noun too
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end
Best answer: I would do it in two steps. The first step is to select a set of nodes:
//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]
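This gives you every token that starts a two-word pair. Because the sibling test sits inside the predicate rather than in a further location step, the expression returns the first token of each pair instead of the second. Then, to get the actual pairs, iterate over the node list and extract ./word and following-sibling::token[1]/word from each node.
Using XmlStarlet (http://xmlstar.sourceforge.net/, a powerful tool for quick XML manipulation), the command line is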
xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml
which gives
Newt,Gingrich
Garry,Trudeau
XmlStarlet can also compile that command line into XSLT; the relevant bit is
<xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
  <xsl:value-of select="word"/>
  <xsl:value-of select="','"/>
  <xsl:value-of select="following-sibling::token[1]/word"/>
  <xsl:value-of select="'&#10;'"/>
</xsl:for-each>
With Nokogiri it might look something like:
require 'nokogiri'

# parse the document (the_document_string holds the XML text)
doc = Nokogiri::XML(the_document_string)
# select every token that starts a two-word pair
pair_starts = doc.xpath('//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]')
# extract each word together with the one that follows it
result = pair_starts.map do |node|
  [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end
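If Nokogiri is not at hand, the same select-then-extract idea can be sketched with REXML, which ships with standard Ruby installs. This is only an illustration under stated assumptions: SAMPLE_XML, proper_noun? and proper_noun_pairs are hypothetical names, and REXML's Element#next_element stands in for the following-sibling::token[1] step, which is equivalent only while every child of tokens is a token element (as in the document in the question). The sample has three proper nouns in a row to exercise the [A, B], [B, C] case.

```ruby
require 'rexml/document'

# Hypothetical sample: three proper nouns in a row, then a verb.
SAMPLE_XML = <<~DOC
  <tokens>
    <token><word>A</word><entityType>PROPER_NOUN</entityType></token>
    <token><word>B</word><entityType>PROPER_NOUN</entityType></token>
    <token><word>C</word><entityType>PROPER_NOUN</entityType></token>
    <token><word>runs</word><entityType>VERB</entityType></token>
  </tokens>
DOC

# True when the node is a <token> whose entityType is PROPER_NOUN.
def proper_noun?(token)
  !token.nil? && token.name == 'token' &&
    token.elements['entityType']&.text == 'PROPER_NOUN'
end

def proper_noun_pairs(xml)
  doc = REXML::Document.new(xml)
  tokens = REXML::XPath.match(doc, '//token')
  # Step 1: keep the tokens that start a pair. next_element is assumed
  # here as a stand-in for following-sibling::token[1].
  starts = tokens.select { |t| proper_noun?(t) && proper_noun?(t.next_element) }
  # Step 2: read the word of each start token and of its next sibling.
  starts.map { |t| [t.elements['word'].text, t.next_element.elements['word'].text] }
end

p proper_noun_pairs(SAMPLE_XML)
```

B is selected in step 1 because it is itself followed by a proper noun, which is why a run of three produces two overlapping pairs rather than one.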