我有一个PENN-Syntax-Tree,我想以递归方式获取该树包含的所有规则.
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)
我的目标是获得如下的语法规则:
ROOT --> S
S --> NP VP
NP --> NN
...
正如我所说,我需要递归地执行此操作,而无需NLTK包或任何其他模块或正则表达式.这是我到目前为止所拥有的.参数树是在每个空间上分割的Penn-Tree.
def extract_rules(tree):
tree = tree[1:-1]
print("\n\n")
if len(tree) == 0:
return
root_node = tree[0]
print("Current Root: "+root_node)
remaining_tree = tree[1:]
right_side = []
temp_tree = list(remaining_tree)
print("remaining_tree: ", remaining_tree)
symbol = remaining_tree.pop(0)
print("Symbol: "+symbol)
if symbol not in ["(", ")"]:
print("CASE: No Brackets")
print("Rule: "+root_node+" --> "+str(symbol))
right_side.append(symbol)
elif symbol == "(":
print("CASE: Opening Bracket")
print("Temp Tree: ", temp_tree)
cursubtree_end = bracket_depth(temp_tree)
print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
cursubtree_start = temp_tree.index(symbol)
cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
print("Subtree: ", cursubtree)
rnode = extract_rules(cursubtree)
if rnode:
right_side.append(rnode)
print("Rule: "+root_node+" --> "+str(rnode))
print(right_side)
return root_node
def bracket_depth(tree):
counter = 0
position = 0
subtree = []
for i, char in enumerate(tree):
if char == "(":
counter = counter + 1
if char == ")":
counter = counter - 1
if counter == 0 and i != 0:
counter = i
position = i
break
subtree = tree[0:position+1]
return position
目前它适用于S的第一个子树,但所有其他子树都不会被递归解析.很高兴得到任何帮助..
最佳答案 我倾向于保持尽可能简单,而不是试图重新发明你目前不允许使用的解析模块.就像是:
string = '''
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP (DT a) (NN talk)))
)
)
'''
def is_symbol_char(character):
'''
Predicate to test if a character is valid
for use in a symbol, extend as needed.
'''
return character.isalpha() or character in '-=$!?.'
def tokenize(characters):
'''
Process characters into a nested structure. The original string
'(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
'''
tokens = []
while characters:
character = characters.pop(0)
if character.isspace():
pass # nothing to do, ignore it
elif character == '(': # signals start of recursive analysis (push)
characters, result = tokenize(characters)
tokens.append(result)
elif character == ')': # signals end of recursive analysis (pop)
break
elif is_symbol_char(character):
# if it looks like a symbol, collect all
# subsequents symbol characters
symbol = ''
while is_symbol_char(character):
symbol += character
character = characters.pop(0)
# push unused non-symbol character back onto characters
characters.insert(0, character)
tokens.append(symbol)
# Return whatever tokens we collected and any characters left over
return characters, tokens
def extract_rules(tokens):
''' Recursively walk tokenized data extracting rules. '''
head, *tail = tokens
print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])
for token in tail: # recurse
if isinstance(token, list):
extract_rules(token)
characters, tokens = tokenize(list(string))
# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"
print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')
extract_rules(tokens[0])
OUTPUT
Tokens:
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]
Rules:
ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk
注意
我更改了原始树作为此子句:
(NP ((DT a) (NN talk)))
似乎不正确,因为它在网络上可用的语法树grapher上生成一个空节点,所以我简化为:
(NP (DT a) (NN talk))
根据需要调整.