python – 文本分段：将输入与字典中最长的单词匹配的算法

2023年6月18日 270次阅读

我需要将一个字符串拆分成单词,这样每个单词都来自字典.还要确保选择左侧最长的单词.于是

thisisinsane => this is insane (correct as longest possible word from left)
thisisinsane => this is in sane(wrong) 
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.

我设法通过从字符串的末尾遍历到匹配最长字的开头来解决这个问题.但问题开始让我们因为这些问题而出现问题.

shareasale => share as ale(wrong as 'ale' not in the dictionary)
shareasale => share a sale(correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary unlike 'ale'.

我试图通过删除在遇到错误之前找到的有效段来解决此问题,即

shareasale => 'share' and 'as' (error = 'ale')

并从字典中删除一次,然后解决问题.所以

shareasale => no valid segments when share is removed
shareasale => share a sale (when 'as' is removed from the dictionary.

因此我设法解决了这个问题.但后来我无法解决这个问题

asignas => 'as' ( error = 'ignas')

然后我的解决方案将从字典中删除’as’并尝试解决它

asignas => 'a' 'sign' (error = 'as')

因为在新的递归调用’as’已从字典中删除.我写的函数是in this link.我希望有人可以通过它并帮助我找到一个更好的算法来解决这个问题,建议修改我现有的算法.

最佳答案基本上你的问题是树问题,在每个级别,形成树前缀的所有单词形成分支.不留下字符串一部分的分支是正确的解决方案.

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /
                               /
                       (this,is,in,sane)

因此,在此示例中有两个解决方案,但我们希望使用最长的单词选择解决方案,即我们希望使用深度优先搜索策略从右侧探索树.

所以我们的算法应该：

>按降序长度排序字典.
>查找当前字符串的所有前缀.如果没有,则返回False.
>将前缀设置为最长的未探索前缀.
>将其从字符串中删除.如果字符串为空,我们找到了一个解决方案,返回所有前缀的列表.
>递归到2.
>此分支失败,返回False.

此解决方案的示例实现：

def segment(string,wset):
    """Segments a string into words prefering longer words givens
    a dictionary wset."""
    # Sort wset in decreasing string order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop() # Remove the empty string token
        result.reverse() # Put the list into correct order
        return result
    else:
        raise Exception("No possible segmentation!")

def tokenize(string, wset, token):
    """Returns either false if the string can't be segmented by 
    the current wset or a list of words that segment the string
    in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Find all possible prefixes
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string.replace(pref, '', 1), wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False

print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
print segment("asignas", ['as', 'sign', 'a'])                 # a sign as