algorithm – Efficiently find a given subsequence in a string, maximizing the number of contiguous characters

September 22, 2023

Long problem description

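Fuzzy string matcher utilities such as fzf or CtrlP filter a list of strings, keeping those that contain a given search string as a subsequence.
For example, consider a user who wants to search for a specific photo in a list of files. To find the file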

/home/user/photos/2016/pyongyang_photo1.png

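it suffices to type ph2016png, because this search string is a subsequence of the file name. (Note that this is not LCS: the whole search string must be a subsequence of the file name.)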
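As a minimal sketch of that subsequence test (the function name `is_subsequence` is made up for illustration and is not part of fzf or CtrlP), a single left-to-right scan is enough:

```python
def is_subsequence(query: str, text: str) -> bool:
    """Return True if the characters of query appear in text, in order."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to and including the first
    # match, so each later character can only match further to the right.
    return all(ch in remaining for ch in query)


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(is_subsequence("ph2016png", path))  # True
print(is_subsequence("png2016", path))    # False: no "2" after the first "png"
```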

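Checking whether a given search string is a subsequence of another string is trivial, but I wonder how to efficiently obtain the best match: in the example above there are multiple possible matches. One is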

/home/user/photos/2016/pyongyang_photo1.png

but the one the user probably had in mind matches "ph" in "photos", then "2016", then the final "png":

/home/user/photos/2016/pyongyang_photo1.png

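To formalize this, I define the "best" match as the one composed of the smallest number of substrings. That number is 5 for the first example match and 3 for the second.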

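I came up with this because it would be interesting to obtain the best match in order to assign a score to each result and sort by it. I am not interested in approximate solutions, though; my interest in this problem is primarily academic.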

tl;dr problem description

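Given strings s and t, find, among the subsequences of t that are equal to s, one that maximizes the number of pairs of elements that are contiguous in t.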
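The two formulations are equivalent: a match of len(s) characters with k adjacent pairs consists of len(s) - k substrings, so maximizing adjacent pairs minimizes the number of substrings. A small sketch of that arithmetic follows. The helper `count_segments` and both index lists are illustrative assumptions; in particular, the 5-substring match is just one possible example.

```python
def count_segments(positions):
    """Count the maximal runs of consecutive indices in a match."""
    adjacent_pairs = sum(
        1 for a, b in zip(positions, positions[1:]) if b == a + 1
    )
    return len(positions) - adjacent_pairs


t = "/home/user/photos/2016/pyongyang_photo1.png"
s = "ph2016png"

# Two hand-picked matches of s in t, given as indices into t.
best = [11, 12, 18, 19, 20, 21, 40, 41, 42]   # "ph" + "2016" + "png"
worse = [11, 12, 18, 19, 20, 21, 23, 26, 31]  # "ph" + "2016" + "p" + "n" + "g"

for match in (best, worse):
    assert "".join(t[i] for i in match) == s  # both really spell s
    print(count_segments(match))              # prints 3, then 5
```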

What I have tried so far

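For the sake of discussion, let's call the search query s and the string to test t. The solution to the problem is denoted fuzzy(s, t). I will use Python's string slicing notation. The simplest approach is as follows: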

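Since any solution must use all characters of s in order, an algorithm for this problem can start by searching for the first occurrence of s[0] in t (at index i) and then take the better of the two solutions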

t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i]   + fuzzy(s,     t[i+1:])    # Skip it and use the next occurrence 
                                   # of s[0] in t instead

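This is obviously not the best solution to the problem. On the contrary, it is the obvious brute-force one. (In an earlier version of this question I experimented with simultaneously searching for the last occurrence of s[-1] and using that information, but that approach turned out not to work.)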
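For reference, the brute-force idea can be written as a runnable sketch. This is an adaptation rather than the exact recursion above: the name `fuzzy_brute` is hypothetical, and it returns the matched indices in t instead of a string. It tries every occurrence of each query character and keeps the match with the fewest substrings.

```python
def fuzzy_brute(s, t):
    """Return the indices in t of a match of s with the fewest substrings,
    or None if s is not a subsequence of t. Exponential in the worst case."""
    best = None

    def segments(pos):
        # A new substring starts wherever an index does not follow its predecessor.
        return sum(1 for k, p in enumerate(pos) if k == 0 or p != pos[k - 1] + 1)

    def search(k, start, chosen):
        nonlocal best
        if k == len(s):  # all query characters placed
            if best is None or segments(chosen) < segments(best):
                best = list(chosen)
            return
        i = t.find(s[k], start)
        while i != -1:
            chosen.append(i)              # use this occurrence
            search(k + 1, i + 1, chosen)
            chosen.pop()                  # skip it and try the next occurrence
            i = t.find(s[k], i + 1)

    search(0, 0, [])
    return best


t = "/home/user/photos/2016/pyongyang_photo1.png"
print(fuzzy_brute("ph2016png", t))  # [11, 12, 18, 19, 20, 21, 40, 41, 42]
```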

→ My question is: what is the most efficient solution to this problem?

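Best answer

I would suggest creating a search tree in which each node represents a character position in the haystack that matches one of the needle characters.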

The top nodes are siblings and represent the occurrences of the first needle character in the haystack.

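The children of a parent node are the nodes that represent occurrences of the next needle character in the haystack, but only those located after the position represented by that parent node.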

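Logically, this means that some children are shared by several parents, so the structure is not really a tree but a directed acyclic graph. Some sibling parents may even have exactly the same children. Other parents may have no children at all: they are dead ends, unless they are at the bottom of the graph, where the leaves represent positions of the last needle character.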
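To make the shape of this graph concrete, here is a simplified sketch that stores one list of haystack positions per needle character instead of linked nodes. The names `build_levels` and `children` are illustrative assumptions, and the sketch deliberately leaves out any pruning of positions that cannot be part of a full match.

```python
def build_levels(haystack, needle):
    """levels[j] lists the haystack indices where needle[j] occurs."""
    return [[i for i, c in enumerate(haystack) if c == n] for n in needle]


def children(levels, j, i):
    """Nodes for needle[j + 1] that lie to the right of haystack index i."""
    if j + 1 >= len(levels):
        return []  # bottom of the graph: leaves have no children
    return [p for p in levels[j + 1] if p > i]


levels = build_levels("abcab", "ab")
print(levels)                  # [[0, 3], [1, 4]]
print(children(levels, 0, 0))  # [1, 4]
print(children(levels, 0, 3))  # [4]  (node 4 is shared by both parents)

dead = build_levels("abcab", "ba")
print(children(dead, 0, 4))    # []   (a dead end: no "a" after index 4)
```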

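Once this graph is set up, a depth-first search through it can easily derive the number of segments still needed from a given node onwards, and then minimise that number among the alternatives.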
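The following is a compact sketch of such a memoized depth-first search, assuming a plain list-of-positions representation rather than linked nodes. The name `min_segments` is hypothetical, and the sketch omits early-exit optimisations, so it illustrates the idea rather than a tuned implementation.

```python
from functools import lru_cache


def min_segments(haystack, needle):
    """Fewest substrings needed to match needle in haystack, or None."""
    if not needle:
        return 0
    # levels[j]: haystack indices where needle[j] occurs.
    levels = [[i for i, c in enumerate(haystack) if c == n] for n in needle]
    inf = float("inf")

    @lru_cache(maxsize=None)  # each (j, i) node is evaluated only once
    def cost_from(j, i):
        """Extra segments needed after matching needle[j] at index i."""
        if j == len(needle) - 1:
            return 0
        best = inf  # stays infinite for dead ends
        for p in levels[j + 1]:
            if p > i:
                # A non-adjacent child starts a new segment.
                best = min(best, cost_from(j + 1, p) + (p != i + 1))
        return best

    total = min((1 + cost_from(0, i) for i in levels[0]), default=inf)
    return None if total == inf else total


print(min_segments("/home/user/photos/2016/pyongyang_photo1.png", "ph2016png"))  # 3
```

Memoizing on the pair (needle index, haystack index) is what keeps shared children from being explored once per parent.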

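I have added some comments in the Python code below. The code could probably still be improved, but it already seems quite efficient compared to your solution.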

def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of  
        # segments needed for any solution. If we find such case, 
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i:       index in haystack where needle character occurs
    #   child:   node that represents a match, at the right of this index, 
    #            for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count:   the least number of additional segments needed for matching the 
    #            remaining needle characters (only; so not counting the segments
    #            already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count

    # Use the `returnSegments` option when you need the character content 
    # of the segments instead of only the count. It runs in linear time.

    if count == 1: # Deal with short-cut case 
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments

The function can be called with an optional third argument:

haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"

print (fuzzy_trincot(haystack, needle))

print (fuzzy_trincot(haystack, needle, True))

Output:

3
['ph', '2016', 'png']

Since the function is optimised to return only the count, the second call adds a little to the execution time.