A Brief Look at the KMP Algorithm and Its Implementation

May 16, 2024

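Problem Description

We use the string search (strStr) problem on LintCode to introduce the string matching problem.

Given a source string and a target string, find the first position (0-based) at which
target occurs in source. If target does not occur, return -1.

Examples
If source = "source" and target = "target", return -1.
If source = "abcdabcdefg" and target = "bcd", return 1.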
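
As a quick sanity check on the two examples, Java's built-in String.indexOf follows the same contract (0-based index of the first occurrence, or -1 if there is none). The sketch below is only a reference baseline for checking hand-written matchers; the class name IndexOfBaseline and the method find are made up for illustration and are not part of the original solution.

```java
final class IndexOfBaseline {
    // Reference answer from the standard library; not one of the algorithms discussed here.
    static int find(String source, String target) {
        if (source == null || target == null) {
            return -1;
        }
        return source.indexOf(target);
    }

    public static void main(String[] args) {
        System.out.println(find("source", "target"));   // -1
        System.out.println(find("abcdabcdefg", "bcd")); // 1
    }
}
```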

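BF Matching Algorithm (Brute-Force Matching)

Idea

Starting at position pos of the main string S, compare with the first character of the pattern T. If the two are equal, keep comparing the following characters one by one;
otherwise, go back to position pos+1 of S and restart the comparison from the first character of T.
The match succeeds once every character of T equals, in order, a run of consecutive characters of S;
the position in S of the first character of T is then returned. Otherwise the match fails and -1 is returned.

Time Complexity

Let n and m be the lengths of the main string and the pattern.
Worst case: O(n*m)

Implementation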

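class Solution {
    public int strStr(String source, String target) {
        if(source==null || target==null)
            return -1;

        // Indices are 1-based here; a 0-based version works just as well
        int i=1,j=1;
        int slen=source.length();
        int tlen=target.length();

        // An empty target matches at position 0
        if(tlen==0)
            return 0;
        if(slen==0 || slen<tlen)
            return -1;

        while(i<=slen && j<=tlen){
            if(source.charAt(i-1)==target.charAt(j-1)){
                i++;
                j++;
            }else{
                // On a mismatch, move i back to pos+1 in S, where pos=i-j+1
                // is the start of the current attempt.
                // With 0-based indices this would be: i=i-j+1;
                i=i-j+2;
                j=1;
            }
        }
        // j ran past the end of T, so the match succeeded.
        // Convert the 1-based start (i-tlen) to the 0-based index the problem asks for.
        if(j>tlen)      return i-tlen-1;
        else            return -1;
    }
}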
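
The O(n*m) worst case quoted above can be observed by counting character comparisons. The following is a minimal sketch, not the solution above: it is a 0-based variant that stops once fewer than m characters remain, and the names BruteForceCount and countComparisons are assumptions introduced for illustration.

```java
final class BruteForceCount {
    // Returns how many character comparisons a 0-based brute-force scan performs
    // before it either finds the first match or runs out of starting positions.
    static int countComparisons(String source, String target) {
        int n = source.length();
        int m = target.length();
        int comparisons = 0;
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (source.charAt(start + j) != target.charAt(j)) {
                    break; // mismatch: give up on this starting position
                }
                j++;
            }
            if (j == m) {
                return comparisons; // full match found, stop counting
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Ten 'a's against "aaab": 7 starting positions, 4 comparisons each.
        System.out.println(countComparisons("aaaaaaaaaa", "aaab")); // 28
    }
}
```

With n = 10 and m = 4 the count is (n - m + 1) * m = 28, which is the kind of input that drives the brute-force scan toward its worst case.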

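KMP Matching Algorithm

Idea

Compared with the BF algorithm, KMP eliminates the backtracking of the main-string pointer after a mismatch.

When a mismatch occurs, the pointer i into the main string S does not move back. Instead, the partial match already obtained is used to slide the pattern as far to the right as possible, and the comparison then continues.

Matching Process

Suppose the main string S is: acabaabaabcacaabc
and the pattern T is: abaabcac

We need a next array here (it is explained later; for now, just follow the matching process).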

j	1	2	3	4	5	6	7	8
next[j]	0	1	1	2	2	3	1	2

(1) First attempt

i   1  2    
    a  c  a  b  a  a  b  a  a  b  c  a  c  a  a  b  c
    a  b  a  a  b  c  a  c
       ^
j   1  2

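At i=2 the main string does not match the pattern at j=2. The table gives next[2]=1, so the first character of the pattern must be compared with the character at i=2; that is, the pattern slides right by one position.

(2) Second attempt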

i   1  2    
    a  c  a  b  a  a  b  a  a  b  c  a  c  a  a  b  c
       a  b  a  a  b  c  a  c
       ^
j      1

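next[1]=0, so both the main string and the pattern advance by one position (j is now 0, and moving it forward by one gives the first character of the pattern). Comparison restarts with i=3 against T[1].

(3) Third attempt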

i   1  2  3      ...     8  
    a  c  a  b  a  a  b  a  a  b  c  a  c  a  a  b  c
          a  b  a  a  b  c  a  c
                         ^
j         1      ...     6

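next[6]=3, so the 3rd character of the pattern must be compared with the character at i=8; that is, the pattern slides right by 3 positions (j - next[j] = 6 - 3).

(4) Fourth attempt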

i   1  2  3      ...     8       ...       14   
    a  c  a  b  a  a  b  a  a  b  c  a  c  a  a  b  c
                   a  b  a  a  b  c  a  c
                         ^
                j  1  2  3       ...       9
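
The four attempts above can be replayed in code. This is a rough sketch under stated assumptions: the next table is copied by hand from the table in this section rather than computed, and the class KmpTrace with its method match is a hypothetical helper, not part of the implementation given later.

```java
import java.util.ArrayList;
import java.util.List;

final class KmpTrace {
    // 1-based KMP matching. Every mismatch is recorded as "(i,j)".
    // Returns the 0-based index of the first match, or -1.
    static int match(String s, String t, int[] next, List<String> mismatches) {
        int i = 1;
        int j = 1;
        while (i <= s.length() && j <= t.length()) {
            if (j == 0 || s.charAt(i - 1) == t.charAt(j - 1)) {
                i++;
                j++;
            } else {
                mismatches.add("(" + i + "," + j + ")");
                j = next[j]; // i is never moved back
            }
        }
        return j > t.length() ? i - t.length() - 1 : -1;
    }

    public static void main(String[] args) {
        // next[] for "abaabcac", taken from the table above; slot 0 is unused.
        int[] next = {0, 0, 1, 1, 2, 2, 3, 1, 2};
        List<String> mismatches = new ArrayList<>();
        int pos = match("acabaabaabcacaabc", "abaabcac", next, mismatches);
        System.out.println(mismatches); // [(2,2), (2,1), (8,6)]
        System.out.println(pos);        // 5
    }
}
```

The three recorded mismatches correspond to attempts (1) to (3). The fourth attempt runs to j=9 with i=14, so the pattern is found at 0-based index 14 - 8 - 1 = 5.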

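About the next Array

next[j] gives the position in the pattern of the character that must next be compared with the current main-string character when the j-th character of the pattern fails to match it.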

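Computing the next Array

When the set in the definition of next is non-empty (that is, some proper prefix of the string "T[1]T[2]...T[j-1]" equals a proper suffix of it),
next[j] is the length of the longest such equal proper prefix and proper suffix, plus 1.

What is a proper prefix (suffix)? It is a prefix (suffix) that is not the whole string itself.
Example: aba
Proper prefixes: a  ab
Proper suffixes: a  ba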
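
The definition can be turned into code almost word for word. The sketch below is deliberately naive (it tries every length, so it is far slower than the recurrence used by the real implementation) and is meant only for checking small tables by hand; NextByDefinition and build are illustrative names.

```java
import java.util.Arrays;

final class NextByDefinition {
    // Returns a 1-based next table (slot 0 unused), computed directly from the definition:
    // next[j] = 1 + length of the longest proper prefix of T[1..j-1] that is also a proper suffix.
    static int[] build(String t) {
        int m = t.length();
        int[] next = new int[m + 1]; // next[1] stays 0
        for (int j = 2; j <= m; j++) {
            String head = t.substring(0, j - 1); // T[1..j-1]
            int best = 0;
            // Try the longest candidate first; len < head.length() keeps it proper.
            for (int len = head.length() - 1; len >= 1; len--) {
                if (head.startsWith(head.substring(head.length() - len))) {
                    best = len;
                    break;
                }
            }
            next[j] = best + 1;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("abaabcac"))); // [0, 0, 1, 1, 2, 2, 3, 1, 2]
    }
}
```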

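When j=1, the string does not exist, so next[1]=0.
When j=2, next[2]=1 by convention.
To find next[j+1], take the string "T[1]T[2]...T[j]" and look for its longest proper prefix that equals a proper suffix.
For this it suffices to compare T[j] with T[k], where k=next[j]:
If they are equal, next[j+1]=next[j]+1.
Otherwise, compare T[j] with T[k'], where k'=next[k]:
   If they are equal, next[j+1]=next[k]+1.
   Otherwise, compare T[j] with T[ next[k'] ].
   ...
   If the comparisons keep failing until next[k*]=0, then next[j+1]=1.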
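
A single step of this recurrence can be written as a short loop. This is a sketch under the assumption that next[1..j] are already known and stored 1-based; NextRecurrenceStep and nextAfter are hypothetical names chosen for the example.

```java
final class NextRecurrenceStep {
    // Given the pattern t and the known 1-based entries next[1..j], returns next[j+1].
    static int nextAfter(String t, int[] next, int j) {
        int k = next[j];
        // Fall back through next[k], next[next[k]], ... until T[j] == T[k] or k reaches 0.
        while (k > 0 && t.charAt(j - 1) != t.charAt(k - 1)) {
            k = next[k];
        }
        return k + 1; // k == 0 gives 1, matching the last rule above
    }

    public static void main(String[] args) {
        String t = "abaabcac";
        int[] next = {0, 0, 1, 1, 2, 2, 3, 1, 2}; // slot 0 unused
        System.out.println(nextAfter(t, next, 5)); // 3: T[5]='b' equals T[2]='b'
        System.out.println(nextAfter(t, next, 6)); // 1: 'c' fails against T[3] and T[1]
    }
}
```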

Time Complexity

Let n and m be the lengths of the main string and the pattern.
Time complexity: O(n+m)

Implementation

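class Solution {
    // Builds the next array (1-based; next[0] is unused).
    // The caller must pass a non-empty string.
    public static int[] getNext(String str){
        int len=str.length();
        int[] next=new int[len+1];
        next[1]=0;
        int j=1,k=0;
        while(j<len){
            // To compute next[j+1], compare T[j] with T[k], where k starts at next[j]
            if(k==0 || str.charAt(j-1)==str.charAt(k-1)){
                ++j;
                ++k;
                next[j]=k;
            }
            else k=next[k];     // fall back to the next shorter candidate
        }
        return next;
    }

    public static int strStr(String source, String target) {
        if(source==null || target==null)
            return -1;

        int slen=source.length();
        int tlen=target.length();

        if(tlen==0)
            return 0;
        if(slen==0 || slen<tlen)
            return -1;

        int[] next=getNext(target);
        int i=1,j=1;
        while(i<=slen && j<=tlen){
            // j==0 means even T[1] failed to match: advance i and restart from T[1]
            if(j==0 || source.charAt(i-1)==target.charAt(j-1)){
                i++;
                j++;
            }
            else j=next[j];     // i stays where it is; only the pattern slides right
        }
        // Convert the 1-based start (i-tlen) to a 0-based index
        if(j>tlen)      return i-tlen-1;
        else            return -1;
    }

}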
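
To see the linear bound on the matching phase, one can count comparisons: each comparison either advances i or moves j back, so the matching loop makes at most 2n of them. The sketch below assumes the same 1-based scheme as the code above; KmpCount and countComparisons are names introduced only for this illustration.

```java
final class KmpCount {
    // Same construction as getNext above: 1-based, slot 0 unused.
    static int[] getNext(String t) {
        int len = t.length();
        int[] next = new int[len + 1]; // next[1] defaults to 0
        int j = 1;
        int k = 0;
        while (j < len) {
            if (k == 0 || t.charAt(j - 1) == t.charAt(k - 1)) {
                ++j;
                ++k;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Counts the character comparisons made by the matching loop only.
    static int countComparisons(String s, String t) {
        int[] next = getNext(t);
        int i = 1;
        int j = 1;
        int comparisons = 0;
        while (i <= s.length() && j <= t.length()) {
            if (j == 0) { // no comparison happens here
                i++;
                j++;
                continue;
            }
            comparisons++;
            if (s.charAt(i - 1) == t.charAt(j - 1)) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Ten 'a's against "aaab": the count stays below 2n = 20.
        System.out.println(countComparisons("aaaaaaaaaa", "aaab")); // 17
    }
}
```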

Optimizing the KMP Code

An Example

Consider what happens when the main string is "aaabaaaab" and the pattern is "aaaab".

The next values of the pattern are:

j	1	2	3	4	5
next[j]	0	1	2	3	4

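This is somewhat inefficient: when S[4]='b' fails against T[4]='a', j falls back to 3, 2 and 1 in turn, and each time the same 'a' is compared with 'b' and fails again. So we introduce the nextval array.

It is computed as follows:
nextval[1]=0;
To compute nextval[j], compare T[j] with T[k], where k=next[j]:
If equal:   nextval[j]=nextval[k];
Otherwise:  nextval[j]=next[j];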

j	1	2	3	4	5
next[j]	0	1	2	3	4
nextval[j]	0	0	0	0	4
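
The saving can be measured by running the same matching loop with each table and counting comparisons. This is a small sketch, assuming both tables are copied by hand from the table above (slot 0 unused); NextValCount and countComparisons are made-up names.

```java
final class NextValCount {
    // Counts the character comparisons of 1-based KMP matching with the given fallback table.
    static int countComparisons(String s, String t, int[] table) {
        int i = 1;
        int j = 1;
        int comparisons = 0;
        while (i <= s.length() && j <= t.length()) {
            if (j == 0) { // no comparison happens here
                i++;
                j++;
                continue;
            }
            comparisons++;
            if (s.charAt(i - 1) == t.charAt(j - 1)) {
                i++;
                j++;
            } else {
                j = table[j];
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        int[] next = {0, 0, 1, 2, 3, 4};    // from the table above
        int[] nextval = {0, 0, 0, 0, 0, 4}; // from the table above
        System.out.println(countComparisons("aaabaaaab", "aaaab", next));    // 12
        System.out.println(countComparisons("aaabaaaab", "aaaab", nextval)); // 9
    }
}
```

The three comparisons saved are exactly the repeated 'a' against 'b' checks at i=4.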

The code is as follows

class Solution {
    // Builds the next array (1-based; next[0] is unused).
    // The caller must pass a non-empty string.
    public static int[] getNext(String str){
        int len=str.length();
        int[] next=new int[len+1];
        next[1]=0;
        int j=1,k=0;
        while(j<len){
            if(k==0 || str.charAt(j-1)==str.charAt(k-1)){
                ++j;
                ++k;
                next[j]=k;
            }
            else k=next[k];
        }
        return next;
    }
    // Derives the nextval array from the next array computed above
    public static int[] getNextVal(String str){
        int len=str.length();
        int[] nextval=new int[len+1];

        int j=2,k=0;
        nextval[1]=0;
        int[] next=getNext(str);
        while(j<=len){
            k=next[j];
            // If T[j]==T[k], falling back to k would repeat a comparison that
            // is bound to fail, so reuse the fallback already stored for k
            if(str.charAt(j-1)==str.charAt(k-1))
                nextval[j]=nextval[k];
            else
                nextval[j]=next[j];
            j++;
        }
        return nextval;
    }

    public static int strStr(String source, String target) {
        if(source==null || target==null)
            return -1;

        int slen=source.length();
        int tlen=target.length();

        if(tlen==0)
            return 0;
        if(slen==0 || slen<tlen)
            return -1;

        int[] nextval=getNextVal(target);
        int i=1,j=1;
        while(i<=slen && j<=tlen){
            if(j==0 || source.charAt(i-1)==target.charAt(j-1)){
                i++;
                j++;
            }
            else j=nextval[j];  // same loop as before, with nextval in place of next
        }
        // Convert the 1-based start (i-tlen) to a 0-based index
        if(j>tlen)      return i-tlen-1;
        else            return -1;
    }

}

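Comparing the Two Algorithms

The BF algorithm has O(n*m) worst-case time complexity, but in practice its running time is often close to O(n+m), so it is still in use.
The KMP algorithm is faster than BF only when there are many partial matches between the pattern and the main string.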