String Matching & the Rabin-Karp Algorithm Explained

January 26, 2019. Source: Roni_i

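Problem description:

Rabin-Karp needs O(m) preprocessing time, and its worst-case matching time is O((n - m + 1)m). That is the same matching time as the naive algorithm, plus some preprocessing, so why study it at all? Although Rabin-Karp is no better than naive matching in the worst case, in practice it is usually much faster, and its expected matching time is O(n) (see Introduction to Algorithms). Rabin-Karp does rely on arithmetic on numeric values, so it will generally not beat KMP. Why learn it once we already have KMP? In my view, what we learn is an idea, a way of approaching problems: the more techniques we have seen, the wider our view, and the easier it is to pick a suitable algorithm for a real problem. For two-dimensional pattern matching, for example, Rabin-Karp is a good choice.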

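Rabin-Karp is also an interesting algorithm because it treats characters as digits. The basic idea: let Tm be a substring of T of length |P|. If the numeric value of Tm modulo some number (usually a prime) equals the numeric value of the pattern P modulo the same number, then Tm may be a valid match. Equal residues do not guarantee equal strings, so each candidate still has to be confirmed by comparing the characters.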
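To make the "may be a valid match" part concrete, here is a minimal sketch that lists every shift whose window has the same residue as the pattern. The function names `valueMod` and `candidateShifts` are made up for this illustration, the input is assumed to consist of decimal digits only, and each window is recomputed from scratch for clarity instead of being rolled.

```cpp
#include <string>
#include <vector>

// Value of the decimal digit string `s`, reduced modulo q as it is built up.
// Assumption: `s` contains only the characters '0'..'9'.
long long valueMod(const std::string& s, long long q) {
    long long v = 0;
    for (char c : s) v = (v * 10 + (c - '0')) % q;
    return v;
}

// Shifts (0-based) whose window has the same residue as the pattern.
// These are only candidates: a later character comparison must confirm them.
std::vector<int> candidateShifts(const std::string& text,
                                 const std::string& pattern, long long q) {
    std::vector<int> shifts;
    if (pattern.size() > text.size()) return shifts;
    const long long p = valueMod(pattern, q);
    for (std::size_t s = 0; s + pattern.size() <= text.size(); ++s) {
        // Recomputed from scratch here; no rolling update yet.
        if (valueMod(text.substr(s, pattern.size()), q) == p)
            shifts.push_back(static_cast<int>(s));
    }
    return shifts;
}
```

With the text "2359023141526739921", the pattern "31415" and q = 13, this returns the shifts 6 and 12. Shift 6 is a real occurrence, while shift 12 covers "67399", which also leaves remainder 7 modulo 13: a spurious hit that only the character comparison can reject.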

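Rabin-Karp resembles the naive string-matching algorithm introduced earlier in that it examines every alignment of the pattern against the text. The difference is the preprocessing: each string is read as a number in some base and reduced modulo a fixed value, much like applying a hash function, and the algorithm compares these hash values instead of comparing characters directly. Preprocessing takes O(m) time, and matching takes O((n - m + 1)m) time in the worst case.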

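The idea behind the Rabin-Karp algorithm:

Let the pattern have length M and the text have length N (N >= M).
First compute the hash of the pattern and the hash of the first M characters of the text.
Compare the two hash values; this comparison is made N - M + 1 times in total:
If the hashes differ, move on and compute the hash of the next length-M substring of the text.
If the hashes are equal, use the naive comparison to confirm that the two substrings really are identical.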


We can compute p in time O(m) using Horner's rule (see Section 32.1):

p = P[m] + 10 (P[m - 1] + 10(P[m - 2] + . . . + 10(P[2] + 10P[1]) . . . )).
The value t0 can be similarly computed from T[1 . . m] in time O(m).
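Horner's rule can be sketched in a few lines. The helper name `hornerValue` is hypothetical, and the sketch assumes a string of decimal digits short enough for the value to fit in 64 bits.

```cpp
#include <string>

// Horner's rule: walk the digits left to right, multiplying the running
// value by 10 before adding the next digit. One pass, so O(m) time.
// Assumption: `digits` holds only '0'..'9' and the value fits in 64 bits.
unsigned long long hornerValue(const std::string& digits) {
    unsigned long long p = 0;
    for (char c : digits)
        p = 10 * p + static_cast<unsigned long long>(c - '0');
    return p;
}
```

For the pattern "31415" this evaluates 1 + 10(4 + 10(1 + 10(3 + 10 * 0))) and then adds the last digit in the same way, giving 31415.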

To compute the remaining values $t_1, t_2, \ldots, t_{n-m}$ in time O(n - m), it suffices to observe that $t_{s+1}$ can be computed from $t_s$ in constant time, since

$t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]$.    (34.1)

For example, if m = 5 and $t_s = 31415$, then we wish to remove the high-order digit T[s + 1] = 3 and bring in the new low-order digit (suppose it is T[s + 5 + 1] = 2) to obtain

$t_{s+1} = 10 (31415 - 10000 \cdot 3) + 2 = 14152$.

http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm
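The constant-time update in equation (34.1) can be tried out with a small sketch. The names `tenToThe` and `rollDecimal` are invented here, and the power of ten is assumed to be computed once up front so that each update costs a fixed number of operations.

```cpp
// 10^e, computed once before the scan (assumes the result fits in 64 bits).
long long tenToThe(int e) {
    long long r = 1;
    for (int i = 0; i < e; ++i) r *= 10;
    return r;
}

// One step of equation (34.1): drop the high-order digit `high`, shift the
// remaining digits left by one place, and append the new low-order digit `low`.
// `highPow` is 10^(m-1) for a window of m digits.
long long rollDecimal(long long ts, int high, int low, long long highPow) {
    return 10 * (ts - highPow * high) + low;
}
```

Calling `rollDecimal(31415, 3, 2, tenToThe(4))` reproduces the worked example and yields 14152.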

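The algorithm above is simple, but it starts to give wrong results once the pattern P reaches a length of about 7, because the values of p and t no longer fit in the integer type. Declaring t and p as long unsigned int does not solve the underlying problem, so the direct version is of little practical use.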

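Here b is the base of the hash, so the string is treated as a number written in base b. For a string S = s1 s2 s3 ... sn, the hash of the length-m substring starting at position k+1, S[k+1..k+m], can then be computed directly from the hash of the substring starting at position k, S[k..k+m-1]:

$H(S[k+1..k+m]) = (H(S[k..k+m-1]) \cdot b - s_k b^m + s_{k+m}) \bmod h$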

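The difficulty is that p and t can become very large, which makes them awkward to handle directly. There is a simple remedy: compute p and t modulo a suitable number q. Each character is in effect a decimal digit, so p, t and the recurrence can all be evaluated modulo q. The value of p mod q can then be computed in O(m) time, and all the t values mod q in O(n - m + 1) time. See Introduction to Algorithms or http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm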

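The recurrence is:

$t_{s+1} = (d (t_s - T[s+1] h) + T[s+m+1]) \bmod q$

where $h = d^{m-1} \bmod q$ is the weight of the high-order digit.

For example, with d = 10 (decimal), m = 5 and $t_s = 31415$, we want to remove the high-order digit T[s + 1] = 3 and bring in a new low-order digit (say T[s + 5 + 1] = 2). Before the reduction modulo q this gives:

$t_{s+1} = 10 (31415 - 10000 \cdot 3) + 2 = 14152$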
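The same step can be carried out entirely on residues. The sketch below is one possible way to write the modular update; `rollMod` is a name chosen for this example, and q = 13 is borrowed from the textbook illustration rather than from the text above.

```cpp
// One modular rolling step: (d * (ts - high * h) + low) mod q,
// normalized into the range [0, q).
// ts   : residue of the current window
// high : digit leaving the window, low : digit entering it
// h    : d^(m-1) mod q, d : base, q : modulus
long long rollMod(long long ts, long long high, long long low,
                  long long h, long long d, long long q) {
    long long next = (d * (ts - high * h) + low) % q;
    if (next < 0) next += q;  // in C++ the result of % can be negative
    return next;
}
```

With q = 13 we have 31415 mod 13 = 7 and h = 10000 mod 13 = 3, so `rollMod(7, 3, 2, 3, 10, 13)` computes (10 * (7 - 9) + 2) mod 13 = -18 mod 13 = 8, which agrees with 14152 mod 13 = 8. The negative intermediate value is why the normalization step is needed.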

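By repeatedly computing the hash of the substring whose start is shifted one position to the right, the hashes for all positions are obtained in O(n) time, so the whole match takes O(n + m) time as long as hash collisions are rare. In an implementation, the hash can be kept in a 64-bit unsigned integer with h = 2^64, so that natural overflow takes the place of an explicit modulo operation.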


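#include <string>
using namespace std;

typedef unsigned long long ull;
const ull b=100000007;// base of the hash

// Does C occur in S?
bool contain(const string &C,const string &S)
{
     int m=C.length(),n=S.length();
     if(m>n)  return false;

     // compute b to the power m (modulo 2^64 through natural overflow)
     ull t=1;
     for(int i=0;i<m;i++)   t*=b;

     // hash of C and of the length-m prefix of S
     ull Chash=0,Shash=0;
     for(int i=0;i<m;i++)   Chash=Chash*b+C[i];
     for(int i=0;i<m;i++)   Shash=Shash*b+S[i];

     // slide the window over S one position at a time, updating the hash
     for(int i=0;i+m<=n;i++){
          // equal hashes: compare the characters to rule out a collision
          if(Chash==Shash && S.compare(i,m,C)==0)  return true;// the length-m substring of S starting at i equals C
          if(i+m<n)  Shash=Shash*b-S[i]*t+S[i+m];
      }
      return false;
}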

Rolling hash (the Rabin-Karp algorithm)

hash( txt[s+1 .. s+m] ) = ( d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q

hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)

/* Following program is a C implementation of Rabin Karp
Algorithm given in the CLRS book */
#include<stdio.h>
#include<string.h>
 
// d is the number of characters in the input alphabet
#define d 256
 
/* pat -> pattern
    txt -> text
    q -> A prime number
*/
void search(char pat[], char txt[], int q)
{
    int M = strlen(pat);
    int N = strlen(txt);
    int i, j;
    int p = 0; // hash value for pattern
    int t = 0; // hash value for txt
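    int h = 1;

    // A pattern longer than the text cannot match; returning here also
    // keeps the hashing loop below from reading past the end of txt
    if (M > N)
        return;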
 
    // The value of h would be "pow(d, M-1)%q"
    for (i = 0; i < M-1; i++)
        h = (h*d)%q;
 
    // Calculate the hash value of pattern and first
    // window of text
    for (i = 0; i < M; i++)
    {
        p = (d*p + pat[i])%q;
        t = (d*t + txt[i])%q;
    }
 
    // Slide the pattern over text one by one
    for (i = 0; i <= N - M; i++)
    {
 
        // Check the hash values of the current window of text
        // and the pattern. Only if the hash values match,
        // check the characters one by one
        if ( p == t )
        {
            /* Check for characters one by one */
            for (j = 0; j < M; j++)
            {
                if (txt[i+j] != pat[j])
                    break;
            }
 
            // if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]
            if (j == M)
                printf("Pattern found at index %d \n", i);
        }
 
        // Calculate hash value for next window of text: Remove
        // leading digit, add trailing digit
        if ( i < N-M )
        {
            t = (d*(t - txt[i]*h) + txt[i+M])%q;
 
            // We might get negative value of t, converting it
            // to positive
            if (t < 0)
                t = (t + q);
        }
    }
}
 
/* Driver program to test above function */
int main()
{
    char txt[] = "GEEKS FOR GEEKS";
    char pat[] = "GEEK";
    int q = 101; // A prime number
    search(pat, txt, q);
    return 0;
}
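The program above prints its matches and does its arithmetic in `int`, which is fine for q = 101 but could overflow with a much larger prime. As a possible variant (not the code from the referenced article), the sketch below returns the match positions instead and uses 64-bit arithmetic. The function name `rabinKarpAll` and the default modulus 1000000007 are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Returns every 0-based index at which `pat` occurs in `txt`.
// Assumption: q is at most about 10^9, so all products fit in 64 bits.
std::vector<int> rabinKarpAll(const std::string& pat, const std::string& txt,
                              long long q = 1000000007LL) {
    const long long d = 256;  // alphabet size, as in the C program
    std::vector<int> hits;
    const std::size_t M = pat.size(), N = txt.size();
    if (M == 0 || M > N) return hits;

    long long h = 1, p = 0, t = 0;
    for (std::size_t i = 0; i + 1 < M; ++i) h = (h * d) % q;  // d^(M-1) mod q
    for (std::size_t i = 0; i < M; ++i) {
        p = (d * p + static_cast<unsigned char>(pat[i])) % q;
        t = (d * t + static_cast<unsigned char>(txt[i])) % q;
    }

    for (std::size_t i = 0; i + M <= N; ++i) {
        // Equal hashes only make a candidate; compare the characters too.
        if (p == t && txt.compare(i, M, pat) == 0)
            hits.push_back(static_cast<int>(i));
        if (i + M < N) {
            const long long out = static_cast<unsigned char>(txt[i]);
            const long long in = static_cast<unsigned char>(txt[i + M]);
            t = (d * (t - out * h) + in) % q;
            if (t < 0) t += q;  // keep the hash in [0, q)
        }
    }
    return hits;
}
```

For the text "GEEKS FOR GEEKS" and the pattern "GEEK" this returns the indices 0 and 10, the same positions the C program reports. Returning a vector makes the routine easier to test than printing from inside the loop.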

References: http://www.geeksforgeeks.org/archives/11937

References: http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm

http://www.cnblogs.com/feature/articles/1813967.html (translation of the PKU

    Original author: Roni_i
    Original URL: https://www.cnblogs.com/Roni-i/p/9447409.html
