Thoughts on Palindromic Substrings in Strings

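I happened to run into a problem about palindromic substrings on Codeforces.

The original problem probably does not demand a sophisticated palindromic-substring algorithm, but I will take this opportunity to summarize the topic anyway.

I consulted quite a few sources, mainly wiki, stackoverflow, and leetcode.


OK, let's begin. First, some background.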

Longest Palindromic Substring Part I

November 20, 2011 in dynamic programming, string

Given a string S, find the longest palindromic substring in S.


This interesting problem has been featured in the famous Greplin programming challenge, and is asked quite often in the interviews. Why? Because this problem can be attacked in so many ways. There are five different solutions that I am aware of. Are you up to the challenge?

Head over to Online Judge to solve it now! (you may submit either C++ or Java solution)

Hint:
First, make sure you understand what a palindrome means. A palindrome is a string which reads the same in both directions. For example, “aba” is a palindrome, “abc” is not.

A common mistake:
Some people will be tempted to come up with a quick solution, which is unfortunately flawed (however can be corrected easily):

Reverse S to get S’. Find the longest common substring between S and S’, which must also be the longest palindromic substring.

This seems to work at first. Let’s look at some examples below.

For example,
S = “caba”, S’ = “abac”.
The longest common substring between S and S’ is “aba”, which is the answer.

Let’s try another example:
S = “abacdfgdcaba”, S’ = “abacdgfdcaba”.
The longest common substring between S and S’ is “abacd”. Clearly, this is not a valid palindrome.

We can see that the longest common substring method fails when there exists a reversed copy of a non-palindromic substring in some other part of S. To rectify this, each time we find a longest common substring candidate, we check whether the substring’s indices are the same as the reversed substring’s original indices. If they are, we attempt to update the longest palindrome found so far; if not, we skip this candidate and move on to the next one.
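The index check described above can be sketched in C++ as follows. This is only an illustration under assumptions: the function name longestPalindromeLCS and the two rolling rows (which give the O(N) space variant) are choices made here, not code from the original article.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the corrected "reverse + longest common substring" idea.
// The name longestPalindromeLCS is hypothetical (not from the article).
std::string longestPalindromeLCS(const std::string& s) {
    const int n = static_cast<int>(s.size());
    const std::string rev(s.rbegin(), s.rend());
    // prev[j] / cur[j]: length of the longest common suffix of
    // s[0..i-1] and rev[0..j-1] for the previous / current row i.
    std::vector<int> prev(n + 1, 0), cur(n + 1, 0);
    int bestStart = 0, bestLen = 0;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= n; ++j) {
            cur[j] = (s[i - 1] == rev[j - 1]) ? prev[j - 1] + 1 : 0;
            const int len = cur[j];
            // The match covers s[i-len .. i-1]. Its copy in rev ends at j-1,
            // which maps back to the original start index n - j.
            // Only accept the candidate when both start indices coincide.
            if (len > bestLen && n - j == i - len) {
                bestStart = i - len;
                bestLen = len;
            }
        }
        std::swap(prev, cur);
    }
    return s.substr(bestStart, bestLen);
}
```

With this check, the input “abacdfgdcaba” yields “aba” instead of the invalid candidate “abacd”.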

This gives us an O(N^2) DP solution which uses O(N^2) space (could be improved to use O(N) space). Please read more about Longest Common Substring here.

Brute force solution, O(N^3):
The obvious brute force solution is to pick all possible starting and ending positions for a substring, and verify if it is a palindrome. There are a total of C(N, 2) such substrings (excluding the trivial solution where a character itself is a palindrome).

Since verifying each substring takes O(N) time, the run time complexity is O(N^3).

I will not post code or go into detail for the brute force solution; it is not very interesting.

Dynamic programming solution, O(N^2) time and O(N^2) space:
To improve over the brute force solution from a DP approach, first think how we can avoid unnecessary re-computation in validating palindromes. Consider the case “ababa”. If we already knew that “bab” is a palindrome, it is obvious that “ababa” must be a palindrome since the two left and right end letters are the same.

Stated more formally below:

Define P[ i, j ] ← true iff the substring S_i … S_j is a palindrome, otherwise false.

Therefore,

P[ i, j ] ← ( P[ i+1, j-1 ] and S_i = S_j )

The base cases are:

P[ i, i ] ← true

P[ i, i+1 ] ← ( S_i = S_i+1 )

This yields a straightforward DP solution: we first initialize the one- and two-letter palindromes, then work our way up to all three-letter palindromes, and so on.

This gives us a run time complexity of O(N^2) and uses O(N^2) space to store the table.

The code is as follows:

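#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Returns the longest palindromic substring of s using an O(N^2) table.
// dp[i][j] is true iff s[i..j] is a palindrome.
string longestPalindromeDP(const string& s)
{
	int slen = s.length();
	if (slen == 0) return "";

	vector<vector<bool> > dp(slen, vector<bool>(slen, false));
	int index = 0;   // start of the best palindrome found so far
	int maxlen = 1;  // any single character is a palindrome

	// Base case 1: every single character is a palindrome.
	for (int i = 0; i < slen; i++)
		dp[i][i] = true;

	// Base case 2: two adjacent characters form a palindrome only if they are equal.
	for (int i = 0; i < slen-1; i++)
	{
		if (s[i] == s[i+1])
		{
			dp[i][i+1] = true;
			if (maxlen < 2)
			{
				index = i;
				maxlen = 2;
			}
		}
	}

	// Longer substrings: s[i..j] is a palindrome iff its two ends match
	// and the inner substring s[i+1..j-1] is itself a palindrome.
	for (int sublen = 3; sublen <= slen; sublen++)
	{
		for (int i = 0; i+sublen-1 < slen; i++)
		{
			int j = i+sublen-1;
			if (s[i] == s[j] && dp[i+1][j-1])
			{
				dp[i][j] = true;
				if (sublen > maxlen)
				{
					index = i;
					maxlen = sublen;
				}
			}
		}
	}

	return s.substr(index, maxlen);
}

int main()
{
	string str;
	cin >> str;

	cout << longestPalindromeDP(str) << endl;
	return 0;
}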

Longest Palindromic Substring Part II

November 20, 2011 in string

Given a string S, find the longest palindromic substring in S.

Note:
This is Part II of the article: Longest Palindromic Substring. Here, we describe an algorithm (Manacher’s algorithm) which finds the longest palindromic substring in linear time. Please read Part I for more background information.

In my previous post we discussed a total of four different methods; among them is a pretty simple algorithm with O(N^2) run time and constant space complexity. Here, we discuss an algorithm that runs in O(N) time and O(N) space, also known as Manacher’s algorithm.
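The simple O(N^2) time, constant-space method is not reproduced in this post. One well-known technique that fits that description is expanding around each possible center, and the sketch below assumes that this is the method meant. The names expandAroundCenter and longestPalindromeExpand are illustrative only.

```cpp
#include <string>

// Grow a palindrome outward from the pair (left, right) and return its length.
// Use left == right for odd lengths, right == left + 1 for even lengths.
static int expandAroundCenter(const std::string& s, int left, int right) {
    const int n = static_cast<int>(s.size());
    while (left >= 0 && right < n && s[left] == s[right]) {
        --left;
        ++right;
    }
    return right - left - 1;
}

// Tries all 2N-1 centers; O(N^2) time, O(1) extra space.
std::string longestPalindromeExpand(const std::string& s) {
    int bestStart = 0, bestLen = 0;
    for (int i = 0; i < static_cast<int>(s.size()); ++i) {
        const int odd = expandAroundCenter(s, i, i);       // centered on s[i]
        const int even = expandAroundCenter(s, i, i + 1);  // centered between s[i] and s[i+1]
        const int len = odd > even ? odd : even;
        if (len > bestLen) {
            bestLen = len;
            bestStart = i - (len - 1) / 2;
        }
    }
    return s.substr(bestStart, bestLen);
}
```

Inputs such as “aaaaaaaaa” make every expansion run almost to the ends of the string, which is the repeated work that Manacher’s algorithm avoids.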

Hint:
Think how you would improve over the simpler O(N^2) approach. Consider the worst case scenarios. The worst case scenarios are the inputs with multiple palindromes overlapping each other. For example, the inputs: “aaaaaaaaa” and “cabcbabcbabcba”. In fact, we could take advantage of the palindrome’s symmetric property and avoid some of the unnecessary computations.

An O(N) Solution (Manacher’s Algorithm):
First, we transform the input string, S, to another string T by inserting a special character ‘#’ in between letters. The reason for doing so will become clear shortly.

For example: S = “abaaba”, T = “#a#b#a#a#b#a#”.

To find the longest palindromic substring, we need to expand around each T_i such that T_i-d … T_i+d forms a palindrome. You should immediately see that d is the length (measured in S) of the palindrome centered at T_i.

We store intermediate results in an array P, where P[ i ] equals the length of the palindrome centered at T_i. The longest palindromic substring is then given by the maximum element in P.

Using the above example, we populate P as below (from left to right):

T = # a # b # a # a # b # a #
P = 0 1 0 3 0 1 6 1 0 3 0 1 0

Looking at P, we immediately see that the longest palindrome is “abaaba”, as indicated by P[ 6 ] = 6.
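To make the table concrete, here is a small sketch that builds T and fills P by plain expansion, with none of Manacher’s shortcuts yet. The helper names insertSeparators and naiveRadii are made up for this illustration, and bounds are checked explicitly instead of using sentinels.

```cpp
#include <string>
#include <vector>

// Insert '#' between letters and at both ends: "abaaba" -> "#a#b#a#a#b#a#".
std::string insertSeparators(const std::string& s) {
    std::string t = "#";
    for (char c : s) {
        t += c;
        t += '#';
    }
    return t;
}

// P[i] = largest d such that T[i-d .. i+d] is a palindrome,
// found by expanding around every index of T (O(N^2) in the worst case).
std::vector<int> naiveRadii(const std::string& t) {
    const int n = static_cast<int>(t.size());
    std::vector<int> p(n, 0);
    for (int i = 0; i < n; ++i) {
        int d = 0;
        while (i - d - 1 >= 0 && i + d + 1 < n && t[i - d - 1] == t[i + d + 1]) {
            ++d;
        }
        p[i] = d;
    }
    return p;
}
```

For S = “abaaba”, naiveRadii(insertSeparators(S)) gives 0 1 0 3 0 1 6 1 0 3 0 1 0, matching the table above.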

Did you notice that by inserting special characters (#) in between letters, palindromes of both odd and even lengths are handled gracefully? (Please note: This is to demonstrate the idea more easily and is not necessarily needed to code the algorithm.)

Now, imagine that you draw a vertical line at the center of the palindrome “abaaba”. Did you notice that the numbers in P are symmetric around this center? That’s not all: try another palindrome, “aba”, and the numbers show the same symmetric property. Is this a coincidence? The answer is yes and no. The symmetry only holds subject to a condition, but we have still made great progress, since we can avoid recomputing part of the P[ i ] values.

Let us move on to a slightly more sophisticated example with some more overlapping palindromes, where S = “babcbabcbaccba”.

Figure: T transformed from S = “babcbabcbaccba”. Assume that you have reached a state where table P is partially completed. The solid vertical line indicates the center (C) of the palindrome “abcbabcba”. The two dotted vertical lines indicate its left (L) and right (R) edges respectively. You are at index i and its mirrored index around C is i’. How would you calculate P[ i ] efficiently?

Assume that we have arrived at index i = 13, and we need to calculate P[ 13 ] (indicated by the question mark ?). We first look at its mirrored index i’ around the palindrome’s center C, which is index i’ = 9.

Figure: The two green solid lines indicate the regions covered by the two palindromes centered at i and i’. We look at the mirrored index of i around C, which is index i’. P[ i’ ] = P[ 9 ] = 1. It is clear that P[ i ] must also be 1, due to the symmetric property of a palindrome around its center.

As you can see above, it is very obvious that P[ i ] = P[ i’ ] = 1, which must be true due to the symmetric property around a palindrome’s center. In fact, all three elements after C follow the symmetric property (that is, P[ 12 ] = P[ 10 ] = 0, P[ 13 ] = P[ 9 ] = 1, P[ 14 ] = P[ 8 ] = 0).

Figure: Now we are at index i = 15, and its mirrored index around C is i’ = 7. Is P[ 15 ] = P[ 7 ] = 7?

Now we are at index i = 15. What’s the value of P[ i ]? If we follow the symmetric property, the value of P[ i ] should be the same as P[ i’ ] = 7. But this is wrong. If we expand around the center at T_15, it forms the palindrome “a#b#c#b#a”, which is actually shorter than what is indicated by its symmetric counterpart. Why?

Figure: Colored lines are overlaid around the centers at index i and i’. Solid green lines show the region that must match for both sides due to the symmetric property around C. Solid red lines show the region that might not match for both sides. Dotted green lines show the region that crosses over the center.

It is clear that the two substrings in the region indicated by the two solid green lines must match exactly. Areas across the center (indicated by dotted green lines) must also be symmetric. Notice carefully that P[ i’ ] is 7 and it expands all the way across the left edge (L) of the palindrome (indicated by the solid red lines), which does not fall under the symmetric property of the palindrome anymore. All we know is P[ i ] ≥ 5, and to find the real value of P[ i ] we have to do character matching by expanding past the right edge (R). In this case, since T[ 21 ] ≠ T[ 9 ] (‘c’ versus ‘b’), we conclude that P[ i ] = 5.
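The bound-then-expand step for a single index can be sketched like this. It is a hedged illustration: radiusWithMirror is a hypothetical helper, T is indexed from 0 without sentinels as in the figures, and the values C = 11 and R = 20 were worked out here from the example and are not stated in the text.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: compute P[i] given the current center c, the right
// edge r of the palindrome centered at c, and the already-known radii p[0..i-1].
int radiusWithMirror(const std::string& t, const std::vector<int>& p,
                     int c, int r, int i) {
    const int n = static_cast<int>(t.size());
    const int mirror = 2 * c - i;  // i' in the text
    // Inside the palindrome around c, the mirror value is trustworthy only
    // up to the right edge, so start from min(R - i, P[i']).
    int d = (i < r) ? std::min(r - i, p[mirror]) : 0;
    // Expand by direct character comparison beyond what the bound guarantees.
    while (i - d - 1 >= 0 && i + d + 1 < n && t[i - d - 1] == t[i + d + 1]) {
        ++d;
    }
    return d;
}
```

For T = “#b#a#b#c#b#a#b#c#b#a#c#c#b#a#” with P[ 0 … 14 ] = 0 1 0 3 0 1 0 7 0 1 0 9 0 1 0, calling radiusWithMirror(t, p, 11, 20, 15) starts from min(20 - 15, 7) = 5, compares T[ 9 ] with T[ 21 ], and stops at 5.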

Let’s summarize the key part of this algorithm as below:

if P[ i’ ] < R – i,

then P[ i ] ← P[ i’ ]

else P[ i ] ≥ R – i. (In this case we have to expand past the right edge (R) to find P[ i ].)

See how elegant it is? If you are able to grasp the above summary fully, you have already obtained the essence of this algorithm, which is also the hardest part.

The final part is to determine when we should move the position of C together with R to the right, which is easy:

If the palindrome centered at i does expand past R, we update C to i, (the center of this new palindrome), and extend R to the new palindrome’s right edge.

In each step, there are two possibilities. If P[ i’ ] < R – i, we set P[ i ] to P[ i’ ], which takes exactly one step. Otherwise we attempt to change the palindrome’s center to i by expanding it starting at the right edge, R. Extending R (the inner while loop) takes at most a total of N steps, and positioning and testing each center takes a total of N steps too. Therefore, this algorithm is guaranteed to finish in at most 2*N steps, giving a linear time solution.

The code is attached below:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Transform S into T.
// For example, S = "abba", T = "^#a#b#b#a#$".
// ^ and $ signs are sentinels appended to each end to avoid bounds checking
// (this assumes S itself contains none of '^', '$' and '#').
string preProcess(const string& s) {
	int n = s.length();
	if (n == 0) return "^$";
	string ret = "^";
	for (int i = 0; i < n; i++)
		ret += "#" + s.substr(i, 1);

	ret += "#$";
	return ret;
}

string longestPalindrome(const string& s)
{
	if (s.empty()) return "";

	string t = preProcess(s);  // preprocess s by inserting '#'
	int tlen = t.length();
	vector<int> p(tlen, 0);    // p[i]: radius of the palindrome centered at t[i]

	int center = 0, right = 0;
	int index = 0, maxlen = 0;
	for (int i = 1; i < tlen-1; i++)
	{
		int i_mirror = 2 * center - i;
		// Example from the article (indices there ignore the leading '^',
		// so every index in this code is one larger):
		// s      babcbabcbaccba
		// index  0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20
		// t      #,  b,  #,  a,  #,  b,  #,  c,  #,  b,   #,   a,   #,   b,   #,   c,   #,  ...
		// p      0,  1,  0,  3,  0,  1,  0,  7,  0,  1,   0,   9,   0,   1,   0,   ?, ...
		p[i] = i < right ? min(right-i, p[i_mirror]) : 0;

		// Attempt to expand the palindrome centered at i; the sentinels stop the loop.
		while (t[i+1+p[i]] == t[i-1-p[i]])
			p[i]++;

		// If the palindrome centered at i expands past right,
		// move the center to i and extend the right edge.
		if (i+p[i] > right)
		{
			center = i;
			right = i + p[i];
		}

		// Remember the longest palindrome seen so far.
		if (p[i] > maxlen)
		{
			index = i;
			maxlen = p[i];
		}
	}
	// Map the center and radius in t back to a start position in s.
	return s.substr((index-maxlen-1)/2, maxlen);
}

int main()
{
	string str;
	cin >> str;
	cout << longestPalindrome(str) << endl;
	return 0;
}

Useful Links:

» Manacher’s Algorithm O(N) 时间求字符串的最长回文子串 (Best explanation if you can read Chinese)

» A simple linear time algorithm for finding longest palindrome sub-string

» Finding Palindromes

» Finding the Longest Palindromic Substring in Linear Time

» Wikipedia: Longest Palindromic Substring
