Understanding the prefix-function of the Knuth-Morris-Pratt algorithm

Intuition is nothing but the outcome of earlier intellectual experience.    – Albert Einstein

There are plenty of tutorials about the ideas of the KMP algorithm, leaving us the wonder and owe of its magnificent simplicity and effectiveness. But one important part is frequently missing – why the prefix function can be computed in that particular way; what is the rationale behind the computation of the prefix function. This post aims to demystify the prefix function and elaborate on the proofs to understand why it works.

Let’s first review the computation of the prefix function, as shown below.
Screen Shot 2016-04-20 at 5.37.41 PM
Here I will briefly go through the algorithm and point out its challenges for understanding.  Line 1 until line 4 are the preparation of the algorithm. As an example, for string “ababaca”, the result after the first four lines is shown below. The red arrow represents the value of q and the blue arrow represents the value of k. So far, so good.
Screen Shot 2016-04-20 at 5.48.41 PM
The rest part, from line 5 to 10, is where the mystery and fun comes from. This part consists of two loops, one “for” loop at line 5 and one “while” loop at line 6. The job of the first for loop is to iterate until the end of the string, to populate proper values of $latex \pi[q]$, one at a time. And the aim of the “while” loop is to take steps back when a mismatch is present. The outer loop is easy to understand, but the inner loop will require some effort to comprehend. The following figures demonstrate the outcomes of the outer for loop when q is iterating from 2 to 5. As we can see, as q progresses, k keeps track of the current maximal match size of the prefix and suffix, during which the inner while loop is never executed because of the loop condition – mismatch when k is positive – is never satisfied.
01234
The next is the tricky part, when q is equal to 6, there is a mismatch between P[k+1] and P[q], the inner while loop must execute. In the while loop, the value of q maintain the same while the value of k steps back until a match is found or k is zero.
Why changing the value of k to $latex \pi[k]$?
Indeed, what this step trying to do, is to iteratively find smaller and smaller matches of prefixes and suffixes to proceed string matching, as shown in Figures below. While q=5, the biggest length of matching prefix and suffix is 3, in the sense that the the first three characters “aba” matches the last three characters “aba”. However, this is not the only match, since the first character “a” also matches the last character. Therefore, we iteratively test each matching to find the proper continuation of the next character P[q+1].
Screen Shot 2016-04-20 at 6.30.05 PM
Screen Shot 2016-04-20 at 6.38.18 PM
Screen Shot 2016-04-20 at 6.40.07 PM

The next question is, why by iteratively applying $latex k=\pi[k]$, we can progressively obtain the next smaller match of the prefix and suffix. The prefix-function iteration lemma is vital for understanding this problem. The lemma is defined as follows:

Lemma 1
Let P be a pattern of length m with prefix function $latex \pi$. Then, for q = 1,2,…,m, we have $latex \pi^{*}[q]=\{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$.

Here the prefix function $latex \pi&fg=0000ff $ is defined as $latex \pi \left [ q \right ] = max \left \{   k < q \ and \ P_{k} \sqsupset P_{q} \right \}  $, meaning the length of the longest prefix of $latex P$ that is also a suffix of $latex P_{q}$; $latex \pi^{*}[q]&fg=0000ff$ is defined in terms of a collection of functional iterations that takes the form $latex \pi^{*}[q]=\{\pi[q], \pi^{(2)}[q],\pi^{(3)}[q],\dots, \pi^{(t)}[q]\}$ where $latex \pi^{0}[q]=q$, $latex \pi^{i}[q]=\pi \left [ \pi^{(i-1)} \left [ q \right ] \right ]$ and the sequence stops upon reaching $latex \pi^{(t)}[q] = 0$; $latex P_{k} \sqsupset P_{q}&fg=0000ff $ denotes that string $latex P_{k}$ is a suffix of string $latex P_{q}$, meaning the k-character prefix of $latex P \left[ 1 .. k \right ]$ is a suffix of $latex P_{q}$.

Too mathematical? In layman’s terms, the lemma simply means, $latex \pi^{*}[q]$ is a set in which all the values $latex k$ satisfies the condition that the prefix $latex P_{k}$ is also a suffix of $latex P_{q}$. In other words, $latex \pi^{*}[q]$ includes a collection of sequence lengths where the prefix and suffix are matching. As an example, for string “abacaba”, we find two matches between prefixes and suffixes: “abacaba” and “abacaba“, therefore $latex \pi^{*}\left [ 7  \right ] = \left \{  1, 3 \right \} $ where $latex q=7$ which is the length of the string. For the same string “abacaba”, $latex \pi^{*}\left [ 3  \right ] = \left \{  1 \right \} $, meaning the only match is “abacaba” for the first three characters “aha”. And $latex \pi[q]$ is the largest value in $latex \pi^{*}[q]$, shown in the Figure below.

Screen Shot 2016-04-20 at 3.51.24 PMFigure 1. The coloured lines represent the matches of prefix and suffix

Proof

To prove $latex \pi^{*}[q]=\{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$, we need to prove $latex \pi^{*}[q] \subseteq \{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$ and $latex \pi^{*}[q] \supseteq  \{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$ respectively.

Step 1

We first prove that $latex \pi^{*}[q] \subseteq \{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$, that is $latex i\in \pi^{*}[q]\ implies\ P_{i} \sqsupset P_{q}$, meaning all the values in $\pi^{*}$ can form proper matches between the prefix and suffix.
If $latex i\in \pi^{*}[q]$, then $latex i=\pi^{u}[q]$ for some $latex u > 0$. Therefore, to prove $latex i\in \pi^{*}[q]\ implies\ P_{i} \sqsupset P_{q}$, we need to prove for some $latex u > 0$, $latex P_{\pi^{u}[q]} \sqsupset P_{q}$.We prove this by induction on u.

  • the base case
    For $latex u=1$, we have $latex \pi^{u}[q]=\pi[q]$. Because the prefix function $latex \pi$ is the length of the longest prefix of $latex P$ that is also a suffix of $latex P_{q}$, $latex \ P_{\pi[q]} \sqsupset P_{q}$ holds.
  • induction step
    If the statement holds for some natural number $latex u$, that is $latex P_{\pi^{u}\left [q \right ]} \sqsupset P_{q}$, meaning the u-th iteration of $latex \pi^{*}$ is a proper prefix. The (u+1)-th iteration $latex \pi^{u+1}[q]$ should also be a proper prefix that matches the suffix.
    This can  be demonstrated using the figure below. If the maximum prefix A and suffix B of a string are identical, the $latex \pi$ value by the end of B will be the length of B, which is equivalent to the size of A, let’s denote it by $latex \pi^{u}[q]$.  Since  $latex \pi^{u}[q]$ is the length of prefix string A, the value of $latex \pi^{u+1}[q]=\pi[\pi^{u}[q]]$, can be interpreted as the value of prefix function for string A. Suppose A1 and A2 are identical, the value of $latex \pi^{u+1}[q]$ will be the length of A1 and A2. Because A and B are already identical, which implies A1, A2, B1 and B2 are both equivalent. Therefore A1 and B2 are a pair of matching prefix and suffix of the total string. Thus, the induction step holds.

Screen Shot 2016-04-19 at 7.46.21 PM

So far, we have proved that all the values provided by  $latex \pi^{*}[q]$ – the iterative results of the prefix function – are proper matches of prefixes and suffixes, we will prove the completeness of $latex \pi^{*}[q]$, that is, $latex \pi^{*}[q]$ includes all possible matches of prefixes and suffixes.

Step 2

Next we prove $latex \pi^{*}[q] \supseteq  \{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$ – meaning all the matches of prefixes and suffixes are included in $latex \pi^{*}[q]$ – by contradiction. Suppose to the contrary, not all the matches of prefixes and suffixes are included in $latex \pi^{*}[q]$, the set difference between $latex \{k:k<q\ and\ P_{k}\sqsupset P_{q}\}$ and $latex \pi^{*}[q]$ will be non-empty, that is, $latex \{k:k<q\ and\ P_{k}\sqsupset P_{q}\}-\pi^{*}[q]$  is non-empty. Let $latex j$ be the largest number in this set. Because $latex \pi[q]$ is the largest value for matching prefixes and suffixes, we must have $latex j < \pi[q]$, meaning the set difference must not include the longest size of matching pairs. So we let $latex {j}’$ denote the smallest integer in $latex \pi^{*}[q]$ that is greater than $latex j$. We have $latex P_{j}\sqsupset P_{q}$, and because $latex {j}’\in \pi^{*}[q]$ we have $latex P_{{j}’}\sqsupset P_{q}$. Thus $latex P_{j}\sqsupset P_{{j}’}$. Because $latex j$ is the largest value less than $latex {j}’$ with this property, we must have $latex \pi[{j}’]=j$. And since $latex {j}’\in \pi^{*}[q]$, we must have $latex j\in \pi^{*}[q]$ as well. This contradiction proves the lemma.

Until now, we have proved the prefix function iteration lemma, which explains by iteratively apply $latex \pi[q]$, we can get the next smaller size of prefix that is a match with the suffix. Therefore the correctiveness of the algorithm if proven.

References:
[1] Introduction to Algorithms

Leave a Reply

Your email address will not be published. Required fields are marked *