filmov
tv
Knuth Morris Pratt (KMP) String Search Algorithm - tutorial with failure function in Java

Показать описание
This tutorial explains how the Knuth-Morris-Pratt (KMP) pattern matching algorithm works. Animated examples are used to quickly visualize the basic concept. Then the source code of an implementation written in Java is discussed, along with its running time complexity.
There are two parts to the algorithm. The first one creates a "failure function", which is nothing more than an array that holds the longest prefixes as they occur through out the pattern string. The main algorithm uses this prefix length array to calculate how much forward to shift the pattern string.
Source code written in Java:
The implementation is based on the pseudo-code taken from Wikipedia:
FAST PATTERN MATCHING IN STRINGS
DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT
Written and narrated by Andre Violentyev
-----------------------------------------------------------------------------------------------------------------------------------------------
Errata (credit to @Achyut Sapkota):
My animation at 6:04 should have moved not by 2 but by 3 characters. By the way, at that point we'd also increment the pointer in the text to the next character. The source code (see above) is correct - once we get to index zero of the prefixLen array, which stores -1, we increment both the text and the pattern string pointers:
p = prefixLen[p];
if (p 'less than' 0) {
t++;
p++;
}
In the above example once we shift by 3, we essentially go "out of bounds" for the pattern string. The illustrations show that the first character, 'a', is at index 1 of the prefixLen array (not index 0, which is probably what is causing confusion since most programming languages use 0 based indexes for strings). So the formula works, it's just that, once we detect the "out of bounds" condition, we reset the pattern string pointer to the first character, and increment the text pointer by 1. In other words, we shift the pattern string so that its first character aligns with the next character in the text. This is what insures that each character in the text is examined at most twice.
There are two parts to the algorithm. The first one creates a "failure function", which is nothing more than an array that holds the longest prefixes as they occur through out the pattern string. The main algorithm uses this prefix length array to calculate how much forward to shift the pattern string.
Source code written in Java:
The implementation is based on the pseudo-code taken from Wikipedia:
FAST PATTERN MATCHING IN STRINGS
DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT
Written and narrated by Andre Violentyev
-----------------------------------------------------------------------------------------------------------------------------------------------
Errata (credit to @Achyut Sapkota):
My animation at 6:04 should have moved not by 2 but by 3 characters. By the way, at that point we'd also increment the pointer in the text to the next character. The source code (see above) is correct - once we get to index zero of the prefixLen array, which stores -1, we increment both the text and the pattern string pointers:
p = prefixLen[p];
if (p 'less than' 0) {
t++;
p++;
}
In the above example once we shift by 3, we essentially go "out of bounds" for the pattern string. The illustrations show that the first character, 'a', is at index 1 of the prefixLen array (not index 0, which is probably what is causing confusion since most programming languages use 0 based indexes for strings). So the formula works, it's just that, once we detect the "out of bounds" condition, we reset the pattern string pointer to the first character, and increment the text pointer by 1. In other words, we shift the pattern string so that its first character aligns with the next character in the text. This is what insures that each character in the text is examined at most twice.
Комментарии