Commit ba79246

Merge pull request #62 from 4ndrelim/branch-RefactorKMP
Branch refactor kmp
2 parents d3f6b76 + 5aeb10b commit ba79246

3 files changed

+104
-70
lines changed

src/main/java/algorithms/patternFinding/KMP.java

Lines changed: 62 additions & 61 deletions
@@ -6,111 +6,112 @@
66
/**
77
* Implementation of KMP.
88
* <p>
9-
* Illustration of getPrefixIndices: with pattern ABCABCNOABCABCA
10-
* Here we make a distinction between position and index. The position is basically 1-indexed.
11-
* Note the return indices are still 0-indexed of the pattern string.
9+
* Illustration of getPrefixTable: with pattern ABCABCNOABCABCA
10+
* We consider 1-indexed positions. Position 0 is used later as a sentinel to signal that there are no prefix matches.
1211
* Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
13-
* Pattern: A B C A B C N O A B C A B C A ...
14-
* Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ...
15-
* Read: ^ an indexing trick; consider 1-indexed characters for clarity and simplicity in the main algor
16-
* Read: ^ 'A' is the first character of the pattern string,
17-
* there is no prefix ending before its index, 0, that can be matched with.
18-
* Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
19-
* Read: ^ Can be matched with an earlier 'A'. So we store 1.
20-
* Prefix is the substring from idx 0 to 1 (exclusive). Note consider prefix from 0-indexed.
21-
* Realise 1 can also be interpreted as the index of the next character to match against!
22-
* Read: ^ ^ Similarly, continue matching
23-
* Read: ^ ^ No matches, so 0
24-
* Read: ^ ^ ^ ^ ^ ^ Match with prefix until position 6!
25-
* Read: ^ where the magic happens, we can't match 'N'
26-
* at position 7 with 'A' at position 15, but
27-
* we know ABC of position 1-3 (or index 0-2)
28-
* exists and can 'restart' from there.
29-
* <p>
30-
* <p>
12+
* Pattern: A B C A B C N O A B C A B C A ...
13+
* Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ... CAN BE READ AS NUM OF CHARS MATCHED
14+
* Read: ^ -1 can be interpreted as invalid number of chars matched but exploited for simplicity in the main algor.
15+
* Read: ^ 'A' is the first character of the pattern, there is no prefix ending before itself, to match.
16+
* Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
17+
* Read: ^ can be matched with an earlier prefix, 'A'. So we store 1, the number of chars matched.
18+
* Realise 1 can also be interpreted as the index of the next character to match against!
19+
* Read: ^ ^ Similarly, continue matching
20+
* Read: ^ ^ No matches, so 0
21+
* Read: ^ ^ ^ ^ ^ ^ Match with prefix, "ABCABC", until 6th char
22+
* of pattern string.
23+
* Read: ^ where the magic happens, we can't match 'N'
24+
* at position 7 with 'A' at position 15, but
25+
* we know "ABC" exists as an earlier sub-pattern
26+
* from 1st to 3rd and start matching the 4th
27+
* char onwards.
3128
* <p>
3229
* Illustration of main logic:
3330
* Pattern: ABABAB
3431
* String : ABABCABABABAB
3532
* <p>
36-
* A B A B C A B A B A B A B
37-
* Read: ^ to ^ Continue matching where possible, leading to Pattern[0:4] matched.
38-
* unable to match Pattern[4]. But notice that last two characters of String[0:4]
39-
* form a sub-pattern with Pattern[0:2] Maybe Pattern[2] == 'C' and we can 're-use' Pattern[0:2]
40-
* Read: ^ try ^ by checking if Pattern[2] == 'C'
33+
* A B A B C A B A B A B A B
34+
* Read: ^ to ^ Continue matching where possible, leading to 1st 4 characters matched.
35+
* unable to match Pattern[4]. But notice that last two characters
36+
* form a sub-pattern with the 1st 2, Maybe Pattern[2] == 'C' and we can 're-use' "AB"
37+
* Read: ^ ^ check if Pattern[2] == 'C'
4138
* Read: Turns out no. No previously identified sub-pattern with 'C'. Restart matching Pattern.
42-
* Read: ^ to ^ Found complete match! But rather than restart, notice that last 4 characters
43-
* Read: form a prefix sub-pattern of Pattern, which is Pattern[0:4] = "ABAB", so,
44-
* Read: ^ ^ Start matching from Pattern[4] and finally Pattern[5]
39+
* Read: ^ ^ Found complete match! But rather than restart, notice that last 4 characters
40+
* Read: of "ABABAB" form a prefix sub-pattern of Pattern, which is "ABAB", so,
41+
* Read: ^ reuse ^ ^ then match 5th and 6th char of pattern which happens to be "AB"
4542
*/
4643
public class KMP {
4744
/**
48-
* Find and indicate all suffix that match with a prefix.
45+
* Captures the longest prefix which is also a suffix for some substring ending at each index, starting from 0.
46+
* Does this by tracking the number of characters (of the prefix and suffix) matched.
4947
*
5048
* @param pattern to search
51-
* @return an array of indices where the suffix ending at each position of they array can be matched with
52-
* corresponding a prefix of the pattern ending before the specified index
49+
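* @return an array where entry i holds the number of prefix characters matched by the suffix ending at 1-indexed position i; entry 0 is -1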
5350
*/
54-
private static int[] getPrefixIndices(String pattern) {
51+
private static int[] getPrefixTable(String pattern) {
52+
// 1-indexed implementation
5553
int len = pattern.length();
56-
int[] prefixIndices = new int[len + 1];
57-
prefixIndices[0] = -1;
58-
prefixIndices[1] = 0; // 1st character has no prefix to match with
54+
int[] numCharsMatched = new int[len + 1];
55+
numCharsMatched[0] = -1;
56+
numCharsMatched[1] = 0; // 1st character has no prefix to match with
5957

6058
int currPrefixMatched = 0; // num of chars of prefix pattern currently matched
61-
int pos = 2; // Starting from the 2nd character, recall 1-indexed
59+
int pos = 2; // Starting from the 2nd character
6260
while (pos <= len) {
6361
if (pattern.charAt(pos - 1) == pattern.charAt(currPrefixMatched)) {
6462
currPrefixMatched += 1;
6563
// note, the line below can also be interpreted as the index of the next char to match
66-
prefixIndices[pos] = currPrefixMatched; // an indexing trick, store at the pos, num of chars matched
64+
numCharsMatched[pos] = currPrefixMatched;
6765
pos += 1;
6866
} else if (currPrefixMatched > 0) {
6967
// go back to a previous known match and try to match again
70-
currPrefixMatched = prefixIndices[currPrefixMatched];
68+
currPrefixMatched = numCharsMatched[currPrefixMatched];
7169
} else {
7270
// unable to match, time to move on
73-
prefixIndices[pos] = 0;
71+
numCharsMatched[pos] = 0;
7472
pos += 1;
7573
}
7674
}
77-
return prefixIndices;
75+
return numCharsMatched;
7876
}
7977

8078
/**
81-
* Main logic of KMP. Iterate the sequence, looking for patterns. If a difference is found, resume matching from
82-
* a previously identified sub-pattern, if possible. Length of pattern should be at least one.
83-
*
79+
* Main logic of KMP. Iterate the sequence, looking for patterns. If a mismatch is found, resume matching from
80+
* a previously identified sub-pattern, if possible. Here we assume length of pattern is at least one.
8481
* @param sequence to search against
8582
* @param pattern to search for
8683
* @return start indices of all occurrences of pattern found
8784
*/
8885
public static List<Integer> findOccurrences(String sequence, String pattern) {
89-
assert pattern.length() >= 1 : "Pattern length cannot be 0!";
90-
9186
int sLen = sequence.length();
9287
int pLen = pattern.length();
93-
int[] prefixIndices = getPrefixIndices(pattern);
88+
int[] prefixTable = getPrefixTable(pattern);
9489
List<Integer> indicesFound = new ArrayList<>();
9590

96-
int s = 0;
97-
int p = 0;
91+
int sTrav = 0;
92+
int pTrav = 0;
9893

99-
while (s < sLen) {
100-
if (pattern.charAt(p) == sequence.charAt(s)) {
101-
p += 1;
102-
s += 1;
103-
if (p == pLen) {
104-
// occurrence found
105-
indicesFound.add(s - pLen); // start index of this occurrence
106-
p = prefixIndices[p]; // reset
94+
while (sTrav < sLen) {
95+
if (pattern.charAt(pTrav) == sequence.charAt(sTrav)) {
96+
pTrav += 1;
97+
sTrav += 1;
98+
if (pTrav == pLen) { // matched a complete pattern string
99+
indicesFound.add(sTrav - pLen); // start index of this occurrence
100+
// recall the number of chars matched in p can be read as the index of the next char in p to match
101+
pTrav = prefixTable[pTrav]; // start matching from a repeated sub-pattern, if possible
107102
}
108103
} else {
109-
p = prefixIndices[p];
110-
if (p < 0) { // move on
111-
p += 1;
112-
s += 1;
104+
pTrav = prefixTable[pTrav];
105+
if (pTrav < 0) { // move on; using -1 trick
106+
pTrav += 1;
107+
sTrav += 1;
113108
}
109+
// ALTERNATIVELY
110+
// if pTrav == 0 i.e. nothing matched, move on
111+
// sTrav += 1
112+
// continue
113+
//
114+
// pTrav = prefixTable[pTrav]
114115
}
115116
}
116117
return indicesFound;
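
To try the refactored logic outside the repository, the following stand-alone sketch may help. It mirrors the structure of `getPrefixTable` and `findOccurrences` in the diff above, but `KmpSketch` and its method names are hypothetical, and the empty-pattern guard is an assumption added here (the diff removes the assertion on pattern length and only documents the precondition).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone class; not the repository's KMP class.
final class KmpSketch {

    // table[i] = number of characters of the longest proper prefix of the first i
    // pattern characters that is also a suffix of them. table[0] = -1 is a sentinel.
    static int[] prefixTable(String pattern) {
        int len = pattern.length();
        int[] table = new int[len + 1];
        table[0] = -1;
        int matched = 0; // prefix characters currently matched
        int pos = 2;     // 1-indexed position; table[1] stays 0
        while (pos <= len) {
            if (pattern.charAt(pos - 1) == pattern.charAt(matched)) {
                matched++;
                table[pos] = matched;
                pos++;
            } else if (matched > 0) {
                matched = table[matched]; // fall back to a shorter known match
            } else {
                table[pos] = 0;
                pos++;
            }
        }
        return table;
    }

    static List<Integer> findOccurrences(String sequence, String pattern) {
        List<Integer> found = new ArrayList<>();
        if (pattern.isEmpty()) {
            return found; // assumption: an empty pattern yields no occurrences
        }
        int[] table = prefixTable(pattern);
        int s = 0; // index into sequence
        int p = 0; // index into pattern, also the number of characters matched
        while (s < sequence.length()) {
            if (pattern.charAt(p) == sequence.charAt(s)) {
                p++;
                s++;
                if (p == pattern.length()) {
                    found.add(s - p); // start index of this occurrence
                    p = table[p];     // reuse the matched suffix, if any
                }
            } else {
                p = table[p];
                if (p < 0) { // sentinel reached: nothing to reuse, move on
                    p = 0;
                    s++;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Patterns taken from the Javadoc illustrations in the diff.
        System.out.println(Arrays.toString(prefixTable("ABCABCNOABCABCA")));
        System.out.println(findOccurrences("ABABCABABABAB", "ABABAB")); // prints [5, 7]
    }
}
```

The second call reproduces the "Illustration of main logic" from the Javadoc: after the first complete match at index 5, the last four matched characters are reused, so the overlapping occurrence at index 7 is found without restarting.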

src/main/java/algorithms/patternFinding/README.md

Lines changed: 33 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,9 @@
11
# Knuth-Morris-Pratt Algorithm
22

3-
KMP match is a type of pattern-searching algorithm.
3+
## Background
4+
KMP is a pattern-searching algorithm that improves on naive search by avoiding unnecessary
5+
comparisons. The gain is most noticeable when the pattern has repeating sub-patterns.
6+
<br>
47
Pattern-searching problems are prevalent across many fields of CS, for instance,
58
in text editors when searching for a pattern, in computational biology sequence matching problems,
69
in NLP problems, and even for looking for file patterns for effective file management.
@@ -11,9 +14,31 @@ Typically, the algorithm returns a list of indices that denote the start of each
1114
![KMP](../../../../../docs/assets/images/kmp.png)
1215
Image Source: GeeksforGeeks
1316

14-
## Analysis
17+
### Intuition
18+
KMP is efficient because it reuses the information gained from previous character comparisons. When a mismatch occurs,
19+
the algorithm uses this information to skip over as many characters as possible.
1520

16-
**Time complexity**:
21+
Consider the pattern: <br>
22+
<div style="text-align: center;">
23+
"XYXYCXYXYF"
24+
</div>
25+
and string:
26+
<div style="text-align: center;">
27+
XYXYCXYXYCXYXYFGABC
28+
</div>
29+
30+
KMP has, during its initial processing of the pattern, identified that "XYXY" is both a prefix and a suffix of the first nine characters, "XYXYCXYXY".
31+
This means that when F (10th character of the pattern) fails to match C (10th character of the string),
32+
KMP doesn't need to start matching again from the very beginning of the pattern. <br>
33+
Instead, it leverages the information that "XYXY" has already been matched.
34+
35+
Therefore, the algorithm continues matching from the 5th character of the pattern string (C in "XYXYCXYXYF"). <br>
36+
It checks this against the 10th character of the string (C in "XYXYCXYXYCXYXYFGABC"). <br>
37+
Since they match, the algorithm continues from there without re-checking the initial "XYXY".
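
The fallback described above can be checked with a short sketch. `FallbackDemo` and `resumeIndex` are hypothetical names used only for illustration; the table is built the same way as the number-of-characters-matched table in `KMP.java`, with `-1` stored at position 0.

```java
import java.util.Arrays;

// Hypothetical demo class; illustrates the README example only.
final class FallbackDemo {

    // table[i] = length of the longest proper prefix of the first i pattern
    // characters that is also a suffix of them; table[0] = -1 is a sentinel.
    static int[] prefixTable(String pattern) {
        int len = pattern.length();
        int[] table = new int[len + 1];
        table[0] = -1;
        int matched = 0;
        int pos = 2;
        while (pos <= len) {
            if (pattern.charAt(pos - 1) == pattern.charAt(matched)) {
                matched++;
                table[pos] = matched;
                pos++;
            } else if (matched > 0) {
                matched = table[matched];
            } else {
                table[pos] = 0;
                pos++;
            }
        }
        return table;
    }

    // 0-indexed pattern position to resume from after a mismatch that
    // happens once `matched` characters of the pattern have been matched.
    static int resumeIndex(String pattern, int matched) {
        return prefixTable(pattern)[matched];
    }

    public static void main(String[] args) {
        String pattern = "XYXYCXYXYF";
        // prints [-1, 0, 0, 1, 2, 0, 1, 2, 3, 4, 0]
        System.out.println(Arrays.toString(prefixTable(pattern)));
        // Nine characters matched before F mismatches C in the string.
        int resume = resumeIndex(pattern, 9);
        // prints 4 -> C
        System.out.println(resume + " -> " + pattern.charAt(resume));
    }
}
```

The resume index 4 is the 0-indexed position of the 5th pattern character, C, which is then compared against the 10th character of the string.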
38+
39+
## Complexity Analysis
40+
Let k be the length of the pattern and n be the length of the string to match against.
41+
**Time complexity**: O(n+k)
1742

1843
Naively, we can look for patterns in a given sequence in O(nk) where n is the length of the sequence and k
1944
is the length of the pattern. We do this by iterating every character of the sequence, and look at the
@@ -27,7 +52,10 @@ O(n) traversal of the sequence. More details found in the src code.
2752
**Space complexity**: O(k) auxiliary space to store, for each position of the pattern, the length of the suffix that matches a prefix of the pattern string
2853

2954
## Notes
30-
31-
A detailed illustration of how the algorithm works is shown in the code.
55+
1. A detailed illustration of how the algorithm works is shown in the code.
3256
But if you have trouble understanding the implementation,
3357
here is a good [video](https://www.youtube.com/watch?v=EL4ZbRF587g) as well.
58+
2. A subroutine to find Longest Prefix Suffix (LPS) is commonly involved in the preprocessing step of KMP.
59+
It may be useful to interpret the LPS values as the number of characters matched between the suffix and prefix. <br>
60+
Knowing how many prefix characters are already matched gives the position of the next character of the pattern to
61+
match.

src/test/java/algorithms/patternFinding/KmpTest.java

Lines changed: 9 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -34,11 +34,16 @@ public void testEmptySequence_findOccurrences_shouldReturnStartIndices() {
3434
@Test
3535
public void testNoOccurence_findOccurrences_shouldReturnStartIndices() {
3636
String seq = "abcabcabc";
37-
String pattern = "noway";
37+
String patternOne = "noway";
38+
String patternTwo = "cbc";
3839

39-
List<Integer> indices = KMP.findOccurrences(seq, pattern);
40-
List<Integer> expected = new ArrayList<>();
41-
Assert.assertEquals(expected, indices);
40+
List<Integer> indicesOne = KMP.findOccurrences(seq, patternOne);
41+
List<Integer> expectedOne = new ArrayList<>();
42+
Assert.assertEquals(expectedOne, indicesOne);
43+
44+
List<Integer> indicesTwo = KMP.findOccurrences(seq, patternTwo);
45+
List<Integer> expectedTwo = new ArrayList<>();
46+
Assert.assertEquals(expectedTwo, indicesTwo);
4247
}
4348

4449
@Test
