Merge pull request #62 from 4ndrelim/branch-RefactorKMP

4ndrelim · web-flow · commit ba79246af8a9 · 2024-02-10T23:08:02.000+08:00
Branch refactor kmp
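
For reviewers, here is a minimal standalone sketch of the table that `getPrefixTable` builds and how the search consumes it. The class name `KmpSketch`, the method name `prefixTable`, and the empty-pattern guard are assumptions made for illustration and are not part of this PR; the loop bodies are meant to mirror the diff that follows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class for illustration only; not part of this PR.
class KmpSketch {
    // table[i] = length of the longest proper prefix of the pattern that is also
    // a suffix of its first i chars (1-indexed); table[0] = -1 is a sentinel.
    static int[] prefixTable(String pattern) {
        int len = pattern.length();
        int[] table = new int[len + 1];
        table[0] = -1;
        int matched = 0; // chars of the prefix currently matched
        int pos = 2;     // 1-indexed; position 1 has no proper prefix, so it stays 0
        while (pos <= len) {
            if (pattern.charAt(pos - 1) == pattern.charAt(matched)) {
                matched += 1;
                table[pos] = matched;
                pos += 1;
            } else if (matched > 0) {
                matched = table[matched]; // fall back to a shorter known match
            } else {
                table[pos] = 0;
                pos += 1;
            }
        }
        return table;
    }

    static List<Integer> findOccurrences(String sequence, String pattern) {
        // Assumption: reject empty patterns explicitly instead of relying on an assert.
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("Pattern length cannot be 0!");
        }
        int[] table = prefixTable(pattern);
        List<Integer> found = new ArrayList<>();
        int s = 0;
        int p = 0;
        while (s < sequence.length()) {
            if (pattern.charAt(p) == sequence.charAt(s)) {
                p += 1;
                s += 1;
                if (p == pattern.length()) {
                    found.add(s - p); // start index of this occurrence
                    p = table[p];     // reuse the longest prefix that is also a suffix
                }
            } else {
                p = table[p];
                if (p < 0) { // sentinel reached: nothing matched, move on
                    p = 0;
                    s += 1;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [-1, 0, 0, 0, 1, 2, 3, 0, 0, 1, 2, 3, 4, 5, 6, 4]
        System.out.println(Arrays.toString(prefixTable("ABCABCNOABCABCA")));
        // prints [5, 7]
        System.out.println(findOccurrences("ABABCABABABAB", "ABABAB"));
    }
}
```

The two calls in `main` reuse the examples from the Javadoc in `KMP.java`, so the printed values can be compared against the illustration there.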
diff --git a/src/main/java/algorithms/patternFinding/KMP.java b/src/main/java/algorithms/patternFinding/KMP.java
@@ -6,111 +6,112 @@
 /**
  * Implementation of KMP.
  * <p>
- * Illustration of getPrefixIndices: with pattern ABCABCNOABCABCA
- * Here we make a distinction between position and index. The position is basically 1-indexed.
- * Note the return indices are still 0-indexed of the pattern string.
+ * Illustration of getPrefixTable: with pattern ABCABCNOABCABCA
+ * We consider 1-indexed positions. Position 0 is used later as a trick to signal that there are no prefix matches.
  * Position:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
- * Pattern:      A   B   C   A   B   C   N   O   A   B   C   A   B   C   A ...
- * Return: -1   0   0   0   1   2   3   0   0   1   2   3   4   5   6   4 ...
- * Read:  ^  an indexing trick; consider 1-indexed characters for clarity and simplicity in the main algor
- * Read:      ^ 'A' is the first character of the pattern string,
- * there is no prefix ending before its index, 0, that can be matched with.
- * Read:          ^   ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
- * Read:                  ^ Can be matched with an earlier 'A'. So we store 1.
- * Prefix is the substring from idx 0 to 1 (exclusive). Note consider prefix from 0-indexed.
- * Realise 1 can also be interpreted as the index of the next character to match against!
- * Read:                      ^   ^ Similarly, continue matching
- * Read:                               ^  ^ No matches, so 0
- * Read:                                      ^   ^   ^   ^   ^   ^ Match with prefix until position 6!
- * Read:                                                              ^ where the magic happens, we can't match 'N'
- * at position 7 with 'A' at position 15, but
- * we know ABC of position 1-3 (or index 0-2)
- * exists and can 'restart' from there.
- * <p>
- * <p>
+ *  Pattern:      A   B   C   A   B   C   N   O   A   B   C   A   B   C   A ...
+ *   Return: -1   0   0   0   1   2   3   0   0   1   2   3   4   5   6   4 ...   CAN BE READ AS NUM OF CHARS MATCHED
+ *     Read:  ^ -1 is not a valid count of chars matched; it is a sentinel that simplifies the main algorithm.
+ *     Read:      ^ 'A' is the first character of the pattern, there is no prefix ending before itself, to match.
+ *     Read:          ^   ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
+ *     Read:                  ^ can be matched with an earlier prefix, 'A'. So we store 1, the number of chars matched.
+ *                            Realise 1 can also be interpreted as the index of the next character to match against!
+ *     Read:                      ^   ^ Similarly, continue matching
+ *     Read:                               ^  ^ No matches, so 0
+ *     Read:                                      ^   ^   ^   ^   ^   ^   Match with prefix, "ABCABC", until 6th char
+ *                                                                        of pattern string.
+ *     Read:                                                              ^ where the magic happens, we can't match 'N'
+ *                                                                        at position 7 with 'A' at position 15, but
+ *                                                                        we know "ABC" exists as an earlier sub-pattern
+ *                                                                        from 1st to 3rd and start matching the 4th
+ *                                                                        char onwards.
  * <p>
  * Illustration of main logic:
  * Pattern: ABABAB
  * String : ABABCABABABAB
  * <p>
- * A  B  A  B  C  A  B  A  B  A  B  A  B
- * Read:  ^    to  ^ Continue matching where possible, leading to Pattern[0:4] matched.
- * unable to match Pattern[4]. But notice that last two characters of String[0:4]
- * form a sub-pattern with Pattern[0:2] Maybe Pattern[2] == 'C' and we can 're-use' Pattern[0:2]
- * Read:        ^ try ^ by checking if Pattern[2] == 'C'
+ *        A  B  A  B  C  A  B  A  B  A  B  A  B
+ * Read:  ^    to  ^ Continue matching where possible, leading to 1st 4 characters matched.
+ *        Unable to match Pattern[4]. But notice that the last two characters matched
+ *        form a sub-pattern with the 1st 2. Maybe Pattern[2] == 'C' and we can 're-use' "AB".
+ * Read:        ^     ^ check if Pattern[2] == 'C'
  * Read:              Turns out no. No previously identified sub-pattern with 'C'. Restart matching Pattern.
- * Read:                 ^      to      ^ Found complete match! But rather than restart, notice that last 4 characters
- * Read:                 form a prefix sub-pattern of Pattern, which is Pattern[0:4] = "ABAB", so,
- * Read:                       ^               ^ Start matching from Pattern[4] and finally Pattern[5]
+ * Read:                 ^              ^ Found complete match! But rather than restart, notice that last 4 characters
+ * Read:                 of "ABABAB" form a prefix sub-pattern of Pattern, which is "ABAB", so,
+ * Read:                       ^  reuse  ^     ^ then match 5th and 6th char of pattern which happens to be "AB"
  */
 public class KMP {
     /**
-     * Find and indicate all suffix that match with a prefix.
+     * Captures the longest prefix which is also a suffix for some substring ending at each index, starting from 0.
+     * Does this by tracking the number of characters (of the prefix and suffix) matched.
      *
      * @param pattern to search
-     * @return an array of indices where the suffix ending at each position of they array can be matched with
-     *     corresponding a prefix of the pattern ending before the specified index
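+     * @return an array whose entry at 1-indexed position i is the number of chars of the longest proper prefix of
+     *     pattern that is also a suffix of its first i chars; entry 0 is the -1 sentinel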
      */
-    private static int[] getPrefixIndices(String pattern) {
+    private static int[] getPrefixTable(String pattern) {
+        // 1-indexed implementation
         int len = pattern.length();
-        int[] prefixIndices = new int[len + 1];
-        prefixIndices[0] = -1;
-        prefixIndices[1] = 0; // 1st character has no prefix to match with
+        int[] numCharsMatched = new int[len + 1];
+        numCharsMatched[0] = -1;
+        numCharsMatched[1] = 0; // 1st character has no prefix to match with
 
         int currPrefixMatched = 0; // num of chars of prefix pattern currently matched
-        int pos = 2; // Starting from the 2nd character, recall 1-indexed
+        int pos = 2; // Starting from the 2nd character
         while (pos <= len) {
             if (pattern.charAt(pos - 1) == pattern.charAt(currPrefixMatched)) {
                 currPrefixMatched += 1;
                 // note, the line below can also be interpreted as the index of the next char to match
-                prefixIndices[pos] = currPrefixMatched; // an indexing trick, store at the pos, num of chars matched
+                numCharsMatched[pos] = currPrefixMatched;
                 pos += 1;
             } else if (currPrefixMatched > 0) {
                 // go back to a previous known match and try to match again
-                currPrefixMatched = prefixIndices[currPrefixMatched];
+                currPrefixMatched = numCharsMatched[currPrefixMatched];
             } else {
                 // unable to match, time to move on
-                prefixIndices[pos] = 0;
+                numCharsMatched[pos] = 0;
                 pos += 1;
             }
         }
-        return prefixIndices;
+        return numCharsMatched;
     }
 
     /**
-     * Main logic of KMP. Iterate the sequence, looking for patterns. If a difference is found, resume matching from
-     * a previously identified sub-pattern, if possible. Length of pattern should be at least one.
-     *
+     * Main logic of KMP. Iterate the sequence, looking for patterns. If a mismatch is found, resume matching from
+     * a previously identified sub-pattern, if possible. Here we assume length of pattern is at least one.
      * @param sequence to search against
      * @param pattern  to search for
      * @return start indices of all occurrences of pattern found
      */
     public static List<Integer> findOccurrences(String sequence, String pattern) {
-        assert pattern.length() >= 1 : "Pattern length cannot be 0!";
-
         int sLen = sequence.length();
         int pLen = pattern.length();
-        int[] prefixIndices = getPrefixIndices(pattern);
+        int[] prefixTable = getPrefixTable(pattern);
         List<Integer> indicesFound = new ArrayList<>();
 
-        int s = 0;
-        int p = 0;
+        int sTrav = 0;
+        int pTrav = 0;
 
-        while (s < sLen) {
-            if (pattern.charAt(p) == sequence.charAt(s)) {
-                p += 1;
-                s += 1;
-                if (p == pLen) {
-                    // occurrence found
-                    indicesFound.add(s - pLen); // start index of this occurrence
-                    p = prefixIndices[p]; // reset
+        while (sTrav < sLen) {
+            if (pattern.charAt(pTrav) == sequence.charAt(sTrav)) {
+                pTrav += 1;
+                sTrav += 1;
+                if (pTrav == pLen) { // matched a complete pattern string
+                    indicesFound.add(sTrav - pLen); // start index of this occurrence
+                    // recall the number of chars matched in p can be read as the index of the next char in p to match
+                    pTrav = prefixTable[pTrav]; // start matching from a repeated sub-pattern, if possible
                 }
             } else {
-                p = prefixIndices[p];
-                if (p < 0) { // move on
-                    p += 1;
-                    s += 1;
+                pTrav = prefixTable[pTrav];
+                if (pTrav < 0) { // move on; using -1 trick
+                    pTrav += 1;
+                    sTrav += 1;
                 }
+                // ALTERNATIVELY
+                // if pTrav == 0 i.e. nothing matched, move on
+                //    sTrav += 1
+                //    continue
+                //
+                // pTrav = prefixTable[pTrav]
             }
         }
         return indicesFound;
diff --git a/src/main/java/algorithms/patternFinding/README.md b/src/main/java/algorithms/patternFinding/README.md
@@ -1,6 +1,9 @@
 # Knuth-Morris-Pratt Algorithm
 
-KMP match is a type of pattern-searching algorithm.
+## Background
+KMP is a pattern-searching algorithm that improves on naive search by avoiding unnecessary comparisons.
+The improvement is most noticeable when the pattern has repeating sub-patterns.
+<br>
 Pattern-searching problems are prevalent across many fields of CS, for instance,
 in text editors when searching for a pattern, in computational biology sequence matching problems,
 in NLP problems, and even for looking for file patterns for effective file management.
@@ -11,9 +14,31 @@ Typically, the algorithm returns a list of indices that denote the start of each
 ![KMP](../../../../../docs/assets/images/kmp.png)
 Image Source: GeeksforGeeks
 
-## Analysis
+### Intuition
+KMP is efficient because it uses the information gained from previous character comparisons. When a mismatch occurs,
+the algorithm uses this information to skip over as many characters as possible.
 
-**Time complexity**:
+Consider the pattern string: <br>
+<div style="text-align: center;">
+                "XYXYCXYXYF" 
+</div>
+and string: 
+<div style="text-align: center;">
+                XYXYCXYXYCXYXYFGABC
+</div>
+
+KMP has, during its initial processing of the pattern, identified that "XYXY" is a repeating sub-pattern. 
+This means when the mismatch at F (10th character of the pattern) and C (10th character of the string) occurs, 
+KMP doesn't need to start matching again from the very beginning of the pattern. <br>
+Instead, it leverages the information that "XYXY" has already been matched.
+
+Therefore, the algorithm continues matching from the 5th character of the pattern string (C in "XYXYCXYXYF"). <br> 
+It checks this against the 10th character of the string (C in "XYXYCXYXYCXYXYFGABC"). <br>
+Since they match, the algorithm continues from there without re-checking the initial "XYXY".
+
+## Complexity Analysis
+Let k be the length of the pattern and n be the length of the string to match against. <br>
+**Time complexity**: O(n+k)
 
 Naively, we can look for patterns in a given sequence in O(nk) where n is the length of the sequence and k
 is the length of the pattern. We do this by iterating every character of the sequence, and look at the
@@ -27,7 +52,10 @@ O(n) traversal of the sequence. More details found in the src code.
 **Space complexity**: O(k) auxiliary space to store suffix that matches with prefix of the pattern string
 
 ## Notes
-
-A detailed illustration of how the algorithm works is shown in the code.
+1. A detailed illustration of how the algorithm works is shown in the code.
 But if you have trouble understanding the implementation,
 here is a good [video](https://www.youtube.com/watch?v=EL4ZbRF587g) as well. 
+2. A subroutine to find the Longest Prefix Suffix (LPS) is commonly used in the preprocessing step of KMP.
+It may be useful to interpret these numbers as the number of characters matched between the suffix and prefix. <br>
+Knowing the number of prefix characters already matched tells us the position of the next character of the pattern to
+match.
diff --git a/src/test/java/algorithms/patternFinding/KmpTest.java b/src/test/java/algorithms/patternFinding/KmpTest.java
@@ -34,11 +34,16 @@ public void testEmptySequence_findOccurrences_shouldReturnStartIndices() {
     @Test
     public void testNoOccurence_findOccurrences_shouldReturnStartIndices() {
         String seq = "abcabcabc";
-        String pattern = "noway";
+        String patternOne = "noway";
+        String patternTwo = "cbc";
 
-        List<Integer> indices = KMP.findOccurrences(seq, pattern);
-        List<Integer> expected = new ArrayList<>();
-        Assert.assertEquals(expected, indices);
+        List<Integer> indicesOne = KMP.findOccurrences(seq, patternOne);
+        List<Integer> expectedOne = new ArrayList<>();
+        Assert.assertEquals(expectedOne, indicesOne);
+
+        List<Integer> indicesTwo = KMP.findOccurrences(seq, patternTwo);
+        List<Integer> expectedTwo = new ArrayList<>();
+        Assert.assertEquals(expectedTwo, indicesTwo);
     }
 
     @Test