jacksonpradolima · Dec 21, 2025
@@ -44,14 +44,15 @@ principles**. Using support thresholds, GSP identifies frequent sequences of ite
 
 ### Key Features:
 
+- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.
 - **Support-based pruning**: Only retains sequences that meet the minimum support threshold.
 - **Candidate generation**: Iteratively generates candidate sequences of increasing length.
 - **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
 
 For example:
 
-- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next."
-- In a website clickstream, GSP might find patterns like "Users visit A, then go to B, and later proceed to C."
+- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next", even if other items are bought in between.
+- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C" - capturing user journeys with intermediate steps.
 
 ---
 
@@ -367,24 +368,57 @@ Sample Output:
 ```python
 [
     {('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
-    {('Bread', 'Milk'): 3, ('Milk', 'Diaper'): 3, ('Diaper', 'Beer'): 3},
-    {('Bread', 'Milk', 'Diaper'): 2, ('Milk', 'Diaper', 'Beer'): 2}
+    {('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},
+    {('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}
 ]
 ```
 
 - The **first dictionary** contains single-item sequences with their frequencies (e.g., `('Bread',): 4` means "Bread"
   appears in 4 transactions).
 - The **second dictionary** contains 2-item sequential patterns (e.g., `('Bread', 'Milk'): 3` means the sequence "
-  Bread → Milk" appears in 3 transactions).
+  Bread → Milk" appears in 3 transactions). Note that patterns like `('Bread', 'Beer')` are detected even when they don't appear adjacent in transactions - they just need to appear in order.
 - The **third dictionary** contains 3-item sequential patterns (e.g., `('Bread', 'Milk', 'Diaper'): 2` means the
   sequence "Bread → Milk → Diaper" appears in 2 transactions).
 
 > [!NOTE]
-> The **support** of a sequence is calculated as the fraction of transactions containing the sequence, e.g.,
-`[Bread, Milk]` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
+> The **support** of a sequence is calculated as the fraction of transactions containing the sequence **in order** (not necessarily contiguously), e.g.,
+`('Bread', 'Milk')` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
 > This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
 > behavior.
 
+> [!IMPORTANT]
+> **Non-contiguous (ordered) matching**: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern `('Bread', 'Beer')` matches the transaction `['Bread', 'Milk', 'Diaper', 'Beer']` because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.
+
+### Understanding Non-Contiguous Pattern Matching
+
+GSP-Py follows the standard GSP algorithm semantics by detecting **ordered (non-contiguous)** subsequences. This means:
+
+- ✅ **Order matters**: Items must appear in the specified sequence order
+- ✅ **Gaps allowed**: Items don't need to be adjacent
+- ❌ **Wrong order rejected**: Items appearing in different order won't match
+
+**Example:**
+
+```python
+from gsppy.gsp import GSP
+
+sequences = [
+    ['a', 'b', 'c'],       # Contains: (a,b), (a,c), (b,c), (a,b,c)
+    ['a', 'c'],            # Contains: (a,c)
+    ['b', 'c', 'a'],       # Contains: (b,c), (b,a), (c,a), (b,c,a)
+    ['a', 'b', 'c', 'd'],  # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.
+]
+
+gsp = GSP(sequences)
+result = gsp.search(min_support=0.5)  # Need at least 2/4 sequences
+
+# Pattern ('a', 'c') is found with support=3 because:
+# - It appears in ['a', 'b', 'c'] (with 'b' in between)
+# - It appears in ['a', 'c'] (adjacent)
+# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)
+# Total: 3 out of 4 sequences = 75% support ✅
+```
+
 
 > [!TIP]
 > For more complex examples, find example scripts in the [`gsppy/tests`](gsppy/tests) folder.

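The support figure quoted in the README example can be checked by hand without GSP-Py. The sketch below is an independent illustration, not the library's implementation: `contains_in_order` and `support` are hypothetical helper names, and the iterator-based check is just one common way to test for an ordered subsequence.

```python
from typing import List, Sequence, Tuple


def contains_in_order(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if every item of `pattern` occurs in `sequence` in order (gaps allowed)."""
    if not pattern:
        return False  # mirror the PR's convention: an empty pattern never matches
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come after the previous one.
    return all(item in remaining for item in pattern)


def support(pattern: Tuple[str, ...], sequences: List[List[str]]) -> Tuple[int, float]:
    """Count the sequences containing `pattern` and return (count, fraction)."""
    count = sum(contains_in_order(pattern, seq) for seq in sequences)
    return count, count / len(sequences)


sequences = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c", "a"],
    ["a", "b", "c", "d"],
]

count, fraction = support(("a", "c"), sequences)
print(count, fraction)  # 3 0.75
```

The third sequence does not count because `'a'` only appears after `'c'` there, which is the order sensitivity the README section describes.
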
@@ -5,14 +5,14 @@
 
 The key functionalities include:
 1. Splitting a list of items into smaller batches for easier processing.
-2. Checking for the existence of a contiguous subsequence within a sequence,
+2. Checking for the existence of an ordered (non-contiguous) subsequence within a sequence,
    with caching to optimize repeated comparisons.
 3. Generating candidate patterns from a dictionary of frequent patterns
    to support pattern generation tasks in algorithms like sequence mining.
 
 Main functionalities:
 - `split_into_batches`: Splits a list of items into smaller batches based on a specified batch size.
-- `is_subsequence_in_list`: Determines if a subsequence exists within another sequence,
+- `is_subsequence_in_list`: Determines if a subsequence exists within another sequence in order,
   using caching to improve performance.
 - `generate_candidates_from_previous`: Generates candidate patterns by joining previously
   identified frequent patterns.
@@ -46,27 +46,44 @@ def split_into_batches(
 @lru_cache(maxsize=None)
 def is_subsequence_in_list(subsequence: Tuple[str, ...], sequence: Tuple[str, ...]) -> bool:
     """
-    Check if a subsequence exists within a sequence as a contiguous subsequence.
+    Check if a subsequence exists within a sequence as an ordered (non-contiguous) subsequence.
+
+    This function implements the standard GSP semantics where items in the subsequence
+    must appear in the same order in the sequence, but not necessarily contiguously.
 
     Parameters:
         subsequence (tuple): The subsequence to search for.
         sequence (tuple): The sequence to search within.
 
     Returns:
         bool: True if the subsequence is found, False otherwise.
+
+    Examples:
+        >>> is_subsequence_in_list(('a', 'c'), ('a', 'b', 'c'))
+        True
+        >>> is_subsequence_in_list(('a', 'c'), ('c', 'a'))
+        False
+        >>> is_subsequence_in_list(('a', 'b'), ('a', 'b', 'c'))
+        True
     """
     # Handle the case where the subsequence is empty - it should not exist in any sequence
     if not subsequence:
         return False
 
     len_sub, len_seq = len(subsequence), len(sequence)
 
-    # Return False if the sequence is longer than the list
+    # Return False if the subsequence is longer than the sequence
     if len_sub > len_seq:
         return False
 
-    # Use any to check if any slice matches the sequence
-    return any(sequence[i : i + len_sub] == subsequence for i in range(len_seq - len_sub + 1))
+    # Use two-pointer approach to check if subsequence exists in order
+    sub_idx = 0
+    for seq_idx in range(len_seq):
+        if sequence[seq_idx] == subsequence[sub_idx]:
+            sub_idx += 1
+            if sub_idx == len_sub:
+                return True
+    return False
 
 
 def generate_candidates_from_previous(prev_patterns: Dict[Tuple[str, ...], int]) -> List[Tuple[str, ...]]:

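To make the behavioral change in `is_subsequence_in_list` concrete, the sketch below places the removed slice-based check next to the new two-pointer check. Both functions are re-typed here from the diff under hypothetical names (`contiguous_match`, `ordered_match`) and without the `lru_cache` decorator, so treat this as an illustration rather than the module's code.

```python
from typing import Tuple


def contiguous_match(subsequence: Tuple, sequence: Tuple) -> bool:
    """Old behavior: the subsequence must appear as an unbroken slice."""
    if not subsequence or len(subsequence) > len(sequence):
        return False
    n = len(subsequence)
    return any(sequence[i : i + n] == subsequence for i in range(len(sequence) - n + 1))


def ordered_match(subsequence: Tuple, sequence: Tuple) -> bool:
    """New behavior: items must appear in order, gaps allowed."""
    if not subsequence or len(subsequence) > len(sequence):
        return False
    sub_idx = 0
    for item in sequence:
        if item == subsequence[sub_idx]:
            sub_idx += 1
            if sub_idx == len(subsequence):
                return True
    return False


cases = [
    (("a", "b"), ("a", "b", "c")),  # adjacent: both checks match
    (("a", "c"), ("a", "b", "c")),  # gap: only the ordered check matches
    (("c", "a"), ("a", "b", "c")),  # wrong order: neither check matches
]
results = [(contiguous_match(sub, seq), ordered_match(sub, seq)) for sub, seq in cases]
print(results)  # [(True, True), (False, True), (False, False)]
```

Because the real function is wrapped in `functools.lru_cache`, both arguments must be hashable, which is why callers pass tuples rather than lists.
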
@@ -168,13 +168,28 @@ def test_frequent_patterns(supermarket_transactions: List[List[str]]) -> None:
 
     Asserts:
         - The frequent patterns should match the expected result.
+        - Non-contiguous patterns are correctly detected.
     """
     gsp = GSP(supermarket_transactions)
     result = gsp.search(min_support=0.3)
     expected = [
         {("Bread",): 4, ("Milk",): 4, ("Diaper",): 4, ("Beer",): 3, ("Coke",): 2},
-        {("Bread", "Milk"): 3, ("Milk", "Diaper"): 3, ("Diaper", "Beer"): 3},
-        {("Bread", "Milk", "Diaper"): 2, ("Milk", "Diaper", "Beer"): 2},
+        {
+            ("Bread", "Milk"): 3,
+            ("Bread", "Diaper"): 3,
+            ("Bread", "Beer"): 2,
+            ("Milk", "Diaper"): 3,
+            ("Milk", "Beer"): 2,
+            ("Milk", "Coke"): 2,
+            ("Diaper", "Beer"): 3,
+            ("Diaper", "Coke"): 2,
+        },
+        {
+            ("Bread", "Milk", "Diaper"): 2,
+            ("Bread", "Diaper", "Beer"): 2,
+            ("Milk", "Diaper", "Beer"): 2,
+            ("Milk", "Diaper", "Coke"): 2,
+        },
     ]
     assert result == expected, "Frequent patterns do not match expected results."
 
@@ -231,6 +246,131 @@ def test_partial_match(supermarket_transactions: List[List[str]]) -> None:
         assert result_level_2 >= expected_patterns_level_2, f"Level 2 patterns mismatch. Got {result_level_2}"
 
 
+def test_non_contiguous_subsequences() -> None:
+    """
+    Test that the GSP algorithm correctly detects non-contiguous subsequences (Issue #115).
+
+    This test validates that patterns like ('a', 'c') are detected even when
+    they appear with gaps in sequences like ['a', 'b', 'c'].
+
+    Asserts:
+        - Non-contiguous patterns are correctly identified with proper support.
+    """
+    sequences = [
+        ["a", "b", "c"],
+        ["a", "c"],
+        ["b", "c", "a"],
+        ["a", "b", "c", "d"],
+    ]
+
+    gsp = GSP(sequences)
+    result = gsp.search(min_support=0.5)
+
+    # Expected: ('a', 'c') should be found with support = 3
+    # It appears in: ['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'c', 'd']
+    assert len(result) >= 2, "Expected at least 2 levels of patterns"
+
+    level_2_patterns = result[1]
+    assert ("a", "c") in level_2_patterns, f"Pattern ('a', 'c') not found in level 2. Got {level_2_patterns}"
+    assert level_2_patterns[("a", "c")] == 3, f"Expected support 3 for ('a', 'c'), got {level_2_patterns[('a', 'c')]}"
+
+
+def test_contiguous_vs_non_contiguous_patterns() -> None:
+    """
+    Comprehensive test demonstrating the difference between contiguous and non-contiguous patterns.
+
+    This test shows patterns that would ONLY be found in non-contiguous matching (current implementation)
+    vs patterns that would be found in BOTH contiguous and non-contiguous matching.
+
+    The current implementation uses non-contiguous (ordered) matching, which is the standard GSP behavior.
+    """
+    sequences = [
+        ["X", "Y", "Z"],  # Contains X->Y, Y->Z, X->Z (contiguous: X->Y, Y->Z only)
+        ["X", "Z"],  # Contains X->Z (contiguous: X->Z)
+        ["Y", "Z", "X"],  # Contains Y->Z, Y->X, Z->X (contiguous: Y->Z, Z->X only)
+        ["X", "Y", "Z", "W"],  # Contains many patterns
+    ]
+
+    gsp = GSP(sequences)
+    result = gsp.search(min_support=0.5)  # Need at least 2/4 sequences
+
+    # Level 2 patterns
+    level_2_patterns = result[1] if len(result) >= 2 else {}
+
+    # Patterns that would be found in BOTH contiguous and non-contiguous:
+    # ('X', 'Y') appears contiguously in: ['X', 'Y', 'Z'], ['X', 'Y', 'Z', 'W']
+    # ('Y', 'Z') appears contiguously in: ['X', 'Y', 'Z'], ['Y', 'Z', 'X'], ['X', 'Y', 'Z', 'W']
+    assert ("X", "Y") in level_2_patterns, "('X', 'Y') should be found (contiguous in 2 sequences)"
+    assert ("Y", "Z") in level_2_patterns, "('Y', 'Z') should be found (contiguous in 3 sequences)"
+
+    # Pattern that would ONLY be found in non-contiguous matching:
+    # ('X', 'Z') appears with a gap in: ['X', 'Y', 'Z'], ['X', 'Y', 'Z', 'W']
+    # and contiguously only in: ['X', 'Z']
+    # Non-contiguous support = 3 (>= threshold of 2); contiguous support would be 1 (below threshold)
+    assert ("X", "Z") in level_2_patterns, (
+        "('X', 'Z') should be found with non-contiguous matching. "
+        "This pattern has gaps in some sequences but is still ordered."
+    )
+    assert level_2_patterns[("X", "Z")] == 3, f"Expected support 3 for ('X', 'Z'), got {level_2_patterns[('X', 'Z')]}"
+
+
+def test_non_contiguous_with_longer_gaps() -> None:
+    """
+    Test non-contiguous matching with longer gaps between elements.
+
+    This demonstrates that the algorithm correctly finds patterns even when
+    there are multiple elements between the pattern elements.
+    """
+    sequences = [
+        ["A", "B", "C", "D", "E"],  # Contains A->E with 3 elements in between
+        ["A", "X", "Y", "Z", "E"],  # Contains A->E with 3 different elements in between
+        ["A", "E"],  # Contains A->E with no gap
+        ["E", "A"],  # Does NOT contain A->E (wrong order)
+    ]
+
+    gsp = GSP(sequences)
+    result = gsp.search(min_support=0.5)  # Need at least 2/4 sequences
+
+    # ('A', 'E') should be found with support = 3
+    level_2_patterns = result[1] if len(result) >= 2 else {}
+    assert ("A", "E") in level_2_patterns, "('A', 'E') should be found despite large gaps"
+    assert level_2_patterns[("A", "E")] == 3, f"Expected support 3 for ('A', 'E'), got {level_2_patterns[('A', 'E')]}"
+
+    # ('E', 'A') should NOT be found (wrong order)
+    assert ("E", "A") not in level_2_patterns, "('E', 'A') should not be found (wrong order)"
+
+
+def test_order_sensitivity() -> None:
+    """
+    Test that the algorithm is sensitive to order - patterns must appear in sequence order.
+
+    This verifies that even with non-contiguous matching, the order of elements matters.
+    """
+    sequences = [
+        ["P", "Q", "R"],  # Contains P->Q, P->R, Q->R
+        ["P", "R", "Q"],  # Contains P->R, P->Q, R->Q
+        ["Q", "P", "R"],  # Contains Q->P, Q->R, P->R
+        ["R", "Q", "P"],  # Contains R->Q, R->P, Q->P
+    ]
+
+    gsp = GSP(sequences)
+    result = gsp.search(min_support=0.5)  # Need at least 2/4 sequences
+
+    level_2_patterns = result[1] if len(result) >= 2 else {}
+
+    # ('P', 'R') appears in correct order in: ['P', 'Q', 'R'], ['P', 'R', 'Q'], ['Q', 'P', 'R']
+    assert ("P", "R") in level_2_patterns, "('P', 'R') should be found (support = 3)"
+    assert level_2_patterns[("P", "R")] == 3
+
+    # ('Q', 'P') appears in correct order in: ['Q', 'P', 'R'], ['R', 'Q', 'P']
+    assert ("Q", "P") in level_2_patterns, "('Q', 'P') should be found (support = 2)"
+    assert level_2_patterns[("Q", "P")] == 2
+
+    # ('R', 'P') appears in correct order in: ['R', 'Q', 'P']
+    # Support = 1, below threshold of 2
+    assert ("R", "P") not in level_2_patterns, "('R', 'P') should not be found (support = 1, below threshold)"
+
+
 @pytest.mark.parametrize("min_support", [0.1, 0.2, 0.3, 0.4, 0.5])
 def test_benchmark(benchmark: BenchmarkFixture, supermarket_transactions: List[List[str]], min_support: float) -> None:
     """

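The expected supports in `test_order_sensitivity` can be cross-checked by brute force. The sketch below does not call GSP-Py; it enumerates ordered pairs in each sequence with `itertools.combinations`, which preserves input order, and counts each pair once per sequence. The helper name `pair_supports` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pair_supports(sequences: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count, for every ordered pair, the number of sequences that contain it in order."""
    counts: Counter = Counter()
    for seq in sequences:
        # combinations() keeps positional order; the set avoids double-counting within one sequence.
        counts.update(set(combinations(seq, 2)))
    return dict(counts)


sequences = [
    ["P", "Q", "R"],
    ["P", "R", "Q"],
    ["Q", "P", "R"],
    ["R", "Q", "P"],
]
supports = pair_supports(sequences)
min_count = 2  # min_support=0.5 over 4 sequences

frequent = {pair: n for pair, n in supports.items() if n >= min_count}
print(supports[("P", "R")], supports[("Q", "P")], supports[("R", "P")])  # 3 2 1
print(("R", "P") in frequent)  # False
```

This only covers 2-item patterns; it is meant as a sanity check on the test's hand-computed counts, not as a replacement for the candidate-generation logic.
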
@@ -45,13 +45,19 @@ def test_is_subsequence_in_list():
     """
     Test the `is_subsequence_in_list` utility function.
     """
-    # Test when the subsequence is present
-    assert is_subsequence_in_list((1, 2), (0, 1, 2, 3)), "Failed to find subsequence"
+    # Test when the subsequence is present (contiguous)
+    assert is_subsequence_in_list((1, 2), (0, 1, 2, 3)), "Failed to find contiguous subsequence"
     assert is_subsequence_in_list((3,), (0, 1, 2, 3)), "Failed single-element subsequence"
 
-    # Test when the subsequence is not present
-    assert not is_subsequence_in_list((1, 3), (0, 1, 2, 3)), "Incorrectly found non-contiguous subsequence"
+    # Test when the subsequence is present (non-contiguous)
+    assert is_subsequence_in_list((1, 3), (0, 1, 2, 3)), "Failed to find non-contiguous subsequence"
+    assert is_subsequence_in_list((0, 2), (0, 1, 2, 3)), "Failed to find non-contiguous subsequence"
+    assert is_subsequence_in_list((0, 3), (0, 1, 2, 3)), "Failed to find non-contiguous subsequence"
+
+    # Test when the subsequence is not present (wrong order or missing elements)
+    assert not is_subsequence_in_list((3, 1), (0, 1, 2, 3)), "Incorrectly found reversed subsequence"
     assert not is_subsequence_in_list((4,), (0, 1, 2, 3)), "Incorrectly found non-existent subsequence"
+    assert not is_subsequence_in_list((2, 1), (0, 1, 2, 3)), "Incorrectly found out-of-order subsequence"
 
     # Test when input sequence or subsequence is empty
     assert not is_subsequence_in_list((), (0, 1, 2, 3)), "Incorrect positive result for empty subsequence"
@@ -61,6 +67,64 @@ def test_is_subsequence_in_list():
     assert not is_subsequence_in_list((1, 2, 3, 4), (1, 2, 3)), "Failed to reject long subsequence"
 
 
+def test_is_subsequence_contiguous_vs_non_contiguous():
+    """
+    Test cases that demonstrate the difference between contiguous and non-contiguous matching.
+
+    The current implementation uses non-contiguous (ordered) matching.
+    This test documents patterns that would differ between the two approaches.
+    """
+    # Pattern that appears with gaps (non-contiguous)
+    # In contiguous mode: would NOT match
+    # In non-contiguous mode: DOES match
+    assert is_subsequence_in_list(("a", "c"), ("a", "b", "c")), (
+        "Non-contiguous: ('a', 'c') should match in ('a', 'b', 'c')"
+    )
+    assert is_subsequence_in_list(("a", "d"), ("a", "b", "c", "d")), (
+        "Non-contiguous: ('a', 'd') should match in ('a', 'b', 'c', 'd')"
+    )
+    assert is_subsequence_in_list((1, 4), (1, 2, 3, 4, 5)), (
+        "Non-contiguous: (1, 4) should match in (1, 2, 3, 4, 5)"
+    )
+
+    # Pattern that appears contiguously (would match in both modes)
+    assert is_subsequence_in_list(("a", "b"), ("a", "b", "c")), (
+        "Contiguous: ('a', 'b') should match in ('a', 'b', 'c')"
+    )
+    assert is_subsequence_in_list((2, 3), (1, 2, 3, 4)), (
+        "Contiguous: (2, 3) should match in (1, 2, 3, 4)"
+    )
+
+    # Pattern with wrong order (would NOT match in either mode)
+    assert not is_subsequence_in_list(("c", "a"), ("a", "b", "c")), (
+        "Wrong order: ('c', 'a') should NOT match in ('a', 'b', 'c')"
+    )
+    assert not is_subsequence_in_list((3, 1), (1, 2, 3, 4)), (
+        "Wrong order: (3, 1) should NOT match in (1, 2, 3, 4)"
+    )
+
+
+def test_is_subsequence_with_gaps():
+    """
+    Test non-contiguous matching with various gap sizes.
+    """
+    # Small gap
+    assert is_subsequence_in_list(("x", "z"), ("x", "y", "z")), "Failed with 1 element gap"
+
+    # Medium gap
+    assert is_subsequence_in_list(("a", "e"), ("a", "b", "c", "d", "e")), "Failed with 3 element gap"
+
+    # Large gap
+    assert is_subsequence_in_list((1, 10), (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), "Failed with 8 element gap"
+
+    # Multiple gaps in longer pattern
+    assert is_subsequence_in_list((1, 3, 5), (1, 2, 3, 4, 5)), "Failed with multiple gaps"
+    assert is_subsequence_in_list(("a", "c", "e"), ("a", "b", "c", "d", "e")), "Failed with multiple gaps"
+
+    # No gap (adjacent elements still work)
+    assert is_subsequence_in_list((1, 2), (1, 2, 3)), "Failed with no gap (contiguous)"
+
+
 def test_generate_candidates_from_previous():
     """
     Test the `generate_candidates_from_previous` utility function.