48 changes: 41 additions & 7 deletions README.md
@@ -44,14 +44,15 @@ principles**. Using support thresholds, GSP identifies frequent sequences of ite

### Key Features:

- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.
- **Support-based pruning**: Only retains sequences that meet the minimum support threshold.
- **Candidate generation**: Iteratively generates candidate sequences of increasing length.
- **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.

For example:

- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next."
- In a website clickstream, GSP might find patterns like "Users visit A, then go to B, and later proceed to C."
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next" - even if other items appear between bread and milk.
- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C" - capturing user journeys with intermediate steps.

---

@@ -367,24 +368,57 @@ Sample Output:
```python
[
{('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
{('Bread', 'Milk'): 3, ('Milk', 'Diaper'): 3, ('Diaper', 'Beer'): 3},
{('Bread', 'Milk', 'Diaper'): 2, ('Milk', 'Diaper', 'Beer'): 2}
{('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},
{('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}
]
```

- The **first dictionary** contains single-item sequences with their frequencies (e.g., `('Bread',): 4` means "Bread"
appears in 4 transactions).
- The **second dictionary** contains 2-item sequential patterns (e.g., `('Bread', 'Milk'): 3` means the sequence "
Bread → Milk" appears in 3 transactions).
Bread → Milk" appears in 3 transactions). Note that patterns like `('Bread', 'Beer')` are detected even when they don't appear adjacent in transactions - they just need to appear in order.
- The **third dictionary** contains 3-item sequential patterns (e.g., `('Bread', 'Milk', 'Diaper'): 2` means the
sequence "Bread → Milk → Diaper" appears in 2 transactions).

> [!NOTE]
> The **support** of a sequence is calculated as the fraction of transactions containing the sequence, e.g.,
`[Bread, Milk]` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
> The **support** of a sequence is calculated as the fraction of transactions containing the sequence **in order** (not necessarily contiguously), e.g.,
`('Bread', 'Milk')` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
> This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
> behavior.

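As a rough illustration of the support arithmetic in the note above, the sketch below turns an absolute count into a support fraction and compares it with a threshold. The helper names `support` and `is_frequent` are hypothetical, and the `>=` comparison is an assumption about how a threshold is typically applied, not a statement about GSP-Py's internals.

```python
def support(count: int, total: int) -> float:
    """Fraction of transactions that contain the pattern in order."""
    return count / total


def is_frequent(count: int, total: int, min_support: float) -> bool:
    """Assumed rule: a pattern is kept when its support reaches the threshold."""
    return support(count, total) >= min_support


# ('Bread', 'Milk') is reported in 3 of 5 transactions in the sample output.
print(support(3, 5))           # 0.6
print(is_frequent(2, 5, 0.3))  # a count of 2 out of 5 clears a 0.3 threshold
print(is_frequent(1, 5, 0.3))  # a count of 1 out of 5 does not
```

Under this assumed rule, a `min_support` of 0.3 over 5 transactions keeps any pattern seen in at least 2 of them, which is consistent with the counts shown in the sample output.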
> [!IMPORTANT]
> **Non-contiguous (ordered) matching**: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern `('Bread', 'Beer')` matches the transaction `['Bread', 'Milk', 'Diaper', 'Beer']` because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.

### Understanding Non-Contiguous Pattern Matching

GSP-Py follows the standard GSP algorithm semantics by detecting **ordered (non-contiguous)** subsequences. This means:

- ✅ **Order matters**: Items must appear in the specified sequence order
- ✅ **Gaps allowed**: Items don't need to be adjacent
- ❌ **Wrong order rejected**: Items appearing in different order won't match

**Example:**

```python
from gsppy.gsp import GSP

sequences = [
['a', 'b', 'c'], # Contains: (a,b), (a,c), (b,c), (a,b,c)
['a', 'c'], # Contains: (a,c)
['b', 'c', 'a'], # Contains: (b,c), (b,a), (c,a)
['a', 'b', 'c', 'd'], # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.
]

gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences

# Pattern ('a', 'c') is found with support=3 because:
# - It appears in ['a', 'b', 'c'] (with 'b' in between)
# - It appears in ['a', 'c'] (adjacent)
# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)
# Total: 3 out of 4 sequences = 75% support ✅
```
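
The counts in the comments above can be cross-checked without the library. The following is a minimal brute-force sketch, not GSP itself: it enumerates every ordered 2-item pattern per sequence with `itertools.combinations` and counts each pattern at most once per sequence. The names `ordered_pairs`, `pair_counts`, and `min_count` are illustrative only.

```python
from collections import Counter
from itertools import combinations

sequences = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c", "a"],
    ["a", "b", "c", "d"],
]


def ordered_pairs(sequence):
    """Distinct ordered (possibly gapped) 2-item patterns in one sequence."""
    # combinations() preserves input order, so (x, y) means x occurs before y.
    return set(combinations(sequence, 2))


pair_counts = Counter()
for seq in sequences:
    # Using a set per sequence means a pattern counts once per sequence.
    pair_counts.update(ordered_pairs(seq))

min_count = 2  # 0.5 * 4 sequences
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(frequent)
```

Here `('a', 'c')` is counted in three sequences, while `('c', 'a')` occurs only in `['b', 'c', 'a']` and falls below the threshold. Brute-force enumeration like this grows quickly with sequence length, so it is only suitable for checking small examples.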


> [!TIP]
> For more complex examples, find example scripts in the [`gsppy/tests`](gsppy/tests) folder.
29 changes: 23 additions & 6 deletions gsppy/utils.py
@@ -5,14 +5,14 @@

The key functionalities include:
1. Splitting a list of items into smaller batches for easier processing.
2. Checking for the existence of a contiguous subsequence within a sequence,
2. Checking for the existence of an ordered (non-contiguous) subsequence within a sequence,
with caching to optimize repeated comparisons.
3. Generating candidate patterns from a dictionary of frequent patterns
to support pattern generation tasks in algorithms like sequence mining.

Main functionalities:
- `split_into_batches`: Splits a list of items into smaller batches based on a specified batch size.
- `is_subsequence_in_list`: Determines if a subsequence exists within another sequence,
- `is_subsequence_in_list`: Determines if a subsequence exists within another sequence in order,
using caching to improve performance.
- `generate_candidates_from_previous`: Generates candidate patterns by joining previously
identified frequent patterns.
@@ -46,27 +46,44 @@ def split_into_batches(
@lru_cache(maxsize=None)
def is_subsequence_in_list(subsequence: Tuple[str, ...], sequence: Tuple[str, ...]) -> bool:
"""
Check if a subsequence exists within a sequence as a contiguous subsequence.
Check if a subsequence exists within a sequence as an ordered (non-contiguous) subsequence.

This function implements the standard GSP semantics where items in the subsequence
must appear in the same order in the sequence, but not necessarily contiguously.

Parameters:
subsequence (tuple): The sequence to search for.
sequence (tuple): The sequence to search within.

Returns:
bool: True if the subsequence is found, False otherwise.

Examples:
>>> is_subsequence_in_list(('a', 'c'), ('a', 'b', 'c'))
True
>>> is_subsequence_in_list(('a', 'c'), ('c', 'a'))
False
>>> is_subsequence_in_list(('a', 'b'), ('a', 'b', 'c'))
True
"""
# Handle the case where the subsequence is empty - it should not exist in any sequence
if not subsequence:
return False

len_sub, len_seq = len(subsequence), len(sequence)

# Return False if the sequence is longer than the list
# Return False if the subsequence is longer than the sequence
if len_sub > len_seq:
return False

# Use any to check if any slice matches the sequence
return any(sequence[i : i + len_sub] == subsequence for i in range(len_seq - len_sub + 1))
# Use two-pointer approach to check if subsequence exists in order
sub_idx = 0
for seq_idx in range(len_seq):
if sequence[seq_idx] == subsequence[sub_idx]:
sub_idx += 1
if sub_idx == len_sub:
return True
return False

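For comparison, the same ordered (non-contiguous) check can be written with a single shared iterator, a common Python idiom. This is an alternative sketch, not the code in this change: the name `is_ordered_subsequence` is hypothetical, it omits the `lru_cache` decorator, and it keeps the convention above that an empty subsequence never matches.

```python
from typing import Hashable, Sequence


def is_ordered_subsequence(subsequence: Sequence[Hashable], sequence: Sequence[Hashable]) -> bool:
    """Return True if the items of `subsequence` occur in `sequence` in order, gaps allowed."""
    if not subsequence:
        return False  # mirror the convention used in the diff for empty patterns
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to and including the first match,
    # so each later item is searched for only in what comes after the previous match.
    return all(item in remaining for item in subsequence)


print(is_ordered_subsequence(("a", "c"), ("a", "b", "c")))  # True
print(is_ordered_subsequence(("a", "c"), ("c", "a")))       # False
```

Both versions scan the sequence once. The explicit two-pointer loop is arguably easier to read and step through, while the iterator form is shorter.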

def generate_candidates_from_previous(prev_patterns: Dict[Tuple[str, ...], int]) -> List[Tuple[str, ...]]:
144 changes: 142 additions & 2 deletions tests/test_gsp.py
@@ -168,13 +168,28 @@ def test_frequent_patterns(supermarket_transactions: List[List[str]]) -> None:

Asserts:
- The frequent patterns should match the expected result.
- Non-contiguous patterns are correctly detected.
"""
gsp = GSP(supermarket_transactions)
result = gsp.search(min_support=0.3)
expected = [
{("Bread",): 4, ("Milk",): 4, ("Diaper",): 4, ("Beer",): 3, ("Coke",): 2},
{("Bread", "Milk"): 3, ("Milk", "Diaper"): 3, ("Diaper", "Beer"): 3},
{("Bread", "Milk", "Diaper"): 2, ("Milk", "Diaper", "Beer"): 2},
{
("Bread", "Milk"): 3,
("Bread", "Diaper"): 3,
("Bread", "Beer"): 2,
("Milk", "Diaper"): 3,
("Milk", "Beer"): 2,
("Milk", "Coke"): 2,
("Diaper", "Beer"): 3,
("Diaper", "Coke"): 2,
},
{
("Bread", "Milk", "Diaper"): 2,
("Bread", "Diaper", "Beer"): 2,
("Milk", "Diaper", "Beer"): 2,
("Milk", "Diaper", "Coke"): 2,
},
]
assert result == expected, "Frequent patterns do not match expected results."

@@ -231,6 +246,131 @@ def test_partial_match(supermarket_transactions: List[List[str]]) -> None:
assert result_level_2 >= expected_patterns_level_2, f"Level 2 patterns mismatch. Got {result_level_2}"


def test_non_contiguous_subsequences() -> None:
"""
Test the GSP algorithm correctly detects non-contiguous subsequences (Issue #115).

This test validates that patterns like ('a', 'c') are detected even when
they appear with gaps in sequences like ['a', 'b', 'c'].

Asserts:
- Non-contiguous patterns are correctly identified with proper support.
"""
sequences = [
["a", "b", "c"],
["a", "c"],
["b", "c", "a"],
["a", "b", "c", "d"],
]

gsp = GSP(sequences)
result = gsp.search(min_support=0.5)

# Expected: ('a', 'c') should be found with support = 3
# It appears in: ['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'c', 'd']
assert len(result) >= 2, "Expected at least 2 levels of patterns"

level_2_patterns = result[1]
assert ("a", "c") in level_2_patterns, f"Pattern ('a', 'c') not found in level 2. Got {level_2_patterns}"
assert level_2_patterns[("a", "c")] == 3, f"Expected support 3 for ('a', 'c'), got {level_2_patterns[('a', 'c')]}"


def test_contiguous_vs_non_contiguous_patterns() -> None:
"""
Comprehensive test demonstrating the difference between contiguous and non-contiguous patterns.

This test shows patterns that would ONLY be found in non-contiguous matching (current implementation)
vs patterns that would be found in BOTH contiguous and non-contiguous matching.

The current implementation uses non-contiguous (ordered) matching, which is the standard GSP behavior.
"""
sequences = [
["X", "Y", "Z"], # Contains X->Y, Y->Z, X->Z (contiguous: X->Y, Y->Z only)
["X", "Z"], # Contains X->Z (contiguous: X->Z)
["Y", "Z", "X"], # Contains Y->Z, Y->X, Z->X (contiguous: Y->Z, Z->X only)
["X", "Y", "Z", "W"], # Contains many patterns
]

gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences

# Level 2 patterns
level_2_patterns = result[1] if len(result) >= 2 else {}

# Patterns that would be found in BOTH contiguous and non-contiguous:
# ('X', 'Y') appears contiguously in: ['X', 'Y', 'Z'], ['X', 'Y', 'Z', 'W']
# ('Y', 'Z') appears contiguously in: ['X', 'Y', 'Z'], ['Y', 'Z', 'X'], ['X', 'Y', 'Z', 'W']
assert ("X", "Y") in level_2_patterns, "('X', 'Y') should be found (contiguous in 2 sequences)"
assert ("Y", "Z") in level_2_patterns, "('Y', 'Z') should be found (contiguous in 3 sequences)"

# Pattern that would ONLY be found in non-contiguous matching:
# ('X', 'Z') appears with gap in: ['X', 'Y', 'Z'], ['X', 'Y', 'Z', 'W']
# and contiguously in: ['X', 'Z']
# Total support = 3 (>= 2 threshold)
assert ("X", "Z") in level_2_patterns, (
"('X', 'Z') should be found with non-contiguous matching. "
"This pattern has gaps in some sequences but is still ordered."
)
assert level_2_patterns[("X", "Z")] == 3, f"Expected support 3 for ('X', 'Z'), got {level_2_patterns[('X', 'Z')]}"


def test_non_contiguous_with_longer_gaps() -> None:
"""
Test non-contiguous matching with longer gaps between elements.

This demonstrates that the algorithm correctly finds patterns even when
there are multiple elements between the pattern elements.
"""
sequences = [
["A", "B", "C", "D", "E"], # Contains A->E with 3 elements in between
["A", "X", "Y", "Z", "E"], # Contains A->E with 3 different elements in between
["A", "E"], # Contains A->E with no gap
["E", "A"], # Does NOT contain A->E (wrong order)
]

gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences

# ('A', 'E') should be found with support = 3
level_2_patterns = result[1] if len(result) >= 2 else {}
assert ("A", "E") in level_2_patterns, "('A', 'E') should be found despite large gaps"
assert level_2_patterns[("A", "E")] == 3, f"Expected support 3 for ('A', 'E'), got {level_2_patterns[('A', 'E')]}"

# ('E', 'A') should NOT be found (wrong order)
assert ("E", "A") not in level_2_patterns, "('E', 'A') should not be found (wrong order)"


def test_order_sensitivity() -> None:
"""
Test that the algorithm is sensitive to order - patterns must appear in sequence order.

This verifies that even with non-contiguous matching, the order of elements matters.
"""
sequences = [
["P", "Q", "R"], # Contains P->Q, P->R, Q->R
["P", "R", "Q"], # Contains P->R, P->Q, R->Q
["Q", "P", "R"], # Contains Q->P, Q->R, P->R
["R", "Q", "P"], # Contains R->Q, R->P, Q->P
]

gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences

level_2_patterns = result[1] if len(result) >= 2 else {}

# ('P', 'R') appears in correct order in: ['P', 'Q', 'R'], ['P', 'R', 'Q'], ['Q', 'P', 'R']
assert ("P", "R") in level_2_patterns, "('P', 'R') should be found (support = 3)"
assert level_2_patterns[("P", "R")] == 3

# ('Q', 'P') appears in correct order in: ['Q', 'P', 'R'], ['R', 'Q', 'P']
assert ("Q", "P") in level_2_patterns, "('Q', 'P') should be found (support = 2)"
assert level_2_patterns[("Q", "P")] == 2

# ('R', 'P') appears in correct order in: ['R', 'Q', 'P']
# Support = 1, below threshold of 2
assert ("R", "P") not in level_2_patterns, "('R', 'P') should not be found (support = 1, below threshold)"


@pytest.mark.parametrize("min_support", [0.1, 0.2, 0.3, 0.4, 0.5])
def test_benchmark(benchmark: BenchmarkFixture, supermarket_transactions: List[List[str]], min_support: float) -> None:
"""
72 changes: 68 additions & 4 deletions tests/test_utils.py
@@ -45,13 +45,19 @@ def test_is_subsequence_in_list():
"""
Test the `is_subsequence_in_list` utility function.
"""
# Test when the subsequence is present
assert is_subsequence_in_list((1, 2), (0, 1, 2, 3)), "Failed to find subsequence"
# Test when the subsequence is present (contiguous)
assert is_subsequence_in_list((1, 2), (0, 1, 2, 3)), "Failed to find contiguous subsequence"
assert is_subsequence_in_list((3,), (0, 1, 2, 3)), "Failed single-element subsequence"

# Test when the subsequence is not present
assert not is_subsequence_in_list((1, 3), (0, 1, 2, 3)), "Incorrectly found non-contiguous subsequence"
# Test when the subsequence is present (non-contiguous)
assert is_subsequence_in_list((1, 3), (0, 1, 2, 3)), "Failed to find non-contiguous subsequence"
assert is_subsequence_in_list((0, 2), (0, 1, 2, 3)), "Failed to find non-contiguous subsequence"
assert is_subsequence_in_list((0, 3), (0, 1, 2, 3)), "Failed to find non-contiguous subsequence"

# Test when the subsequence is not present (wrong order or missing elements)
assert not is_subsequence_in_list((3, 1), (0, 1, 2, 3)), "Incorrectly found reversed subsequence"
assert not is_subsequence_in_list((4,), (0, 1, 2, 3)), "Incorrectly found non-existent subsequence"
assert not is_subsequence_in_list((2, 1), (0, 1, 2, 3)), "Incorrectly found out-of-order subsequence"

# Test when input sequence or subsequence is empty
assert not is_subsequence_in_list((), (0, 1, 2, 3)), "Incorrect positive result for empty subsequence"
@@ -61,6 +67,64 @@ def test_is_subsequence_in_list():
assert not is_subsequence_in_list((1, 2, 3, 4), (1, 2, 3)), "Failed to reject long subsequence"


def test_is_subsequence_contiguous_vs_non_contiguous():
"""
Test cases that demonstrate the difference between contiguous and non-contiguous matching.

The current implementation uses non-contiguous (ordered) matching.
This test documents patterns that would differ between the two approaches.
"""
# Pattern that appears with gaps (non-contiguous)
# In contiguous mode: would NOT match
# In non-contiguous mode: DOES match
assert is_subsequence_in_list(("a", "c"), ("a", "b", "c")), (
"Non-contiguous: ('a', 'c') should match in ('a', 'b', 'c')"
)
assert is_subsequence_in_list(("a", "d"), ("a", "b", "c", "d")), (
"Non-contiguous: ('a', 'd') should match in ('a', 'b', 'c', 'd')"
)
assert is_subsequence_in_list((1, 4), (1, 2, 3, 4, 5)), (
"Non-contiguous: (1, 4) should match in (1, 2, 3, 4, 5)"
)

# Pattern that appears contiguously (would match in both modes)
assert is_subsequence_in_list(("a", "b"), ("a", "b", "c")), (
"Contiguous: ('a', 'b') should match in ('a', 'b', 'c')"
)
assert is_subsequence_in_list((2, 3), (1, 2, 3, 4)), (
"Contiguous: (2, 3) should match in (1, 2, 3, 4)"
)

# Pattern with wrong order (would NOT match in either mode)
assert not is_subsequence_in_list(("c", "a"), ("a", "b", "c")), (
"Wrong order: ('c', 'a') should NOT match in ('a', 'b', 'c')"
)
assert not is_subsequence_in_list((3, 1), (1, 2, 3, 4)), (
"Wrong order: (3, 1) should NOT match in (1, 2, 3, 4)"
)

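The "two modes" these assertions describe can be made concrete by running both matching rules side by side. The sketch below is illustrative only: `matches_contiguous` approximates the slice-based check that this change replaces, `matches_ordered` approximates the new two-pointer check, and both names are hypothetical.

```python
def matches_contiguous(pattern, sequence):
    """Slice-based rule: the pattern must appear as one unbroken run."""
    n = len(pattern)
    if n == 0:
        return False
    return any(
        tuple(sequence[i : i + n]) == tuple(pattern)
        for i in range(len(sequence) - n + 1)
    )


def matches_ordered(pattern, sequence):
    """Two-pointer rule: the pattern's items must appear in order, gaps allowed."""
    if not pattern:
        return False
    idx = 0
    for item in sequence:
        if item == pattern[idx]:
            idx += 1
            if idx == len(pattern):
                return True
    return False


cases = [
    (("a", "c"), ("a", "b", "c")),  # gap: only the ordered rule matches
    (("a", "b"), ("a", "b", "c")),  # adjacent: both rules match
    (("c", "a"), ("a", "b", "c")),  # wrong order: neither rule matches
]
for pattern, sequence in cases:
    print(pattern, matches_contiguous(pattern, sequence), matches_ordered(pattern, sequence))
```

Every contiguous match is also an ordered match, so switching rules can only add patterns or raise their counts. This is why the expected results in `test_frequent_patterns` grow rather than change.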

def test_is_subsequence_with_gaps():
"""
Test non-contiguous matching with various gap sizes.
"""
# Small gap
assert is_subsequence_in_list(("x", "z"), ("x", "y", "z")), "Failed with 1 element gap"

# Medium gap
assert is_subsequence_in_list(("a", "e"), ("a", "b", "c", "d", "e")), "Failed with 3 element gap"

# Large gap
assert is_subsequence_in_list((1, 10), (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), "Failed with 8 element gap"

# Multiple gaps in longer pattern
assert is_subsequence_in_list((1, 3, 5), (1, 2, 3, 4, 5)), "Failed with multiple gaps"
assert is_subsequence_in_list(("a", "c", "e"), ("a", "b", "c", "d", "e")), "Failed with multiple gaps"

# No gap (adjacent elements still work)
assert is_subsequence_in_list((1, 2), (1, 2, 3)), "Failed with no gap (contiguous)"


def test_generate_candidates_from_previous():
"""
Test the `generate_candidates_from_previous` utility function.