
Commit 8367b3f

Update README to document non-contiguous pattern matching behavior
- Added ordered (non-contiguous) matching to Key Features section
- Updated sample output to reflect additional patterns detected
- Added detailed "Understanding Non-Contiguous Pattern Matching" section with example
- Updated notes to clarify patterns are matched in order, not contiguously
- Clarified examples to explain gap tolerance in pattern matching

Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
1 parent dc88d84 commit 8367b3f


1 file changed: +41 -7 lines changed


README.md

Lines changed: 41 additions & 7 deletions
@@ -44,14 +44,15 @@ principles**. Using support thresholds, GSP identifies frequent sequences of ite
4444

4545
### Key Features:
4646

47+
- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.
4748
- **Support-based pruning**: Only retains sequences that meet the minimum support threshold.
4849
- **Candidate generation**: Iteratively generates candidate sequences of increasing length.
4950
- **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
5051

5152
For example:
5253

53-
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next."
54-
- In a website clickstream, GSP might find patterns like "Users visit A, then go to B, and later proceed to C."
54+
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next", even if other items appear between those purchases.
55+
- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C", capturing user journeys with intermediate steps.
5556

5657
---
5758
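
The ordered (non-contiguous) matching named in the Key Features list can be illustrated with a short standalone check. This is a minimal sketch of the matching rule only, not GSP-Py's actual implementation; the helper name `is_ordered_subsequence` is hypothetical and does not come from the library.

```python
def is_ordered_subsequence(pattern, sequence):
    """Return True if all items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match, so each later
    # pattern item can only match at a position after the previous one.
    return all(item in it for item in pattern)


# 'B' sits between 'A' and 'C', but the order is preserved.
print(is_ordered_subsequence(('A', 'C'), ['A', 'B', 'C']))  # True
# Same items in the wrong order do not match.
print(is_ordered_subsequence(('C', 'A'), ['A', 'B', 'C']))  # False
```

The single shared iterator is what enforces ordering: a plain `item in sequence` test per item would accept `('C', 'A')` as well.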

@@ -367,24 +368,57 @@ Sample Output:
367368
```python
368369
[
369370
{('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
370-
{('Bread', 'Milk'): 3, ('Milk', 'Diaper'): 3, ('Diaper', 'Beer'): 3},
371-
{('Bread', 'Milk', 'Diaper'): 2, ('Milk', 'Diaper', 'Beer'): 2}
371+
{('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},
372+
{('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}
372373
]
373374
```
374375

375376
- The **first dictionary** contains single-item sequences with their frequencies (e.g., `('Bread',): 4` means "Bread"
376377
appears in 4 transactions).
377378
- The **second dictionary** contains 2-item sequential patterns (e.g., `('Bread', 'Milk'): 3` means the sequence "
378-
Bread → Milk" appears in 3 transactions).
379+
Bread → Milk" appears in 3 transactions). Note that patterns like `('Bread', 'Beer')` are detected even when the items are not adjacent in a transaction; they only need to appear in order.
379380
- The **third dictionary** contains 3-item sequential patterns (e.g., `('Bread', 'Milk', 'Diaper'): 2` means the
380381
sequence "Bread → Milk → Diaper" appears in 2 transactions).
381382

382383
> [!NOTE]
383-
> The **support** of a sequence is calculated as the fraction of transactions containing the sequence, e.g.,
384-
`[Bread, Milk]` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
384+
> The **support** of a sequence is calculated as the fraction of transactions containing the sequence **in order** (not necessarily contiguously), e.g.,
385+
`('Bread', 'Milk')` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
385386
> This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
386387
> behavior.
387388
389+
> [!IMPORTANT]
390+
> **Non-contiguous (ordered) matching**: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern `('Bread', 'Beer')` matches the transaction `['Bread', 'Milk', 'Diaper', 'Beer']` because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.
391+
392+
### Understanding Non-Contiguous Pattern Matching
393+
394+
GSP-Py follows the standard GSP algorithm semantics by detecting **ordered (non-contiguous)** subsequences. This means:
395+
396+
- **Order matters**: Items must appear in the specified sequence order
397+
- **Gaps allowed**: Items don't need to be adjacent
398+
- **Wrong order rejected**: Items that appear in a different order won't match
399+
400+
**Example:**
401+
402+
```python
403+
from gsppy.gsp import GSP
404+
405+
sequences = [
406+
['a', 'b', 'c'], # Contains: (a,b), (a,c), (b,c), (a,b,c)
407+
['a', 'c'], # Contains: (a,c)
408+
['b', 'c', 'a'], # Contains: (b,c), (b,a), (c,a), (b,c,a)
409+
['a', 'b', 'c', 'd'], # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.
410+
]
411+
412+
gsp = GSP(sequences)
413+
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
414+
415+
# Pattern ('a', 'c') is found with a support count of 3 because:
416+
# - It appears in ['a', 'b', 'c'] (with 'b' in between)
417+
# - It appears in ['a', 'c'] (adjacent)
418+
# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)
419+
# Total: 3 out of 4 sequences = 75% support ✅
420+
```
421+
388422

389423
> [!TIP]
390424
> For more complex examples, find example scripts in the [`gsppy/tests`](gsppy/tests) folder.
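
The support figure quoted for `('a', 'c')` in the "Understanding Non-Contiguous Pattern Matching" example can be re-derived by hand, and compared with what a strictly contiguous matcher would report on the same four sequences. The sketch below does not call GSP-Py; `matches_in_order`, `matches_contiguously`, and `support_count` are hypothetical helpers written only for this comparison.

```python
sequences = [
    ['a', 'b', 'c'],
    ['a', 'c'],
    ['b', 'c', 'a'],
    ['a', 'b', 'c', 'd'],
]


def matches_in_order(pattern, sequence):
    """Ordered, gap-tolerant match (the behavior this commit documents)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def matches_contiguously(pattern, sequence):
    """Strict match: the pattern must appear as an unbroken run."""
    n = len(pattern)
    return any(
        tuple(sequence[i:i + n]) == tuple(pattern)
        for i in range(len(sequence) - n + 1)
    )


def support_count(pattern, sequences, matcher):
    """Number of sequences that contain `pattern` according to `matcher`."""
    return sum(1 for seq in sequences if matcher(pattern, seq))


ordered = support_count(('a', 'c'), sequences, matches_in_order)
contiguous = support_count(('a', 'c'), sequences, matches_contiguously)

print(ordered, ordered / len(sequences))        # 3 0.75
print(contiguous, contiguous / len(sequences))  # 1 0.25
```

Under a 0.5 minimum support, the ordered count (3 of 4) keeps `('a', 'c')` while the contiguous count (1 of 4) would prune it, which is consistent with the sample output in this diff gaining patterns once gaps are tolerated.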
