Update README to document non-contiguous pattern matching behavior
- Added ordered (non-contiguous) matching to Key Features section
- Updated sample output to reflect additional patterns detected
- Added detailed "Understanding Non-Contiguous Pattern Matching" section with example
- Updated notes to clarify patterns are matched in order, not contiguously
- Clarified examples to explain gap tolerance in pattern matching
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
README.md: 41 additions & 7 deletions
@@ -44,14 +44,15 @@ principles**. Using support thresholds, GSP identifies frequent sequences of ite

 ### Key Features:

+- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.
 - **Support-based pruning**: Only retains sequences that meet the minimum support threshold.
 - **Candidate generation**: Iteratively generates candidate sequences of increasing length.
 - **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.

 For example:

-- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next."
-- In a website clickstream, GSP might find patterns like "Users visit A, then go to B, and later proceed to C."
+- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next" - even if other items appear between bread and milk.
+- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C" - capturing user journeys with intermediate steps.
 - The **first dictionary** contains single-item sequences with their frequencies (e.g., `('Bread',): 4` means "Bread"
 appears in 4 transactions).
 - The **second dictionary** contains 2-item sequential patterns (e.g., `('Bread', 'Milk'): 3` means the sequence "
-Bread → Milk" appears in 3 transactions).
+Bread → Milk" appears in 3 transactions). Note that patterns like `('Bread', 'Beer')` are detected even when they don't appear adjacent in transactions - they just need to appear in order.
 - The **third dictionary** contains 3-item sequential patterns (e.g., `('Bread', 'Milk', 'Diaper'): 2` means the
 sequence "Bread → Milk → Diaper" appears in 2 transactions).

 > [!NOTE]
-> The **support** of a sequence is calculated as the fraction of transactions containing the sequence, e.g.,
-`[Bread, Milk]` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
+> The **support** of a sequence is calculated as the fraction of transactions containing the sequence **in order** (not necessarily contiguously), e.g.,
+`('Bread', 'Milk')` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
 > This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
 > behavior.

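As a rough illustration of the support calculation described in the note, the sketch below counts ordered (not necessarily contiguous) matches over five hypothetical transactions. The helper names `contains_in_order` and `support` and the sample data are assumptions made for this example; they are not GSP-Py's API or its internal implementation.

```python
def contains_in_order(transaction, pattern):
    """Return True if `pattern` occurs in `transaction` in order, gaps allowed."""
    remaining = iter(transaction)
    # `item in remaining` consumes the iterator up to the first match,
    # so each pattern item must be found after the previous one.
    return all(item in remaining for item in pattern)


def support(transactions, pattern):
    """Fraction of transactions that contain `pattern` in order."""
    matches = sum(contains_in_order(t, pattern) for t in transactions)
    return matches / len(transactions)


# Hypothetical sample data, chosen to mirror the counts quoted in the note.
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]

print(support(transactions, ("Bread", "Milk")))  # 0.6
print(support(transactions, ("Bread", "Beer")))  # 0.4
```

With this data, `('Bread', 'Beer')` is counted in two transactions even though the two items are never adjacent in either of them.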
+> [!IMPORTANT]
+> **Non-contiguous (ordered) matching**: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern `('Bread', 'Beer')` matches the transaction `['Bread', 'Milk', 'Diaper', 'Beer']` because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.
+
+### Understanding Non-Contiguous Pattern Matching
+
+GSP-Py follows the standard GSP algorithm semantics by detecting **ordered (non-contiguous)** subsequences. This means:
+
+- ✅ **Order matters**: Items must appear in the specified sequence order
+- ✅ **Gaps allowed**: Items don't need to be adjacent
+- ❌ **Wrong order rejected**: Items appearing in a different order won't match
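
The three rules above can be sketched with a single left-to-right scan. The functions `matches_in_order` and `matches_contiguously` are hypothetical helpers written for this illustration, not GSP-Py functions; the second one is included only to contrast ordered matching with strict adjacency.

```python
def matches_in_order(pattern, sequence):
    """Ordered, non-contiguous match: gaps between pattern items are allowed."""
    pos = 0  # index of the next pattern item still to be found
    for item in sequence:
        if pos < len(pattern) and item == pattern[pos]:
            pos += 1
    return pos == len(pattern)


def matches_contiguously(pattern, sequence):
    """Strict match: pattern items must be adjacent (shown for contrast only)."""
    n = len(pattern)
    return any(
        tuple(sequence[i:i + n]) == tuple(pattern)
        for i in range(len(sequence) - n + 1)
    )


sequence = ["A", "B", "C"]

print(matches_in_order(("A", "C"), sequence))      # True: order kept, gap allowed
print(matches_in_order(("C", "A"), sequence))      # False: wrong order
print(matches_contiguously(("A", "C"), sequence))  # False: "B" sits in between
```

The same scan accepts `('Bread', 'Beer')` against `['Bread', 'Milk', 'Diaper', 'Beer']`, which is the case highlighted in the note above.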