Commit 90805b1

feat: add itemset support for co-occurrence semantics in sequence mining
2 parents 5ed3d9e + 100b540 commit 90805b1

7 files changed

+1215
-115
lines changed

README.md

Lines changed: 145 additions & 0 deletions
@@ -39,6 +39,7 @@ Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal m
- [✅ Example: Analyzing Sales Data](#example-analyzing-sales-data)
- [📊 Explanation: Support and Results](#explanation-support-and-results)
- [📊 DataFrame Input Support](#dataframe-input-support)
- [🔗 Itemset Support](#itemset-support)
- [⏱️ Temporal Constraints](#temporal-constraints)
7. [⌨️ Typing](#typing)
8. [🌟 Planned Features](#planned-features)
@@ -907,6 +908,150 @@ For complete examples and edge cases, see:

---

## 🔗 Itemset Support

GSP-Py supports **itemsets** within sequence elements, enabling you to capture **co-occurrence** of multiple items at the same time step. This is crucial for applications where items occur together rather than in strict sequential order.

### What are Itemsets?

- **Flat sequences**: `['A', 'B', 'C']` - each item occurs at a separate time step
- **Itemset sequences**: `[['A', 'B'], ['C']]` - items A and B occur together at the first time step, then C occurs later

### Why Use Itemsets?

Itemsets are essential when temporal co-occurrence matters in your domain:

- **Market basket analysis**: Customers buy multiple items in a single shopping trip, then return for more items later
- **Web analytics**: Users open multiple pages in parallel tabs before moving to the next set of pages
- **Event logs**: Multiple events can occur simultaneously in complex systems
- **Purchase patterns**: Items bought together vs. items bought in sequence

### Using Itemsets

#### Basic Example

```python
from gsppy import GSP

# Itemset format: nested lists where inner lists are items that occur together
transactions = [
    [['Bread', 'Milk'], ['Eggs']],  # Bought Bread & Milk together, then Eggs later
    [['Bread', 'Milk', 'Butter']],  # Bought all three items together
    [['Bread', 'Milk'], ['Eggs']],  # Same pattern as customer 1
]

gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)

# Pattern ('Bread',) will match any itemset containing Bread
# Pattern ('Bread', 'Eggs') will match sequences where Bread appears before Eggs
# (even if they're in different itemsets)
```
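
To make the support numbers behind this example concrete, the sketch below counts, in plain Python and without calling GSP-Py, how many customer sequences contain each item in any of their itemsets. The helper `item_support` is hypothetical and is not part of the library, and how GSP-Py turns `min_support=0.5` into a minimum count is not shown here.

```python
from collections import Counter

transactions = [
    [['Bread', 'Milk'], ['Eggs']],
    [['Bread', 'Milk', 'Butter']],
    [['Bread', 'Milk'], ['Eggs']],
]


def item_support(sequences):
    """Count the sequences in which each item appears at least once (hypothetical helper)."""
    counts = Counter()
    for sequence in sequences:
        # Collect distinct items across all itemsets so a sequence counts once per item.
        seen = {item for itemset in sequence for item in itemset}
        counts.update(seen)
    return counts


support = item_support(transactions)
for item in sorted(support):
    # Print the absolute count and the fraction of sequences containing the item.
    print(item, support[item], round(support[item] / len(transactions), 2))
```

Bread and Milk appear in all three sequences, Eggs in two, and Butter in one, so with a relative threshold of 0.5 Butter would not be expected among the frequent single items.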

#### Backward Compatibility with Flat Sequences

GSP-Py automatically normalizes flat sequences to itemsets internally, ensuring full backward compatibility:

```python
from gsppy import GSP

# These are equivalent after normalization:
flat_transactions = [['A', 'B', 'C']]  # Flat format
itemset_transactions = [[['A'], ['B'], ['C']]]  # Equivalent itemset format

# Both produce the same results
gsp1 = GSP(flat_transactions)
gsp2 = GSP(itemset_transactions)

# Patterns are identical
patterns1 = gsp1.search(min_support=0.5)
patterns2 = gsp2.search(min_support=0.5)
```
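
The normalization described above can be pictured with a small sketch. This is not GSP-Py's internal code: `normalize_sequence` is a hypothetical helper, and it assumes that itemsets are always written as lists while bare items are anything else (strings, or `(item, timestamp)` tuples).

```python
def normalize_sequence(sequence):
    """Wrap each bare item in a single-item itemset (hypothetical helper)."""
    normalized = []
    for element in sequence:
        if isinstance(element, list):
            # Already an itemset: copy it so the input is left untouched.
            normalized.append(list(element))
        else:
            # A bare item becomes an itemset of size one.
            normalized.append([element])
    return normalized


flat = ['A', 'B', 'C']
nested = [['A'], ['B'], ['C']]
print(normalize_sequence(flat) == normalize_sequence(nested))
```

Under these assumptions a flat sequence and its single-item itemset form normalize to the same structure, which is why both inputs can be mined the same way.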

### Itemset Matching Semantics

Pattern matching with itemsets uses **subset semantics**:

- A pattern element matches a sequence element if all items in the pattern element are present in the sequence element
- Example: Pattern element `['A', 'B']` matches sequence element `['A', 'B', 'C']` because {A, B} ⊆ {A, B, C}
- Pattern elements must appear in order across the sequence

```python
from gsppy import GSP

transactions = [
    [['A', 'B', 'D'], ['E'], ['C', 'F']],  # A,B,D together, then E, then C,F together
]

gsp = GSP(transactions)

# Pattern ('A', 'C') will match because:
# - 'A' is in first itemset ['A', 'B', 'D'] ✓
# - 'C' appears later in third itemset ['C', 'F'] ✓
# - Order is preserved ✓
```
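
The subset and ordering rules above can be sketched as a standalone check. This is a minimal illustration, not the library's matcher: `contains_pattern` is a hypothetical name, patterns are written as lists of itemsets, and temporal constraints (and any contiguity rules GSP-Py may apply) are ignored.

```python
def contains_pattern(pattern, sequence):
    """Return True if every pattern element is a subset of some sequence
    element, with matches occurring in strictly increasing positions."""
    position = 0
    for element in pattern:
        needed = set(element)
        # Scan forward to the earliest itemset that contains all needed items.
        while position < len(sequence) and not needed.issubset(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match a later itemset
    return True


sequence = [['A', 'B', 'D'], ['E'], ['C', 'F']]
print(contains_pattern([['A'], ['C']], sequence))   # A first, C in a later itemset
print(contains_pattern([['A', 'B']], sequence))     # {A, B} is a subset of {A, B, D}
print(contains_pattern([['C'], ['A']], sequence))   # wrong order
print(contains_pattern([['A'], ['B']], sequence))   # A and B only co-occur
```

Matching each pattern element at the earliest possible position is sufficient here because, without gap constraints, an earlier match never rules out a later one.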

### Reading Itemsets from SPM Format

The SPM/GSP format supports itemsets using delimiters:

- `-1`: End of itemset
- `-2`: End of sequence

```python
from gsppy.utils import read_transactions_from_spm

# SPM file content:
# 1 2 -1 3 -1 -2
# 1 -1 3 4 -1 -2

# Read with itemsets preserved
transactions = read_transactions_from_spm("data.txt", preserve_itemsets=True)
# Result: [[['1', '2'], ['3']], [['1'], ['3', '4']]]

# Read with itemsets flattened (backward compatible)
transactions = read_transactions_from_spm("data.txt", preserve_itemsets=False)
# Result: [['1', '2', '3'], ['1', '3', '4']]
```
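
To show how the `-1` and `-2` delimiters carve a line into itemsets, here is a minimal parsing sketch. `parse_spm_line` is a hypothetical helper written for illustration; it is not the implementation of `read_transactions_from_spm` and skips the validation a real reader would need.

```python
def parse_spm_line(line, preserve_itemsets=True):
    """Parse one SPM-formatted line into a sequence (hypothetical helper)."""
    itemsets = []
    current = []
    for token in line.split():
        if token == "-2":
            # End of sequence: ignore anything after it.
            break
        if token == "-1":
            # End of itemset: store the items collected so far.
            if current:
                itemsets.append(current)
            current = []
        else:
            current.append(token)
    if current:
        # Tolerate a final itemset that lacks its closing -1.
        itemsets.append(current)
    if preserve_itemsets:
        return itemsets
    # Flatten the itemsets into a single list of items.
    return [item for itemset in itemsets for item in itemset]


print(parse_spm_line("1 2 -1 3 -1 -2"))
print(parse_spm_line("1 2 -1 3 -1 -2", preserve_itemsets=False))
```

For the first sample line this yields `[['1', '2'], ['3']]` with itemsets preserved and `['1', '2', '3']` when flattened, in line with the results shown above.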

### Itemsets with Timestamps

Itemsets work seamlessly with temporal constraints:

```python
from gsppy import GSP

# Itemsets with timestamps: [(item, timestamp), ...]
transactions = [
    [[('Login', 0), ('Home', 0)], [('Product', 5)], [('Checkout', 10)]],
    [[('Login', 0)], [('Home', 2), ('Product', 2)], [('Checkout', 15)]],
]

# Find patterns where events in the same itemset occur together
# and subsequent itemsets occur within maxgap time units
gsp = GSP(transactions, maxgap=10)
patterns = gsp.search(min_support=0.5)
```
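
If your raw data is a flat event log, one way to arrive at timestamped itemsets like those above is to group events that share a timestamp. The sketch below assumes events are `(item, timestamp)` tuples and that equal timestamps mean co-occurrence; `group_by_timestamp` is a hypothetical helper, not a GSP-Py function.

```python
from itertools import groupby
from operator import itemgetter


def group_by_timestamp(events):
    """Group (item, timestamp) events into itemsets of equal timestamp (hypothetical helper)."""
    # Sort by timestamp first; groupby only merges adjacent equal keys.
    ordered = sorted(events, key=itemgetter(1))
    return [list(group) for _, group in groupby(ordered, key=itemgetter(1))]


events = [('Home', 2), ('Login', 0), ('Product', 2), ('Checkout', 15)]
session = group_by_timestamp(events)
print(session)
```

Because Python's sort is stable, events with the same timestamp keep their original relative order inside each itemset.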

### Complete Example

See [examples/itemset_example.py](examples/itemset_example.py) for comprehensive examples including:

- Market basket analysis with itemsets
- Web clickstream with parallel page views
- Comparison of flat vs. itemset semantics
- Reading and processing SPM format files

### Key Takeaways

- **Itemsets capture co-occurrence** of items at the same time step
- **Flat sequences are automatically normalized** to itemsets internally
- **Both formats work seamlessly** with GSP-Py
- **Use itemsets when temporal co-occurrence matters** in your domain
- **SPM format supports** both flat and itemset representations

---

## ⏱️ Temporal Constraints

GSP-Py supports **time-constrained sequential pattern mining** with three powerful temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.

examples/itemset_example.py

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
"""
Example demonstrating itemset support in GSP-Py.

This example shows how to use GSP-Py with itemsets, where multiple items
can occur together at the same time step in a sequence.

Key concepts:
1. Flat sequences: ['A', 'B', 'C'] - each item at separate time steps
2. Itemset sequences: [['A', 'B'], ['C']] - A and B occur together, then C

Author: Jackson Antonio do Prado Lima
"""

from gsppy import GSP


def example_flat_vs_itemset():
    """
    Demonstrate the difference between flat and itemset representations.
    """
    print("=" * 80)
    print("EXAMPLE 1: Flat vs Itemset Sequences")
    print("=" * 80)

    # Flat sequences - each item happens at a separate time step
    print("\n1a. Flat sequences (traditional format):")
    flat_transactions = [
        ['A', 'B', 'C'],  # A, then B, then C
        ['A', 'C'],  # A, then C
        ['A', 'B', 'C'],  # A, then B, then C
    ]
    print(f" Transactions: {flat_transactions}")

    gsp_flat = GSP(flat_transactions)
    patterns_flat = gsp_flat.search(min_support=0.66)

    print(" Frequent patterns (min_support=0.66):")
    for i, level_patterns in enumerate(patterns_flat, start=1):
        print(f" {i}-sequences: {level_patterns}")

    # Itemset sequences - items in same list occur together
    print("\n1b. Itemset sequences:")
    itemset_transactions = [
        [['A', 'B'], ['C']],  # A and B together, then C
        [['A'], ['C']],  # A, then C
        [['A', 'B'], ['C']],  # A and B together, then C
    ]
    print(f" Transactions: {itemset_transactions}")

    gsp_itemset = GSP(itemset_transactions)
    patterns_itemset = gsp_itemset.search(min_support=0.66)

    print(" Frequent patterns (min_support=0.66):")
    for i, level_patterns in enumerate(patterns_itemset, start=1):
        print(f" {i}-sequences: {level_patterns}")


def example_market_basket():
    """
    Real-world example: Market basket analysis with itemsets.

    Customers can buy multiple items in a single transaction, then return
    to buy more items in subsequent transactions.
    """
    print("\n" + "=" * 80)
    print("EXAMPLE 2: Market Basket Analysis with Itemsets")
    print("=" * 80)

    # Each customer's purchase history
    # Nested lists represent items bought together (same shopping trip)
    transactions = [
        # Customer 1: Bought bread & milk together, then came back for eggs
        [['Bread', 'Milk'], ['Eggs']],

        # Customer 2: Bought bread, milk & butter together
        [['Bread', 'Milk', 'Butter']],

        # Customer 3: Bought bread & milk together, then eggs later
        [['Bread', 'Milk'], ['Eggs']],

        # Customer 4: Bought bread & milk together, then eggs & butter together
        [['Bread', 'Milk'], ['Eggs', 'Butter']],
    ]

    print("\nCustomer transaction history:")
    for i, tx in enumerate(transactions, start=1):
        print(f" Customer {i}: {tx}")

    gsp = GSP(transactions)
    patterns = gsp.search(min_support=0.5)

    print("\nFrequent patterns (min_support=0.5, i.e., 2+ customers):")
    for i, level_patterns in enumerate(patterns, start=1):
        print(f"\n {i}-sequences:")
        for pattern, support in level_patterns.items():
            print(f" {pattern} - appears in {support} customer histories")

    # Insights
    print("\n📊 Insights:")
    print(" - Customers who buy Bread and Milk often return to buy Eggs later")
    print(" - This is different from 'Bread, then Milk, then Eggs' pattern")
    print(" - Itemsets capture co-occurrence (items bought together)")


def example_clickstream():
    """
    Example: Web analytics with itemsets.

    Users can view multiple pages in parallel (multiple tabs) before
    moving to the next set of pages.
    """
    print("\n" + "=" * 80)
    print("EXAMPLE 3: Web Clickstream with Parallel Page Views")
    print("=" * 80)

    # User sessions with parallel page views
    sessions = [
        # User 1: Opened Home & Products in tabs, then viewed Checkout
        [['Home', 'Products'], ['Checkout']],

        # User 2: Home and Products together, then Cart, then Checkout
        [['Home', 'Products'], ['Cart'], ['Checkout']],

        # User 3: Home page, then Products & Cart together, then Checkout
        [['Home'], ['Products', 'Cart'], ['Checkout']],

        # User 4: Home & Products together, then Checkout
        [['Home', 'Products'], ['Checkout']],
    ]

    print("\nUser sessions (parallel page views):")
    for i, session in enumerate(sessions, start=1):
        print(f" User {i}: {session}")

    gsp = GSP(sessions)
    patterns = gsp.search(min_support=0.5)

    print("\nFrequent navigation patterns (min_support=0.5):")
    for i, level_patterns in enumerate(patterns, start=1):
        if level_patterns:
            print(f"\n {i}-sequences:")
            for pattern, support in level_patterns.items():
                print(f" {pattern} - in {support} sessions")


def example_spm_format():
    """
    Example: Reading itemsets from SPM format files.

    SPM format uses delimiters:
    - `-1` marks end of itemset
    - `-2` marks end of sequence
    """
    print("\n" + "=" * 80)
    print("EXAMPLE 4: Reading Itemsets from SPM Format")
    print("=" * 80)

    import tempfile
    import os
    from gsppy.utils import read_transactions_from_spm

    # Create a temporary SPM file with itemsets
    spm_content = """1 2 -1 3 -1 -2
1 -1 3 4 -1 -2
1 2 -1 3 -1 -2"""

    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write(spm_content)
        temp_path = f.name

    try:
        print(f"\nSPM file content:\n{spm_content}")

        # Read with itemsets flattened (backward compatible)
        print("\nReading with itemsets flattened (preserve_itemsets=False):")
        flat_txs = read_transactions_from_spm(temp_path, preserve_itemsets=False)
        for i, tx in enumerate(flat_txs, start=1):
            print(f" Transaction {i}: {tx}")

        # Read with itemsets preserved
        print("\nReading with itemsets preserved (preserve_itemsets=True):")
        itemset_txs = read_transactions_from_spm(temp_path, preserve_itemsets=True)
        for i, tx in enumerate(itemset_txs, start=1):
            print(f" Transaction {i}: {tx}")

        # Use in GSP
        print("\nRunning GSP on itemset data:")
        gsp = GSP(itemset_txs)
        patterns = gsp.search(min_support=0.66)
        print(f" Frequent patterns: {patterns}")

    finally:
        # Always remove the temporary file, even if reading or mining fails
        os.unlink(temp_path)


if __name__ == '__main__':
    # Run all examples
    example_flat_vs_itemset()
    example_market_basket()
    example_clickstream()
    example_spm_format()

    print("\n" + "=" * 80)
    print("Summary:")
    print("=" * 80)
    print("✓ Itemsets capture co-occurrence of items at the same time step")
    print("✓ Flat sequences are automatically normalized to itemsets internally")
    print("✓ Both formats work seamlessly with GSP-Py")
    print("✓ Use itemsets when temporal co-occurrence matters in your domain")
    print("=" * 80)
