Three dedup improvements to eliminate ~50% of duplicate discoveries:
1. Trend dedup by metric + direction (ignore change_pct): prevents the
same trend from re-surfacing daily as the lookback window shifts
2. Outcome-based cross-segment dedup (±10pp): catches mirror discoveries
from definitionally coupled segments (sleep_duration vs time_in_bed)
3. Metric alias groups: hrv/hrv_daily, steps/intraday_steps,
sleep_duration/time_in_bed treated as equivalent during dedup
Existing discovery dedup also updated with alias + trend awareness.
32 pattern-ranker tests (14 new), 21 pattern-spotter vitest passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Empirical review of user 3597587c's 67 active discoveries (2026-03-23) revealed ~34 are duplicates or near-duplicates that the current dedup fails to catch. Three root causes:

**Problem 1: Trends re-surface daily with different change_pct values**

The same trend (e.g., "workout frequency increasing") appears 4 times across 4 days because the lookback window shifts and the change_pct swings (79% → 2700% → 1300% → 833%). The dedup requires `change_pct` within ±5 percentage points, but these values are 10-100x apart.

Examples from real data:
- "Workout Frequency Has Increased": 4 copies (March 19-22)
- "HRV Decreasing": 2 copies per metric variant
- "Steps Increasing": 4 copies across steps/intraday_steps
- "High Activity Increasing": 3 copies

**Problem 2: Coupled segment metrics produce mirror discoveries**

`sleep_duration` and `time_in_bed` segment days nearly identically (r≈0.95). Every discovery from one is duplicated by the other: same outcome, same change_pct, different segment name. The dedup requires matching `segment_metric_key`, so it treats them as distinct.

7 mirror pairs found:
- "Wider Glucose Range on Low Sleep Days" / "...on Low Time in Bed Days" (+37%)
- "Higher Max Glucose on Low Sleep Days" / "...on Low Time in Bed Days" (+20.8%)
- "More Sedentary on Low Sleep Days" / "...on Low Time in Bed Days" (+16.3%)
- etc.

**Problem 3: Metric aliases produce identical discoveries**

`hrv` is literally `hrv_daily` for WHOOP users (the derived metric falls back). `steps` equals `intraday_steps` (daily sum of same data). Each discovery appears twice, once per metric variant.

All three fixes are changes to `areDuplicates()` and `deduplicatePatterns()` in `pattern-ranker.ts`. No pipeline or analyzer changes needed.

**Fix 1: Trend dedup by metric_key + direction (ignore change_pct)**

For trend candidates, the current rule `metric_key + segment_metric_key + change_pct ±5%` is too narrow. The `change_pct` of a trend varies wildly as the lookback window shifts.

New rule for trends: Two trend candidates are duplicates if:
- Same `metric_key` (or aliases; see Fix 3)
- Same `direction` (both increasing or both decreasing)
- `change_pct` is ignored for trends

This also applies to existing discovery dedup: if an active discovery with the same metric_key and direction already exists, the new trend is a duplicate regardless of change_pct.
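
The trend rule can be illustrated with a minimal sketch. The `TrendCandidate` shape, the function name `isDuplicateTrend`, and the `workout_frequency` metric key are assumptions made for illustration and are not taken from `pattern-ranker.ts`; alias handling (Fix 3) is left out here.

```typescript
// Hypothetical candidate shape; the real type in pattern-ranker.ts may differ.
interface TrendCandidate {
  metric_key: string;
  direction: "increasing" | "decreasing";
  change_pct: number;
}

// Two trends count as duplicates when metric and direction match.
// change_pct is deliberately not compared.
function isDuplicateTrend(a: TrendCandidate, b: TrendCandidate): boolean {
  return a.metric_key === b.metric_key && a.direction === b.direction;
}

// Same trend seen on two days with very different change_pct values.
const day1: TrendCandidate = {
  metric_key: "workout_frequency", // hypothetical key
  direction: "increasing",
  change_pct: 79,
};
const day2: TrendCandidate = {
  metric_key: "workout_frequency",
  direction: "increasing",
  change_pct: 2700,
};
// Same metric, opposite direction: a genuinely different trend.
const reversed: TrendCandidate = {
  metric_key: "workout_frequency",
  direction: "decreasing",
  change_pct: -40,
};

console.log(isDuplicateTrend(day1, day2)); // true
console.log(isDuplicateTrend(day1, reversed)); // false
```

Under this rule a reversal of the same metric still surfaces as a new discovery, since only the direction distinguishes it.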

**Fix 2: Outcome-based dedup for segment comparisons**

Two segment comparison candidates are duplicates if:
- Same `metric_key` (outcome)
- `change_pct` within ±10 percentage points (wider tolerance for cross-segment dedup)
- `segment_metric_key` can differ (this is the key relaxation)

This catches the sleep_duration/time_in_bed mirrors. When "Low Sleep Days → glucose_range +37%" and "Low Time in Bed Days → glucose_range +37%" both survive BH, the second is deduped because the outcome and change match.

The wider ±10 percentage point tolerance (vs the current ±5) accounts for slight differences when two correlated segment metrics don't segment days exactly the same way.
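
A minimal sketch of the outcome-based rule follows. The `SegmentCandidate` shape, the constant name, and the function name are assumptions for illustration; the tolerance is treated as an absolute difference in percentage points.

```typescript
// Hypothetical shape for a segment-comparison candidate.
interface SegmentCandidate {
  metric_key: string; // outcome metric
  segment_metric_key: string; // metric used to split days into segments
  change_pct: number;
}

// Tolerance in percentage points for cross-segment dedup.
const CROSS_SEGMENT_TOLERANCE_PP = 10;

function isDuplicateSegmentComparison(
  a: SegmentCandidate,
  b: SegmentCandidate,
): boolean {
  if (a.metric_key !== b.metric_key) return false;
  // segment_metric_key is intentionally not compared.
  return Math.abs(a.change_pct - b.change_pct) <= CROSS_SEGMENT_TOLERANCE_PP;
}

const lowSleep: SegmentCandidate = {
  metric_key: "glucose_range",
  segment_metric_key: "sleep_duration",
  change_pct: 37,
};
const lowTimeInBed: SegmentCandidate = {
  metric_key: "glucose_range",
  segment_metric_key: "time_in_bed",
  change_pct: 37,
};
// Same outcome, but a clearly different effect size.
const weakerEffect: SegmentCandidate = {
  metric_key: "glucose_range",
  segment_metric_key: "steps",
  change_pct: 12,
};

console.log(isDuplicateSegmentComparison(lowSleep, lowTimeInBed)); // true
console.log(isDuplicateSegmentComparison(lowSleep, weakerEffect)); // false
```

One consequence of dropping the segment comparison is that any two segment metrics producing a similar change for the same outcome are merged, not only definitionally coupled ones.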

**Fix 3: Metric alias groups**

Define alias sets where metrics produce identical or near-identical values for a given user:

```typescript
// ... (start of the function is outside this diff hunk)
    if (Math.abs(a.change_pct - b.change_pct) <= threshold) {
      return true;
    }
  }

  return false;
}
```
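
Because the alias definition itself is not visible in this hunk, the following is only a sketch of how alias resolution could work. The three alias pairs come from the text; the lookup table name and the choice of which key in each pair is canonical are assumptions, and the real `canonicalKey()` may be implemented differently.

```typescript
// Maps each alias to the key chosen as canonical for its group.
// Which member of a pair is canonical is an assumption of this sketch.
const CANONICAL_BY_ALIAS = new Map<string, string>([
  ["hrv_daily", "hrv"],
  ["intraday_steps", "steps"],
  ["time_in_bed", "sleep_duration"],
]);

// Returns the canonical key for a metric, or the key itself if it has no alias.
function canonicalKey(metricKey: string): string {
  return CANONICAL_BY_ALIAS.get(metricKey) ?? metricKey;
}

console.log(canonicalKey("hrv_daily")); // "hrv"
console.log(canonicalKey("hrv")); // "hrv"
console.log(canonicalKey("glucose_range")); // "glucose_range"
```

Comparing `canonicalKey(a.metric_key) === canonicalKey(b.metric_key)` instead of the raw keys would then let the other dedup rules treat aliased metrics as equal.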

#### 8.5.4 Existing Discovery Dedup Enhancement

The existing discovery dedup also needs to understand aliases and trend direction. The `existingForDedup` data currently only carries `metric_key` and `change_pct`. To support trend dedup, it needs the `pattern_type` and `direction` (if trend) from `metrics_impact`.

Update the existing discovery query to include `pattern_type` from metrics_impact:

```typescript
const existingForDedup = allExisting.map(d => ({
  metrics_impact: d.metrics_impact as Array<{
    metric_key: string;
    change_pct: number;
    pattern_type?: string;
  }> | null,
  discovery_type: d.discovery_type,
  title: d.title, // title contains direction hint for trends
}));
```

For trend matching against existing: if the existing discovery's `pattern_type === 'trend'` and the canonical metric_key matches, treat as duplicate regardless of change_pct.
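
The matching rule against existing discoveries might look like the sketch below. The interface names, the function `trendMatchesExisting`, and the injected `canonical` callback are hypothetical; the check follows the rule as stated (pattern type plus canonical metric key) and does not parse the direction hint from the title.

```typescript
interface ExistingImpact {
  metric_key: string;
  change_pct: number;
  pattern_type?: string;
}

interface ExistingDiscovery {
  metrics_impact: ExistingImpact[] | null;
  discovery_type: string;
  title: string;
}

// Returns true if any existing trend discovery shares the candidate's
// canonical metric key. change_pct is not compared.
function trendMatchesExisting(
  candidateMetricKey: string,
  existing: ExistingDiscovery[],
  canonical: (key: string) => string, // alias resolver, passed in for this sketch
): boolean {
  const target = canonical(candidateMetricKey);
  return existing.some(d =>
    (d.metrics_impact ?? []).some(
      m => m.pattern_type === "trend" && canonical(m.metric_key) === target,
    ),
  );
}

const existingSample: ExistingDiscovery[] = [
  {
    metrics_impact: [
      { metric_key: "hrv_daily", change_pct: -12, pattern_type: "trend" },
    ],
    discovery_type: "observation",
    title: "HRV Decreasing",
  },
  { metrics_impact: null, discovery_type: "observation", title: "No impact data" },
];

// Stand-in alias resolver covering only the hrv pair.
const toCanonical = (key: string): string => (key === "hrv_daily" ? "hrv" : key);

console.log(trendMatchesExisting("hrv", existingSample, toCanonical)); // true
console.log(trendMatchesExisting("steps", existingSample, toCanonical)); // false
```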
After deploying the fix, existing duplicate discoveries need to be cleaned. Two approaches:

**Option A: Delete all and re-run**
```sql
DELETE FROM user_discoveries
WHERE discovery_type IN ('unenrolled_pattern', 'observation')
  AND status IN ('new', 'viewed');
-- Then invoke spot-patterns-cron to regenerate
```

**Option B: Keep highest-ranked of each duplicate set** (preserves viewed status)
More complex: requires a script to identify duplicate groups and delete all but the best.

Recommendation: Option A (delete + re-run). The AI narration will regenerate fresh text, and the new dedup logic will prevent duplicates from returning.
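
If Option B were chosen instead, its grouping step could be sketched roughly as follows. The `StoredDiscovery` shape, the `dedupKey` field (for example a canonical metric key plus direction), and the `rankScore` field are all hypothetical; the actual table columns and ranking signal are not specified here.

```typescript
interface StoredDiscovery {
  id: string;
  dedupKey: string; // hypothetical: identifies a duplicate group
  rankScore: number; // hypothetical: higher means better ranked
}

// Returns the ids to delete: every row except the best-ranked one per dedupKey.
// On ties, the row encountered first is kept.
function idsToDelete(rows: StoredDiscovery[]): string[] {
  const best = new Map<string, StoredDiscovery>();
  for (const row of rows) {
    const current = best.get(row.dedupKey);
    if (current === undefined || row.rankScore > current.rankScore) {
      best.set(row.dedupKey, row);
    }
  }
  const keep = new Set<string>();
  best.forEach(row => keep.add(row.id));
  return rows.filter(row => !keep.has(row.id)).map(row => row.id);
}

const sampleRows: StoredDiscovery[] = [
  { id: "a1", dedupKey: "hrv|decreasing", rankScore: 0.4 },
  { id: "a2", dedupKey: "hrv|decreasing", rankScore: 0.9 },
  { id: "b1", dedupKey: "steps|increasing", rankScore: 0.5 },
];

console.log(idsToDelete(sampleRows)); // [ 'a1' ]
```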

#### 8.5.7 Implementation Scope

**Files to modify:**
- `_shared/pattern-ranker.ts`: rewrite `areDuplicates()`, add alias groups, update existing dedup
- `_shared/pattern-ranker.test.ts`: new tests for trend dedup, outcome-based dedup, alias groups
- `ai-engine/engines/pattern-spotter.ts`: update `existingForDedup` to include pattern_type

**Files unchanged:**
- All analyzers, metric-discovery, blacklist, BH families (no changes needed)

**Status: DONE (2026-03-23)**
- `areDuplicates()` rewritten with three rules: trend direction dedup, outcome-based cross-segment dedup, metric aliases
- `canonicalKey()` exported for alias resolution (hrv↔hrv_daily, steps↔intraday_steps, sleep_duration↔time_in_bed)