Commit 5d364f2

schwaaamp and claude committed
BH correction per family + blacklist waves 1-2 (212 entries)
Split BH correction into three independent families (same-day, lagged, trends) to fix the hitchhiker effect where removing tautological pairs caused genuine cross-domain signals to fail BH.

Includes blacklist wave 1 (112 entries) and wave 2 (76 entries) covering activity cluster, intraday HR, HRV-recovery, and missing glucose/sleep pairs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5660443 commit 5d364f2

5 files changed: +489 -19 lines changed

docs/planning/insight-engine-v3.md

Lines changed: 105 additions & 4 deletions
@@ -803,7 +803,11 @@ function chiSquaredUniformity(observed: number[]): ChiSquaredResult | null {
803803
e. Excursion clustering (NEW — intraday)
804804
f. Sequential change detection (NEW — intraday)
805805
g. Stability trend analysis (NEW — daily + intraday)
806-
7. CORRECT → UNCHANGED (BH FDR correction across ALL tests)
806+
7. CORRECT → BH FDR correction PER FAMILY (see Section 7.5)
807+
a. Same-day family: segment comparisons + intraday candidates (α=0.10)
808+
b. Lagged family: lagged effect candidates (α=0.10)
809+
c. Trend family: trend candidates (α=0.10)
810+
Merge all survivors → single ranked list for steps 8-12
807811
8. RANK → UNCHANGED (composite score: |d| × -log₁₀(p))
808812
9. FILTER → UNCHANGED (dedup + classify)
809813
10. NARRATE → EXPANDED (new templates for 4 new candidate types)
@@ -912,7 +916,104 @@ interface IntradayDataBundle {
912916

913917
These arrays are passed to the intraday analyzers in step 6d-6g. They are NOT passed to the daily analyzers (6a-6c) which continue to use DailyMetricRow[].
914918

915-
### 7.5 Comprehensive Diagnostics & Failure Isolation
919+
### 7.5 BH Correction Families (Step 7)
920+
921+
#### 7.5.1 The Problem: Hitchhiker Effect
922+
923+
v2 ran BH correction across ALL candidates in a single pool. This worked because the pool was dominated by tautological pairs (steps↔distance, workout_duration↔workout_calories) with astronomically low p-values (p ≈ 10⁻⁸). These trivially significant pairs anchored the top of the BH ranking, creating a high watermark that let weaker cross-domain signals (activity→sleep, glucose→behavior) pass at lower ranks.
924+
925+
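When the blacklist removes the tautological pairs, the remaining candidates are all genuine cross-domain signals with higher p-values (p ≈ 0.001-0.05). In a single pool of 700-900 tests, BH's threshold at rank k is `k × 0.10/900`, which is `≈ 0.000111` at rank 1 and only `≈ 0.0022` at rank 20. BH is a step-up procedure: it finds the largest rank whose p-value is at or below its threshold and passes everything up to that rank. With no very small p-values left to fill the top ranks, no rank clears its threshold and nothing passes, even if there are 20 genuinely interesting findings with p < 0.01.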
926+
927+
**The cross-domain findings didn't get weaker. They lost the tautological bodyguards that were inflating the BH watermark.** This is a known statistical phenomenon: removing true positives from a BH pool can cause previously-passing weaker signals to fail, because each candidate's threshold depends on its rank. Removing the strong results ranked above a candidate lowers its rank, and therefore its threshold, by far more than the smaller total test count raises it.
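
The effect can be reproduced in a few lines. The sketch below is illustrative only: `bhRejections` is a hypothetical stand-alone helper (not the engine's `benjaminiHochberg()`), and the pool sizes and p-values are made-up numbers chosen to mirror the scenario described above.

```typescript
// Hypothetical helper: Benjamini-Hochberg step-up procedure.
// Returns how many of the smallest p-values are declared significant.
function bhRejections(pValues: number[], alpha: number): number {
  const sorted = [...pValues].sort((a, b) => a - b);
  const m = sorted.length;
  let rejected = 0;
  for (let i = 0; i < m; i++) {
    // Rank k = i + 1 passes if p_(k) <= (k / m) * alpha; keep the largest such k.
    if (sorted[i] <= ((i + 1) / m) * alpha) rejected = i + 1;
  }
  return rejected;
}

// Made-up pools (assumed numbers, for illustration only).
const tautological: number[] = Array(100).fill(1e-8); // steps/distance style pairs
const crossDomain: number[] = Array(20).fill(0.004);  // genuine but weaker signals
const noise: number[] = Array(780).fill(0.5);         // uninteresting candidates

// Single pool before the blacklist: 900 tests.
const withBodyguards = bhRejections([...tautological, ...crossDomain, ...noise], 0.10);
// Single pool after the blacklist removed the tautological pairs: 800 tests.
const afterBlacklist = bhRejections([...crossDomain, ...noise], 0.10);

console.log(withBodyguards - tautological.length); // cross-domain survivors before
console.log(afterBlacklist);                       // cross-domain survivors after
```

With these numbers the 20 cross-domain candidates pass at rank 120 in the 900-test pool (threshold 120/900 × 0.10 ≈ 0.0133) but fail at rank 20 in the 800-test pool (threshold 0.0025), although their p-values are unchanged.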
928+
929+
#### 7.5.2 The Solution: Three BH Families
930+
931+
Run BH correction independently on three candidate families:
932+
933+
| Family | What it tests | Typical size |
934+
|---|---|---|
935+
| **Same-day** | Segment comparisons + intraday analyzers (steps 6a, 6d-6g) | 300-500 candidates |
936+
| **Lagged** | Day N behavior → Day N+1 outcome (step 6b) | 150-300 candidates |
937+
| **Trends** | Earlier-half vs recent-half directional shifts (step 6c) | 20-50 candidates |
938+
939+
Each family runs BH at α=0.10 independently. A cross-domain same-day finding with p=0.005 in a pool of 400 needs to beat threshold `k/400 × 0.10`, which at the same rank k is 2.25x more lenient than `k/900 × 0.10` in the combined pool. A trend with p=0.01 in a pool of 30 passes at `k/30 × 0.10` from rank 3 upward.
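
The per-family thresholds quoted above follow directly from the BH critical value k/m × α. A minimal sketch, where `bhCriticalValue` and the chosen rank are illustrative assumptions rather than engine code:

```typescript
// BH critical value for rank k in a family of m tests (hypothetical helper).
function bhCriticalValue(rank: number, familySize: number, alpha: number): number {
  return (rank / familySize) * alpha;
}

const ALPHA = 0.10;
const rank = 20; // assumed rank, for illustration only

const combined = bhCriticalValue(rank, 900, ALPHA); // about 0.00222
const sameDay = bhCriticalValue(rank, 400, ALPHA);  // about 0.005
const leniency = sameDay / combined;                // about 2.25 at any fixed rank

// In a family of 30, the critical value reaches about 0.01 at rank 3.
const trendAtRank3 = bhCriticalValue(3, 30, ALPHA);

console.log({ combined, sameDay, leniency, trendAtRank3 });
```

The ratio between two families' thresholds at a fixed rank is just the ratio of their sizes, so it does not depend on which rank is chosen.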
940+
941+
#### 7.5.3 Why Three Families Is Principled (Not P-Hacking)
942+
943+
The families correspond to **fundamentally different experimental designs**:
944+
945+
- **Same-day**: "On days when X is high, is Y also high?" — tests contemporaneous associations
946+
- **Lagged**: "After days when X is high, is Y different the next day?" — tests delayed effects across a day boundary
947+
- **Trends**: "Is X shifting over weeks?" — tests directional change over time, no segmentation involved
948+
949+
These are different kinds of hypotheses. The statistical strength of "your bedtime is trending earlier" should not depend on how many same-day activity-vs-sleep pairs were tested — they're unrelated questions. Forcing them into the same BH pool penalizes one for the other's noise.
950+
951+
The analogy: you wouldn't grade a math exam and an English essay on the same curve.
952+
953+
**The line we don't cross**: Splitting further (same-day-glucose, same-day-sleep, same-day-activity...) would create many tiny families where everything passes. That's gaming the math. Three families based on experimental design is defensible. Twenty families based on metric domain is not.
954+
955+
#### 7.5.4 Scaling as Data Grows
956+
957+
When new data sources are added (Garmin, Apple Watch, Dexcom), each family grows. More segment metrics × more outcome metrics = more candidates per family. BH gets stricter within each family proportionally.
958+
959+
This is fine because:
960+
- More data per metric → stronger statistical power → lower p-values for genuine signals
961+
- BH strictness and signal strength scale together
962+
963+
**When a single family gets too large (1000+ candidates)**, the correct response is:
964+
1. **Blacklist maintenance** — review the family's candidates for new tautological pairs that slipped through
965+
2. **Effect size thresholds** — raise MIN_EFFECT_SIZE for that family to filter weak candidates before BH
966+
3. **NOT more family splits** — splitting families is a one-time architectural decision, not an ongoing tuning knob
967+
968+
#### 7.5.5 Implementation
969+
970+
**Pipeline step 7 changes from:**
971+
```
972+
7. CORRECT → BH FDR correction across ALL tests (single pool)
973+
```
974+
975+
**To:**
976+
```
977+
7. CORRECT → BH FDR correction per family
978+
a. Same-day family: segment comparisons + intraday analyzer candidates (α=0.10)
979+
b. Lagged family: lagged effect candidates (α=0.10)
980+
c. Trend family: trend candidates (α=0.10)
981+
Merge all BH survivors into a single ranked list for steps 8-12.
982+
```
983+
984+
Each candidate already carries a `type` field (`segment_comparison`, `lagged_effect`, `trend`, `temporal_distribution`, `excursion_cluster`, `sequential_change`, `stability_trend`). The family assignment:
985+
986+
| Candidate type | BH Family |
987+
|---|---|
988+
| `segment_comparison` | Same-day |
989+
| `temporal_distribution` | Same-day |
990+
| `excursion_cluster` | Same-day |
991+
| `sequential_change` | Same-day |
992+
| `stability_trend` | Same-day |
993+
| `lagged_effect` | Lagged |
994+
| `trend` | Trends |
995+
996+
After BH correction per family, survivors from all three families are merged, ranked by composite score, and flow through steps 8-12 (RANK → FILTER → NARRATE → PERSIST → LOG) unchanged.
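
One way to express the family assignment table above in code is an exhaustive lookup keyed by candidate type. This is a sketch under assumptions: `CandidateType`, `BhFamily`, `FAMILY_BY_TYPE`, and `groupByFamily` are hypothetical names, not identifiers taken from `pattern-spotter.ts`.

```typescript
// Candidate types as listed in this section (type names from the doc;
// the union itself is an assumed declaration).
type CandidateType =
  | 'segment_comparison'
  | 'lagged_effect'
  | 'trend'
  | 'temporal_distribution'
  | 'excursion_cluster'
  | 'sequential_change'
  | 'stability_trend';

type BhFamily = 'same_day' | 'lagged' | 'trends';

// Exhaustive mapping: every candidate type must name its family.
const FAMILY_BY_TYPE: Record<CandidateType, BhFamily> = {
  segment_comparison: 'same_day',
  temporal_distribution: 'same_day',
  excursion_cluster: 'same_day',
  sequential_change: 'same_day',
  stability_trend: 'same_day',
  lagged_effect: 'lagged',
  trend: 'trends',
};

// Split candidates into the three pools that BH is run on independently.
function groupByFamily<T extends { type: CandidateType }>(
  candidates: T[],
): Record<BhFamily, T[]> {
  const groups: Record<BhFamily, T[]> = { same_day: [], lagged: [], trends: [] };
  for (const candidate of candidates) {
    groups[FAMILY_BY_TYPE[candidate.type]].push(candidate);
  }
  return groups;
}
```

Because `Record<CandidateType, BhFamily>` must list every member of the union, adding a new candidate type without assigning it a family becomes a compile-time error rather than a silent default.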
997+
998+
**Diagnostics update:**
999+
```json
1000+
{
1001+
"correct": {
1002+
"alpha": 0.10,
1003+
"families": {
1004+
"same_day": { "candidates": 412, "passed_bh": 23 },
1005+
"lagged": { "candidates": 287, "passed_bh": 8 },
1006+
"trends": { "candidates": 31, "passed_bh": 5 }
1007+
},
1008+
"total_candidates": 730,
1009+
"total_passed_bh": 36
1010+
}
1011+
}
1012+
```
1013+
1014+
**Code change scope**: ~20 lines in `pattern-spotter.ts` step 7. Split `allCandidates` by type into three arrays, call `benjaminiHochberg()` three times, merge the survivors. No changes to any other module.
1015+
1016+
### 7.6 Comprehensive Diagnostics & Failure Isolation
9161017

9171018
v2 logs diagnostics at each pipeline step, but v3 expands this significantly. The guiding principle: **every decision the pipeline makes should be traceable** — what was tested, what was skipped, and why.
9181019

@@ -1344,14 +1445,14 @@ metrics_impact: [{
13441445
- Phase B: DONE — All 4 sync functions extract timezone (WHOOP offset in vendor_metadata + local activity_date, Fitbit IANA from profile API, Libre derived offset, Oura from Personal Info API)
13451446
- Phase C: DONE — `daily-aggregation.ts` uses `LocalTimeExtractors` closure factory; glucose uses `display_time`
13461447
- Phase D: DONE — UTC fallback, Pacific, DST, Tokyo, Kolkata tests (28 aggregation tests pass)
1347-
- Phase E: NOT DONE — Backfill script for existing users without timezone, pipeline diagnostics timezone section
1448+
- Phase E: UNNECESSARY — Mobile app auto-pushes timezone on launch; UTC fallback works for users who haven't opened the app
13481449

13491450
**Phase 1: Foundation** — DONE (41 + 39 = 80 tests)
13501451
- `statistical-tests.ts`: Wilcoxon signed-rank, chi-squared uniformity, Spearman ρ, `computeRanks` (20 new tests, 41 total)
13511452
- `metric-type-catalog.ts`: 17 type patterns, label generation, `classifyMetric`/`getLabel`/`classifyWithLabel` (39 tests)
13521453
- Migration `20260320000000_create_insight_engine_blacklist.sql`: table + 24 seed entries
13531454

1354-
**Phase 2: Blacklist seeding**NOT DONE
1455+
**Phase 2: Blacklist seeding** — DONE (2026-03-18)
13551456
- One-time Spearman correlation analysis script needed
13561457
- Run against active users, review pairs with median ρ > 0.80
13571458
- Insert confirmed tautological pairs into `insight_engine_blacklist`

supabase/functions/ai-engine/engines/pattern-spotter.test.ts

Lines changed: 24 additions & 6 deletions
@@ -392,20 +392,38 @@ describe('Pattern Spotter v2 — Full Pipeline', () => {
   // BH correction
   // =========================================================================

-  it('applies Benjamini-Hochberg correction and logs diagnostics', async () => {
+  it('applies BH correction per family (same-day, lagged, trends) and logs diagnostics', async () => {
     const data = generateWhoopData(60);
     mockDailySummaries = data.dailySummaries;
     mockSleepSessions = data.sleepSessions;

     const result = await spotPatterns({ lookback_days: 90 }, 'user-1', mockProvider);
     const diag = result.diagnostics as Record<string, unknown>;

-    // Should have correction diagnostics
-    if (diag.correct) {
-      const correct = diag.correct as Record<string, number>;
-      expect(correct.alpha).toBe(0.10);
-      expect(correct.passed_bh).toBeLessThanOrEqual(correct.total_tests);
+    // Should have correction diagnostics with family breakdown
+    expect(diag.correct).toBeDefined();
+    const correct = diag.correct as Record<string, unknown>;
+    expect(correct.alpha).toBe(0.10);
+
+    // Must have per-family results
+    const families = correct.families as Record<string, Record<string, number>>;
+    expect(families).toBeDefined();
+    expect(families.same_day).toBeDefined();
+    expect(families.lagged).toBeDefined();
+    expect(families.trends).toBeDefined();
+
+    // Each family reports candidates and passed_bh
+    for (const family of Object.values(families)) {
+      expect(typeof family.candidates).toBe('number');
+      expect(typeof family.passed_bh).toBe('number');
+      expect(family.passed_bh).toBeLessThanOrEqual(family.candidates);
     }
+
+    // Total is sum of families
+    const totalCandidates = Object.values(families).reduce((s, f) => s + f.candidates, 0);
+    const totalPassedBh = Object.values(families).reduce((s, f) => s + f.passed_bh, 0);
+    expect(correct.total_candidates).toBe(totalCandidates);
+    expect(correct.total_passed_bh).toBe(totalPassedBh);
   });

   // =========================================================================

supabase/functions/ai-engine/engines/pattern-spotter.ts

Lines changed: 38 additions & 9 deletions
@@ -803,21 +803,50 @@ export async function spotPatterns(
   }

   // =========================================================================
-  // 7. CORRECT — Benjamini-Hochberg FDR correction
+  // 7. CORRECT — Benjamini-Hochberg FDR correction per family
+  //
+  // Three families based on experimental design:
+  //   - Same-day: segment comparisons + intraday analyzers
+  //   - Lagged: day N behavior → day N+1 outcome
+  //   - Trends: directional shifts over weeks
+  //
+  // Running BH per family prevents cross-domain signals from being drowned
+  // by the test count of unrelated hypothesis families.
   // =========================================================================

-  const bhResults = benjaminiHochberg(allCandidates, 0.10);
-  const bhSignificant = bhResults.filter(c => c.bh_significant);
+  const LAGGED_TYPES = new Set(['lagged_effect']);
+  const TREND_TYPES = new Set(['trend']);
+  // Everything else (segment_comparison, temporal_distribution, excursion_cluster,
+  // sequential_change, stability_trend) is same-day family.
+
+  const sameDayCandidates = allCandidates.filter(c => !LAGGED_TYPES.has(c.type) && !TREND_TYPES.has(c.type));
+  const laggedCandidates = allCandidates.filter(c => LAGGED_TYPES.has(c.type));
+  const trendCandidates = allCandidates.filter(c => TREND_TYPES.has(c.type));
+
+  const BH_ALPHA = 0.10;
+
+  const sameDayBh = benjaminiHochberg(sameDayCandidates, BH_ALPHA);
+  const laggedBh = benjaminiHochberg(laggedCandidates, BH_ALPHA);
+  const trendBh = benjaminiHochberg(trendCandidates, BH_ALPHA);
+
+  const sameDaySurvivors = sameDayBh.filter(c => c.bh_significant);
+  const laggedSurvivors = laggedBh.filter(c => c.bh_significant);
+  const trendSurvivors = trendBh.filter(c => c.bh_significant);
+
+  const bhSignificant = [...sameDaySurvivors, ...laggedSurvivors, ...trendSurvivors];

   diagnostics.correct = {
-    alpha: 0.10,
-    total_tests: allCandidates.length,
-    passed_raw: allCandidates.length,
-    passed_bh: bhSignificant.length,
-    false_positives_removed: allCandidates.length - bhSignificant.length,
+    alpha: BH_ALPHA,
+    families: {
+      same_day: { candidates: sameDayCandidates.length, passed_bh: sameDaySurvivors.length },
+      lagged: { candidates: laggedCandidates.length, passed_bh: laggedSurvivors.length },
+      trends: { candidates: trendCandidates.length, passed_bh: trendSurvivors.length },
+    },
+    total_candidates: allCandidates.length,
+    total_passed_bh: bhSignificant.length,
   };

-  console.log(`[PatternSpotter] BH correction applied`, JSON.stringify(diagnostics.correct));
+  console.log(`[PatternSpotter] BH correction applied (per family)`, JSON.stringify(diagnostics.correct));

   // Map BH results back to PatternCandidate (strip BH fields for ranking)
   const bhCandidates: PatternCandidate[] = bhSignificant.map(c => ({
