schwaaamp
diff --git a/docs/planning/insight-engine-v3.md b/docs/planning/insight-engine-v3.md
Lines changed: 105 additions & 4 deletions
diff --git a/supabase/functions/ai-engine/engines/pattern-spotter.test.ts b/supabase/functions/ai-engine/engines/pattern-spotter.test.ts
Lines changed: 24 additions & 6 deletions
diff --git a/supabase/functions/ai-engine/engines/pattern-spotter.ts b/supabase/functions/ai-engine/engines/pattern-spotter.ts
Lines changed: 38 additions & 9 deletions
@@ -803,7 +803,11 @@ function chiSquaredUniformity(observed: number[]): ChiSquaredResult | null {
                    e. Excursion clustering (NEW — intraday)
                    f. Sequential change detection (NEW — intraday)
                    g. Stability trend analysis (NEW — daily + intraday)
- 7. CORRECT     → UNCHANGED (BH FDR correction across ALL tests)
+ 7. CORRECT     → BH FDR correction PER FAMILY (see Section 7.5)
+                   a. Same-day family: segment comparisons + intraday candidates (α=0.10)
+                   b. Lagged family: lagged effect candidates (α=0.10)
+                   c. Trend family: trend candidates (α=0.10)
+                   Merge all survivors → single ranked list for steps 8-12
  8. RANK        → UNCHANGED (composite score: |d| × -log₁₀(p))
  9. FILTER      → UNCHANGED (dedup + classify)
 10. NARRATE     → EXPANDED (new templates for 4 new candidate types)
@@ -912,7 +916,104 @@ interface IntradayDataBundle {
 
 These arrays are passed to the intraday analyzers in step 6d-6g. They are NOT passed to the daily analyzers (6a-6c) which continue to use DailyMetricRow[].
 
-### 7.5 Comprehensive Diagnostics & Failure Isolation
+### 7.5 BH Correction Families (Step 7)
+
+#### 7.5.1 The Problem: Hitchhiker Effect
+
+v2 ran BH correction across ALL candidates in a single pool. This worked because the pool was dominated by tautological pairs (steps↔distance, workout_duration↔workout_calories) with astronomically low p-values (p ≈ 10⁻⁸). These trivially significant pairs anchored the top of the BH ranking, creating a high watermark that let weaker cross-domain signals (activity→sleep, glucose→behavior) pass at lower ranks.
+
+When the blacklist removes the tautological pairs, the remaining candidates are all genuine cross-domain signals with higher p-values (p ≈ 0.001-0.05). BH is a step-up procedure: it finds the largest rank k whose p-value is at most `k/m × α` and passes every candidate at or below that rank. In a single pool of 700-900 tests, the threshold is `0.10/900 ≈ 0.000111` at rank 1 and only `20 × 0.10/900 ≈ 0.0022` at rank 20. If no rank meets its threshold, nothing passes, even if there are 20 genuinely interesting findings with p < 0.01.
+
+**The cross-domain findings didn't get weaker. They lost the tautological bodyguards that were inflating the BH watermark.** This is a known property of BH: removing true positives with very small p-values from a pool can cause previously passing weaker signals to fail, because each threshold scales with rank. Once the tautological pairs no longer occupy the top ranks, every remaining candidate moves to a lower rank and faces a stricter threshold.
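
The hitchhiker effect can be reproduced with a few lines of arithmetic. The sketch below is a minimal stand-in for the BH step-up rule, not the project's `benjaminiHochberg()` (whose implementation is not shown in this diff). The name `bhPassCount` and all p-values are assumptions, made up for illustration.

```typescript
// Hypothetical stand-in for the BH step-up rule; only p-values are modelled.
function bhPassCount(pValues: number[], alpha: number): number {
  const sorted = [...pValues].sort((a, b) => a - b);
  const m = sorted.length;
  let passed = 0;
  // Step-up: find the largest rank k with p(k) <= (k / m) * alpha.
  // Every candidate at rank <= k is then declared significant.
  for (let k = 1; k <= m; k++) {
    if (sorted[k - 1] <= (k / m) * alpha) passed = k;
  }
  return passed;
}

// Illustrative pool (made-up numbers): 30 tautological pairs at p = 1e-8,
// 20 cross-domain findings at p = 0.004, and null results at p = 0.5.
const tautological = new Array<number>(30).fill(1e-8);
const crossDomain = new Array<number>(20).fill(0.004);
const nullsCombined = new Array<number>(850).fill(0.5);
const nullsSameDay = new Array<number>(380).fill(0.5);

// Single pool of 900 with the tautological pairs still present.
const withBodyguards = bhPassCount([...tautological, ...crossDomain, ...nullsCombined], 0.10);
// Single pool of 870 after the blacklist removes them.
const withoutBodyguards = bhPassCount([...crossDomain, ...nullsCombined], 0.10);
// The same 20 findings inside a same-day family of 400.
const perFamily = bhPassCount([...crossDomain, ...nullsSameDay], 0.10);

console.log({ withBodyguards, withoutBodyguards, perFamily });
```

With these numbers the 20 cross-domain findings pass only while the tautological pairs hold the top 30 ranks (rank 50 has threshold `50/900 × 0.10 ≈ 0.0056`). After the blacklist they sit at ranks 1-20 and fail against `20/870 × 0.10 ≈ 0.0023`. In a family of 400 they pass again at `20/400 × 0.10 = 0.005`.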
+
+#### 7.5.2 The Solution: Three BH Families
+
+Run BH correction independently on three candidate families:
+
+| Family | What it tests | Typical size |
+|---|---|---|
+| **Same-day** | Segment comparisons + intraday analyzers (steps 6a, 6d-6g) | 300-500 candidates |
+| **Lagged** | Day N behavior → Day N+1 outcome (step 6b) | 150-300 candidates |
+| **Trends** | Earlier-half vs recent-half directional shifts (step 6c) | 20-50 candidates |
+
+Each family runs BH at α=0.10 independently. A cross-domain same-day finding with p=0.005 in a pool of 400 needs to beat the rank-k threshold `k/400 × 0.10`, which is 2.25x more lenient than `k/900 × 0.10` at the same rank in the combined pool. A trend with p=0.01 in a pool of 30 meets `k/30 × 0.10` from rank 3 upward, where the combined pool of 900 would require rank 90.
+
+#### 7.5.3 Why Three Families Is Principled (Not P-Hacking)
+
+The families correspond to **fundamentally different experimental designs**:
+
+- **Same-day**: "On days when X is high, is Y also high?" — tests contemporaneous associations
+- **Lagged**: "After days when X is high, is Y different the next day?" — tests delayed effects across a day boundary
+- **Trends**: "Is X shifting over weeks?" — tests directional change over time, no segmentation involved
+
+These are different kinds of hypotheses. The statistical strength of "your bedtime is trending earlier" should not depend on how many same-day activity-vs-sleep pairs were tested — they're unrelated questions. Forcing them into the same BH pool penalizes one for the other's noise.
+
+The analogy: you wouldn't grade a math exam and an English essay on the same curve.
+
+**The line we don't cross**: Splitting further (same-day-glucose, same-day-sleep, same-day-activity...) would create many tiny families with thresholds so lenient that nearly everything passes. That's gaming the math. Three families based on experimental design is defensible. Twenty families based on metric domain is not.
+
+#### 7.5.4 Scaling as Data Grows
+
+When new data sources are added (Garmin, Apple Watch, Dexcom), each family grows. More segment metrics × more outcome metrics = more candidates per family. BH gets stricter within each family proportionally.
+
+This is fine because:
+- More data per metric → stronger statistical power → lower p-values for genuine signals
+- BH strictness and signal strength scale together
+
+**When a single family gets too large (1000+ candidates)**, the correct response is:
+1. **Blacklist maintenance** — review the family's candidates for new tautological pairs that slipped through
+2. **Effect size thresholds** — raise MIN_EFFECT_SIZE for that family to filter weak candidates before BH
+3. **NOT more family splits** — splitting families is a one-time architectural decision, not an ongoing tuning knob
+
+#### 7.5.5 Implementation
+
+**Pipeline step 7 changes from:**
+```
+7. CORRECT → BH FDR correction across ALL tests (single pool)
+```
+
+**To:**
+```
+7. CORRECT → BH FDR correction per family
+   a. Same-day family: segment comparisons + intraday analyzer candidates (α=0.10)
+   b. Lagged family: lagged effect candidates (α=0.10)
+   c. Trend family: trend candidates (α=0.10)
+   Merge all BH survivors into a single ranked list for steps 8-12.
+```
+
+Each candidate already carries a `type` field (`segment_comparison`, `lagged_effect`, `trend`, `temporal_distribution`, `excursion_cluster`, `sequential_change`, `stability_trend`). The family assignment:
+
+| Candidate type | BH Family |
+|---|---|
+| `segment_comparison` | Same-day |
+| `temporal_distribution` | Same-day |
+| `excursion_cluster` | Same-day |
+| `sequential_change` | Same-day |
+| `stability_trend` | Same-day |
+| `lagged_effect` | Lagged |
+| `trend` | Trends |
+
+After BH correction per family, survivors from all three families are merged, ranked by composite score, and flow through steps 8-12 (RANK → FILTER → NARRATE → PERSIST → LOG) unchanged.
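
One way to encode the table above is an explicit type-to-family map, sketched below under assumptions. The candidate type names come from the table. `CandidateType`, `BhFamily`, `FAMILY_BY_TYPE`, `DemoCandidate`, and `splitByFamily` are hypothetical names that do not appear in the PR. The point of the `Record` type is that a newly added candidate type fails to compile until someone assigns it a family, rather than defaulting to same-day.

```typescript
// Candidate type names are taken from the table; all other names are hypothetical.
type CandidateType =
  | 'segment_comparison'
  | 'temporal_distribution'
  | 'excursion_cluster'
  | 'sequential_change'
  | 'stability_trend'
  | 'lagged_effect'
  | 'trend';

type BhFamily = 'same_day' | 'lagged' | 'trends';

// Record<CandidateType, BhFamily> forces every candidate type to be listed.
const FAMILY_BY_TYPE: Record<CandidateType, BhFamily> = {
  segment_comparison: 'same_day',
  temporal_distribution: 'same_day',
  excursion_cluster: 'same_day',
  sequential_change: 'same_day',
  stability_trend: 'same_day',
  lagged_effect: 'lagged',
  trend: 'trends',
};

// Partition candidates into the three BH families in a single pass.
function splitByFamily<T extends { type: CandidateType }>(
  candidates: T[],
): Record<BhFamily, T[]> {
  const families: Record<BhFamily, T[]> = { same_day: [], lagged: [], trends: [] };
  for (const c of candidates) {
    families[FAMILY_BY_TYPE[c.type]].push(c);
  }
  return families;
}

// Minimal demo shape; the real PatternCandidate has more fields.
interface DemoCandidate {
  type: CandidateType;
  p: number;
}

const sample: DemoCandidate[] = [
  { type: 'trend', p: 0.01 },
  { type: 'lagged_effect', p: 0.02 },
  { type: 'excursion_cluster', p: 0.003 },
  { type: 'segment_comparison', p: 0.04 },
];

const split = splitByFamily(sample);
console.log(split.same_day.length, split.lagged.length, split.trends.length);
```

Each of the three arrays would then be passed to the BH correction separately, and the survivors concatenated before ranking.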
+
+**Diagnostics update:**
+```json
+{
+  "correct": {
+    "alpha": 0.10,
+    "families": {
+      "same_day": { "candidates": 412, "passed_bh": 23 },
+      "lagged": { "candidates": 287, "passed_bh": 8 },
+      "trends": { "candidates": 31, "passed_bh": 5 }
+    },
+    "total_candidates": 730,
+    "total_passed_bh": 36
+  }
+}
+```
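
The diagnostics shape above could also be given a type and a small consistency check, so callers do not need `Record<string, unknown>` casts. This is a sketch under assumptions: the field names mirror the JSON, while `FamilyDiagnostics`, `CorrectDiagnostics`, and `checkCorrectDiagnostics` are hypothetical names, not part of the PR.

```typescript
// Field names mirror the JSON example; interface and function names are hypothetical.
interface FamilyDiagnostics {
  candidates: number;
  passed_bh: number;
}

type FamilyName = 'same_day' | 'lagged' | 'trends';

interface CorrectDiagnostics {
  alpha: number;
  families: Record<FamilyName, FamilyDiagnostics>;
  total_candidates: number;
  total_passed_bh: number;
}

// Returns the list of violated invariants; an empty list means the block is consistent.
function checkCorrectDiagnostics(d: CorrectDiagnostics): string[] {
  const problems: string[] = [];
  let candidates = 0;
  let passed = 0;
  for (const [name, family] of Object.entries(d.families)) {
    if (family.passed_bh > family.candidates) {
      problems.push(`${name}: passed_bh exceeds candidates`);
    }
    candidates += family.candidates;
    passed += family.passed_bh;
  }
  if (candidates !== d.total_candidates) {
    problems.push('total_candidates is not the sum of families');
  }
  if (passed !== d.total_passed_bh) {
    problems.push('total_passed_bh is not the sum of families');
  }
  return problems;
}

// Values copied from the JSON example.
const exampleDiagnostics: CorrectDiagnostics = {
  alpha: 0.10,
  families: {
    same_day: { candidates: 412, passed_bh: 23 },
    lagged: { candidates: 287, passed_bh: 8 },
    trends: { candidates: 31, passed_bh: 5 },
  },
  total_candidates: 730,
  total_passed_bh: 36,
};

console.log(checkCorrectDiagnostics(exampleDiagnostics)); // prints []
```

The same invariants are what the updated test asserts: each family's `passed_bh` is bounded by its `candidates`, and the totals equal the sums across families.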
+
+**Code change scope**: ~20 lines in `pattern-spotter.ts` step 7. Split `allCandidates` by type into three arrays, call `benjaminiHochberg()` three times, merge the survivors. No changes to any other module.
+
+### 7.6 Comprehensive Diagnostics & Failure Isolation
 
 v2 logs diagnostics at each pipeline step, but v3 expands this significantly. The guiding principle: **every decision the pipeline makes should be traceable** — what was tested, what was skipped, and why.
 
@@ -1344,14 +1445,14 @@ metrics_impact: [{
 - Phase B: DONE — All 4 sync functions extract timezone (WHOOP offset in vendor_metadata + local activity_date, Fitbit IANA from profile API, Libre derived offset, Oura from Personal Info API)
 - Phase C: DONE — `daily-aggregation.ts` uses `LocalTimeExtractors` closure factory; glucose uses `display_time`
 - Phase D: DONE — UTC fallback, Pacific, DST, Tokyo, Kolkata tests (28 aggregation tests pass)
-- Phase E: NOT DONE — Backfill script for existing users without timezone, pipeline diagnostics timezone section
+- Phase E: UNNECESSARY — Mobile app auto-pushes timezone on launch; UTC fallback works for users who haven't opened the app
 
 **Phase 1: Foundation** — DONE (41 + 39 = 80 tests)
 - `statistical-tests.ts`: Wilcoxon signed-rank, chi-squared uniformity, Spearman ρ, `computeRanks` (20 new tests, 41 total)
 - `metric-type-catalog.ts`: 17 type patterns, label generation, `classifyMetric`/`getLabel`/`classifyWithLabel` (39 tests)
 - Migration `20260320000000_create_insight_engine_blacklist.sql`: table + 24 seed entries
 
-**Phase 2: Blacklist seeding** — NOT DONE
+**Phase 2: Blacklist seeding** — DONE (2026-03-18)
 - One-time Spearman correlation analysis script needed
 - Run against active users, review pairs with median ρ > 0.80
 - Insert confirmed tautological pairs into `insight_engine_blacklist`
 
@@ -392,20 +392,38 @@ describe('Pattern Spotter v2 — Full Pipeline', () => {
   // BH correction
   // =========================================================================
 
-  it('applies Benjamini-Hochberg correction and logs diagnostics', async () => {
+  it('applies BH correction per family (same-day, lagged, trends) and logs diagnostics', async () => {
     const data = generateWhoopData(60);
     mockDailySummaries = data.dailySummaries;
     mockSleepSessions = data.sleepSessions;
 
     const result = await spotPatterns({ lookback_days: 90 }, 'user-1', mockProvider);
     const diag = result.diagnostics as Record<string, unknown>;
 
-    // Should have correction diagnostics
-    if (diag.correct) {
-      const correct = diag.correct as Record<string, number>;
-      expect(correct.alpha).toBe(0.10);
-      expect(correct.passed_bh).toBeLessThanOrEqual(correct.total_tests);
+    // Should have correction diagnostics with family breakdown
+    expect(diag.correct).toBeDefined();
+    const correct = diag.correct as Record<string, unknown>;
+    expect(correct.alpha).toBe(0.10);
+
+    // Must have per-family results
+    const families = correct.families as Record<string, Record<string, number>>;
+    expect(families).toBeDefined();
+    expect(families.same_day).toBeDefined();
+    expect(families.lagged).toBeDefined();
+    expect(families.trends).toBeDefined();
+
+    // Each family reports candidates and passed_bh
+    for (const family of Object.values(families)) {
+      expect(typeof family.candidates).toBe('number');
+      expect(typeof family.passed_bh).toBe('number');
+      expect(family.passed_bh).toBeLessThanOrEqual(family.candidates);
     }
+
+    // Total is sum of families
+    const totalCandidates = Object.values(families).reduce((s, f) => s + f.candidates, 0);
+    const totalPassedBh = Object.values(families).reduce((s, f) => s + f.passed_bh, 0);
+    expect(correct.total_candidates).toBe(totalCandidates);
+    expect(correct.total_passed_bh).toBe(totalPassedBh);
   });
 
   // =========================================================================
 
@@ -803,21 +803,50 @@ export async function spotPatterns(
   }
 
   // =========================================================================
-  // 7. CORRECT — Benjamini-Hochberg FDR correction
+  // 7. CORRECT — Benjamini-Hochberg FDR correction per family
+  //
+  // Three families based on experimental design:
+  //   - Same-day: segment comparisons + intraday analyzers
+  //   - Lagged: day N behavior → day N+1 outcome
+  //   - Trends: directional shifts over weeks
+  //
+  // Running BH per family prevents cross-domain signals from being drowned
+  // by the test count of unrelated hypothesis families.
   // =========================================================================
 
-  const bhResults = benjaminiHochberg(allCandidates, 0.10);
-  const bhSignificant = bhResults.filter(c => c.bh_significant);
+  const LAGGED_TYPES = new Set(['lagged_effect']);
+  const TREND_TYPES = new Set(['trend']);
+  // Everything else (segment_comparison, temporal_distribution, excursion_cluster,
+  // sequential_change, stability_trend) is same-day family.
+
+  const sameDayCandidates = allCandidates.filter(c => !LAGGED_TYPES.has(c.type) && !TREND_TYPES.has(c.type));
+  const laggedCandidates = allCandidates.filter(c => LAGGED_TYPES.has(c.type));
+  const trendCandidates = allCandidates.filter(c => TREND_TYPES.has(c.type));
+
+  const BH_ALPHA = 0.10;
+
+  const sameDayBh = benjaminiHochberg(sameDayCandidates, BH_ALPHA);
+  const laggedBh = benjaminiHochberg(laggedCandidates, BH_ALPHA);
+  const trendBh = benjaminiHochberg(trendCandidates, BH_ALPHA);
+
+  const sameDaySurvivors = sameDayBh.filter(c => c.bh_significant);
+  const laggedSurvivors = laggedBh.filter(c => c.bh_significant);
+  const trendSurvivors = trendBh.filter(c => c.bh_significant);
+
+  const bhSignificant = [...sameDaySurvivors, ...laggedSurvivors, ...trendSurvivors];
 
   diagnostics.correct = {
-    alpha: 0.10,
-    total_tests: allCandidates.length,
-    passed_raw: allCandidates.length,
-    passed_bh: bhSignificant.length,
-    false_positives_removed: allCandidates.length - bhSignificant.length,
+    alpha: BH_ALPHA,
+    families: {
+      same_day: { candidates: sameDayCandidates.length, passed_bh: sameDaySurvivors.length },
+      lagged: { candidates: laggedCandidates.length, passed_bh: laggedSurvivors.length },
+      trends: { candidates: trendCandidates.length, passed_bh: trendSurvivors.length },
+    },
+    total_candidates: allCandidates.length,
+    total_passed_bh: bhSignificant.length,
   };
 
-  console.log(`[PatternSpotter] BH correction applied`, JSON.stringify(diagnostics.correct));
+  console.log(`[PatternSpotter] BH correction applied (per family)`, JSON.stringify(diagnostics.correct));
 
   // Map BH results back to PatternCandidate (strip BH fields for ranking)
   const bhCandidates: PatternCandidate[] = bhSignificant.map(c => ({