BH correction per family + blacklist waves 1-2 (212 entries)
Split BH correction into three independent families (same-day, lagged,
trends) to fix the hitchhiker effect where removing tautological pairs
caused genuine cross-domain signals to fail BH. Includes blacklist
wave 1 (112 entries) and wave 2 (76 entries) covering activity cluster,
intraday HR, HRV-recovery, and missing glucose/sleep pairs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These arrays are passed to the intraday analyzers in steps 6d-6g. They are NOT passed to the daily analyzers (6a-6c), which continue to use `DailyMetricRow[]`.
v2 ran BH correction across ALL candidates in a single pool. This worked because the pool was dominated by tautological pairs (steps↔distance, workout_duration↔workout_calories) with astronomically low p-values (p ≈ 10⁻⁸). These trivially significant pairs anchored the top of the BH ranking, creating a high watermark that let weaker cross-domain signals (activity→sleep, glucose→behavior) pass at lower ranks.
When the blacklist removes the tautological pairs, the remaining candidates are all genuine cross-domain signals with higher p-values (p ≈ 0.001-0.05). In a single pool of 700-900 tests, BH's threshold at rank 1 is `0.10/900 ≈ 0.000111`. If no candidate has p < 0.000111, nothing passes — even if there are 20 genuinely interesting findings with p < 0.01.
**The cross-domain findings didn't get weaker. They lost the tautological bodyguards that were inflating the BH watermark.** This is a known statistical phenomenon: removing true positives from a BH pool can cause previously-passing weaker signals to fail, because BH's thresholds are relative to the total test count.
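The hitchhiker effect can be reproduced with a toy sketch of the BH step-up procedure. This is illustrative only: the pipeline's actual `benjaminiHochberg()` implementation may differ, and the p-values below are invented to mirror the scenario described above.

```typescript
// Minimal sketch of the Benjamini-Hochberg step-up procedure.
// Returns the sorted p-values that survive at FDR level alpha.
function benjaminiHochbergSketch(pValues: number[], alpha: number): number[] {
  const sorted = [...pValues].sort((a, b) => a - b);
  const m = sorted.length;
  // Find the largest rank k with p_(k) <= (k / m) * alpha.
  let cutoff = -1;
  for (let k = m; k >= 1; k--) {
    if (sorted[k - 1] <= (k / m) * alpha) {
      cutoff = k;
      break;
    }
  }
  return cutoff === -1 ? [] : sorted.slice(0, cutoff);
}

// Three tautological anchors (p ≈ 1e-8) pull the watermark up, so the
// weak-but-real p = 0.02 and p = 0.04 signals ride along and pass...
const withAnchors = [1e-8, 1e-8, 1e-8, 0.02, 0.04, 0.2, 0.5, 0.8, 0.9, 0.95];
console.log(benjaminiHochbergSketch(withAnchors, 0.10).length); // 5 survivors

// ...but once the anchors are blacklisted, the same weak signals fail.
const withoutAnchors = [0.02, 0.04, 0.2, 0.5, 0.8, 0.9, 0.95];
console.log(benjaminiHochbergSketch(withoutAnchors, 0.10).length); // 0 survivors
```

Removing the three anchors changed nothing about the remaining p-values, yet the survivor count drops from five to zero.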
#### 7.5.2 The Solution: Three BH Families
Run BH correction independently on three candidate families: same-day, lagged, and trends.

Each family runs BH at α=0.10 independently. A cross-domain same-day finding with p=0.005 in a pool of 400 needs to beat threshold `k/400 × 0.10` — 2.25x more lenient at every rank than `k/900 × 0.10` in the combined pool. A trend with p=0.01 in a pool of 30 clears the rank-k threshold `k/30 × 0.10` from rank 3 onward — far more attainable than in the combined pool.
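As a quick numeric check of those thresholds (pool sizes 900/400/30 are the illustrative figures from this section):

```typescript
// Rank-k Benjamini-Hochberg threshold in a pool of m tests at FDR level alpha.
const bhThreshold = (k: number, m: number, alpha = 0.10): number => (k / m) * alpha;

// Rank-1 thresholds: combined pool vs. per-family pools.
console.log(bhThreshold(1, 900)); // ≈ 0.000111 (combined pool)
console.log(bhThreshold(1, 400)); // 0.00025    (same-day family: 2.25x more lenient)
console.log(bhThreshold(1, 30));  // ≈ 0.00333  (trend family)
```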
#### 7.5.3 Why Three Families Is Principled (Not P-Hacking)
The families correspond to **fundamentally different experimental designs**:
- **Same-day**: "On days when X is high, is Y also high?" — tests contemporaneous associations
- **Lagged**: "After days when X is high, is Y different the next day?" — tests delayed effects across a day boundary
- **Trends**: "Is X shifting over weeks?" — tests directional change over time, no segmentation involved
These are different kinds of hypotheses. The statistical strength of "your bedtime is trending earlier" should not depend on how many same-day activity-vs-sleep pairs were tested — they're unrelated questions. Forcing them into the same BH pool penalizes one for the other's noise.
The analogy: you wouldn't grade a math exam and an English essay on the same curve.
**The line we don't cross**: Splitting further (same-day-glucose, same-day-sleep, same-day-activity...) would create many tiny families where everything passes. That's gaming the math. Three families based on experimental design is defensible. Twenty families based on metric domain is not.
#### 7.5.4 Scaling as Data Grows
When new data sources are added (Garmin, Apple Watch, Dexcom), each family grows. More segment metrics × more outcome metrics = more candidates per family. BH gets stricter within each family proportionally.
This is fine because:
- More data per metric → stronger statistical power → lower p-values for genuine signals
- BH strictness and signal strength scale together
**When a single family gets too large (1000+ candidates)**, the correct response is:
1. **Blacklist maintenance** — review the family's candidates for new tautological pairs that slipped through
2. **Effect size thresholds** — raise `MIN_EFFECT_SIZE` for that family to filter weak candidates before BH
3. **NOT more family splits** — splitting families is a one-time architectural decision, not an ongoing tuning knob
#### 7.5.5 Implementation
**Pipeline step 7 changes from:**

```
7. CORRECT → BH FDR correction across ALL tests (single pool)
```

**To:**

```
7. CORRECT → BH FDR correction per family
   a. Same-day family: segment comparisons + intraday analyzer candidates (α=0.10)
   b. Lagged family: lagged effect candidates (α=0.10)
   c. Trend family: trend candidates (α=0.10)
   Merge all BH survivors into a single ranked list for steps 8-12.
```
Each candidate already carries a `type` field (`segment_comparison`, `lagged_effect`, `trend`, `temporal_distribution`, `excursion_cluster`, `sequential_change`, `stability_trend`). The family assignment:
| Candidate type | BH Family |
|---|---|
| `segment_comparison` | Same-day |
| `temporal_distribution` | Same-day |
| `excursion_cluster` | Same-day |
| `sequential_change` | Same-day |
| `stability_trend` | Same-day |
| `lagged_effect` | Lagged |
| `trend` | Trends |
After BH correction per family, survivors from all three families are merged, ranked by composite score, and flow through steps 8-12 (RANK → FILTER → NARRATE → PERSIST → LOG) unchanged.
**Code change scope**: ~20 lines in `pattern-spotter.ts` step 7. Split `allCandidates` by type into three arrays, call `benjaminiHochberg()` three times, merge the survivors. No changes to any other module.
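A sketch of what that ~20-line change could look like. The type and function shapes here are assumed for illustration; the real signatures live in `pattern-spotter.ts` and may differ.

```typescript
// Hypothetical candidate shape; the real type in pattern-spotter.ts may differ.
type CandidateType =
  | "segment_comparison" | "temporal_distribution" | "excursion_cluster"
  | "sequential_change" | "stability_trend" | "lagged_effect" | "trend";

interface Candidate {
  type: CandidateType;
  pValue: number;
}

type Family = "same-day" | "lagged" | "trends";

// Family assignment per the table above: everything except
// lagged_effect and trend is a same-day hypothesis.
function familyOf(c: Candidate): Family {
  if (c.type === "lagged_effect") return "lagged";
  if (c.type === "trend") return "trends";
  return "same-day";
}

// Step 7 sketch: split by family, run BH once per family, merge survivors.
// `benjaminiHochberg` is assumed to return the surviving subset at alpha.
function correctPerFamily(
  all: Candidate[],
  benjaminiHochberg: (cs: Candidate[], alpha: number) => Candidate[],
  alpha = 0.10,
): Candidate[] {
  const families: Record<Family, Candidate[]> = { "same-day": [], lagged: [], trends: [] };
  for (const c of all) families[familyOf(c)].push(c);
  return Object.values(families).flatMap((fam) =>
    fam.length ? benjaminiHochberg(fam, alpha) : [],
  );
}
```

The merged survivors then flow into the ranking and filtering steps unchanged.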
v2 logs diagnostics at each pipeline step, but v3 expands this significantly. The guiding principle: **every decision the pipeline makes should be traceable** — what was tested, what was skipped, and why.
- Phase B: DONE — All 4 sync functions extract timezone (WHOOP offset in vendor_metadata + local activity_date, Fitbit IANA from profile API, Libre derived offset, Oura from Personal Info API)