
Insight Engine v3 — Dynamic Metric Discovery & Intraday Analysis

Status: DEPLOYED — Phases 0-5 complete, blacklist seeded (212 entries), BH family correction live, batched AI narration live (prompt v5.0)
Author: Claude Code / schwaaamp
Date: 2026-03-16
Supersedes: pattern-spotter-v2-rewrite.md (complete, deployed)


Table of Contents

  1. Problem Statement
  2. Design Goals
  3. Architecture Overview
  4. Metric Discovery (replacing the hardcoded registry)
  5. Blacklisting Trivial Correlations
  6. Intraday Analyzers
  7. Pipeline Changes
  8. Candidate Schema & Narration
  9. Migration Strategy
  10. Open Questions
  11. Multi-Factor Insights (Future: Phase 6)
  12. Plan Review: Gaps, Risks, and Assumptions
  13. TDD Implementation Plan
  14. Migration Strategy
  15. Open Questions

1. Problem Statement

What v2 does well

The v2 pattern spotter is a rigorous statistical engine. It uses Mann-Whitney U tests for non-parametric comparison, Benjamini-Hochberg FDR correction for multiple testing, Cohen's d for effect sizes, and a clean 12-step pipeline. The statistical foundation is sound and should be preserved.

What v2 gets wrong

1. Hardcoded metric registry (62 entries, manually maintained)

Every metric in the system is individually defined in metric-registry.ts with its key, label, unit, source table, directionality, segmentation eligibility, outcome eligibility, provider list, and exclusion list. This creates several problems:

  • Adding a provider (e.g., Garmin, Apple Health) requires updating the registry, aggregation layer, data cleanser, and potentially the pattern spotter itself.
  • Adding a metric from an existing provider (e.g., WHOOP adds a new field to their API) requires the same multi-file update.
  • Staleness: If vendor_metadata starts containing new fields we don't know about, they're silently ignored. The system only sees what we've told it to look for.
  • Provider coupling: providers: ['whoop'] is fragile. What if two providers start reporting the same metric?

2. Manual tautology prevention (excludeOutcomes)

The current approach to preventing trivial correlations is excludeOutcomes on each metric definition:

{ key: 'strain', excludeOutcomes: ['recovery_score'] }
{ key: 'sleep_duration', excludeOutcomes: ['time_in_bed', 'sleep_efficiency'] }

This is incomplete and fragile. It catches a few known cases but misses many others:

  • steps as segment → distance_m as outcome (r ≈ 0.95, trivially obvious)
  • steps as segment → intraday_steps as outcome (same underlying data)
  • active_calories as segment → calories_total as outcome (one is a component of the other)
  • workout_duration as segment → workout_calories as outcome (highly correlated)
  • avg_glucose as segment → glucose_cv as outcome (derived from same readings)
  • Any glucose metric as segment → any other glucose metric as outcome (all from same CGM data)

Every time a new metric is added, someone must manually decide what to exclude. This doesn't scale.

3. Hardcoded lagged effect pairs

LAGGED_SEGMENT_OUTCOMES lists exactly which segment→outcome pairs to test with a 1-day lag. This means:

  • New metrics are never tested for lagged effects unless someone adds them
  • Cross-domain relationships (glucose → sleep, activity → glucose) are only tested if someone thinks to add them
  • The system can only find relationships it's been told to look for

4. No intraday analysis

The pipeline collapses all data into daily summaries before running any statistical tests. This destroys the temporal structure within each day — which is where some of the most actionable patterns live:

  • Time-of-day patterns (glucose higher in the evening, HR lower overnight)
  • Excursion clustering (glucose spikes cluster after dinner, low HR events cluster at 3am)
  • Dawn phenomenon (consistent glucose rise 3am→7am)
  • Intra-day stability changes over time

These were demonstrated in the standalone glucose trend analysis script (scripts/glucose-trend-analysis.mjs) and represent high-value findings that users can act on.

5. Segmentation eligibility is opinion-based

The canSegment flag is hand-assigned. There's no principled reason why steps can be a segment but resting_hr cannot. A metric should be eligible for segmentation if the data supports it (enough values, sufficient variance to create meaningful groups), not because someone decided it should be.


2. Design Goals

  1. Derive metrics from data, not configuration. The system should discover what metrics are available for a given user based on their integrated devices and the data that actually exists. Adding a new provider should require only extraction logic, not analysis configuration.

  2. Prevent trivial correlations automatically. Replace manual excludeOutcomes with data-driven blacklisting that catches tautological relationships without human maintenance.

  3. Surface intraday patterns. Add generic analyzers that work on any high-frequency metric (glucose, HR, steps) to find time-of-day patterns, excursion clustering, and stability trends — without hardcoding what to look for.

  4. Determine segmentation eligibility from data. Any metric with sufficient data and variance should be eligible for segmentation. No hand-assigned flags.

  5. Remove hardcoded lagged effect pairs. Test all plausible cross-domain lagged relationships. Let BH correction handle the increased test count.

  6. Preserve the statistical foundation. Mann-Whitney U, Cohen's d, BH correction, composite scoring, deduplication, discovery/observation classification — all of this works and should not change.


3. Architecture Overview

The v2 pipeline has three conceptual layers, but they're entangled:

v2 (entangled):
  Hardcoded Registry → Hardcoded Extraction → Hardcoded Analysis

v3 separates them cleanly:

v3 (layered):
  Layer 1: EXTRACTION (provider-aware, must be maintained)
    Transforms vendor_metadata into standardized metric values.
    This is the ONLY layer that knows about WHOOP vs Oura vs Fitbit.
    Output: Map<metricKey, number> per day + raw intraday arrays.

  Layer 2: DISCOVERY (data-driven, no configuration needed)
    Scans what metrics actually exist for this user.
    Classifies metrics by type (HR, percentage, duration, etc.).
    Computes pairwise correlations for blacklisting.
    Determines segmentation eligibility from variance.
    Output: DiscoveredMetric[] with type metadata + blacklist pairs.

  Layer 3: ANALYSIS (generic, metric-agnostic)
    Runs all applicable analyzers on all eligible metrics.
    BH correction, ranking, classification, narration.
    Does NOT know about specific metrics or providers.
    Output: PatternCandidate[] → discoveries + observations.

Key insight: The extraction layer is the only place that requires per-provider maintenance. Everything downstream is generic and self-configuring.


4. Metric Discovery

4.1 What MUST be configured (irreducible domain knowledge)

Some information cannot be derived from data alone:

| Knowledge | Why it can't be derived | Where it lives |
| --- | --- | --- |
| Vendor field paths | vendor_metadata.slow_wave_minutes vs vendor_metadata.stages.deep_minutes | Extraction layer (per-provider) |
| Unit conversions | WHOOP HRV is in seconds, Oura is in milliseconds | Extraction layer (per-provider) |
| Plausibility bounds | A heart rate of 300 bpm is physiologically impossible | Metric type catalog (~15 types) |
| Directionality | Higher HRV is generally better | Metric type catalog |
| Optimal ranges | Glucose 70-180 mg/dL is "in range" | Metric type catalog (only a few metrics) |
| Human-readable labels | sleep_deep_pct should display as "Deep Sleep %" | Label lookup table |

4.2 What CAN be derived from data

| Knowledge | How it's derived | Currently hardcoded as |
| --- | --- | --- |
| Which metrics exist for this user | Scan non-null keys in DailyMetricRow[] | SERVER_METRIC_REGISTRY entries |
| Which providers supply data | Query connected_devices + check which tables have rows | providers: ['whoop'] on each metric |
| Which metrics have enough data | Count non-null values (≥20 for analysis, ≥28 for trends) | canOutcome flag |
| Which metrics are suitable for segmentation | Sufficient data (≥20) + sufficient variance (CV > threshold) | canSegment flag |
| Which metric pairs are tautological | Pairwise Spearman correlation (ρ > 0.85) | excludeOutcomes arrays |
| Which lagged pairs are worth testing | Cross-domain classification by source table | LAGGED_SEGMENT_OUTCOMES |

4.3 The Metric Type Catalog (replaces per-metric registry)

Instead of 62 individual metric definitions, we define ~15 metric types. Each discovered metric is classified into a type via naming convention matching.

interface MetricTypeConfig {
  /** Patterns to match metric keys against */
  patterns: RegExp[];
  /** Display unit */
  unit: string;
  /** Plausibility bounds [min, max] — values outside are flagged as outliers */
  bounds: [number, number];
  /** General directionality for interpretation */
  direction: 'higher_is_better' | 'lower_is_better' | 'neutral';
  /** Clinical/physiological optimal range for excursion detection.
   *  If absent, analyzer uses user's personal P10/P90 */
  optimalRange?: [number, number];
}

const METRIC_TYPE_CATALOG: Record<string, MetricTypeConfig> = {
  heart_rate: {
    patterns: [/(_hr_|_hr$|^hr_|heart_rate|resting_hr)/],
    unit: 'bpm',
    bounds: [25, 220],
    direction: 'lower_is_better',
  },
  hrv: {
    patterns: [/hrv/],
    unit: 'ms',
    bounds: [1, 300],
    direction: 'higher_is_better',
  },
  glucose: {
    patterns: [/glucose/],
    unit: 'mg/dL',
    bounds: [20, 500],
    direction: 'lower_is_better',
    optimalRange: [70, 180],
  },
  percentage: {
    patterns: [/(_pct$|_percentage|efficiency|consistency)/],
    unit: '%',
    bounds: [0, 100],
    direction: 'neutral',
  },
  score_100: {
    patterns: [/(readiness_score|activity_score|recovery_score)/],
    unit: '/100',
    bounds: [0, 100],
    direction: 'higher_is_better',
  },
  strain: {
    patterns: [/strain/],
    unit: 'score',
    bounds: [0, 21],
    direction: 'neutral',
  },
  steps: {
    patterns: [/steps/],
    unit: 'steps',
    bounds: [0, 100_000],
    direction: 'higher_is_better',
  },
  distance: {
    patterns: [/distance/],
    unit: 'm',
    bounds: [0, 200_000],
    direction: 'higher_is_better',
  },
  calories: {
    patterns: [/calori/],
    unit: 'kcal',
    bounds: [0, 15_000],
    direction: 'neutral',
  },
  duration_hours: {
    patterns: [/sleep_duration|time_in_bed/],
    unit: 'hours',
    bounds: [0, 24],
    direction: 'neutral',
  },
  duration_min: {
    patterns: [/(_min$|_min_|latency|active_min|sedentary)/],
    unit: 'min',
    bounds: [0, 1440],
    direction: 'neutral',
  },
  temperature: {
    patterns: [/skin_temp/],
    unit: '°C',
    bounds: [30, 42],
    direction: 'neutral',
  },
  temp_deviation: {
    patterns: [/temp_deviation/],
    unit: '°C',
    bounds: [-5, 5],
    direction: 'neutral',
  },
  respiratory: {
    patterns: [/respiratory/],
    unit: 'brpm',
    bounds: [4, 40],
    direction: 'neutral',
  },
  spo2: {
    patterns: [/spo2/],
    unit: '%',
    bounds: [50, 100],
    direction: 'higher_is_better',
  },
  clock_time: {
    patterns: [/bedtime_hour|wake_hour/],
    unit: 'hour',
    bounds: [0, 24],
    direction: 'neutral',
  },
  count: {
    patterns: [/count|alarm/],
    unit: 'count',
    bounds: [0, 1000],
    direction: 'neutral',
  },
};

Classification algorithm:

  1. For each metric key found in the user's data, match against patterns in order
  2. First match wins → assigns unit, bounds, direction, optimalRange
  3. Unmatched metrics get conservative defaults: no bounds enforcement, neutral direction, no optimal range
  4. Override table (small, for edge cases where naming convention fails)
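The first-match-wins classification can be sketched as follows. This is illustrative, not the shipped implementation: the catalog here is abbreviated to two entries for self-containment, and classifyMetric is a hypothetical name.

```typescript
interface MetricTypeConfig {
  patterns: RegExp[];
  unit: string;
  bounds: [number, number];
  direction: 'higher_is_better' | 'lower_is_better' | 'neutral';
  optimalRange?: [number, number];
}

// Abbreviated two-entry catalog; the full METRIC_TYPE_CATALOG has ~15 types.
const CATALOG: Record<string, MetricTypeConfig> = {
  heart_rate: {
    patterns: [/(_hr_|_hr$|^hr_|heart_rate|resting_hr)/],
    unit: 'bpm', bounds: [25, 220], direction: 'lower_is_better',
  },
  glucose: {
    patterns: [/glucose/],
    unit: 'mg/dL', bounds: [20, 500], direction: 'lower_is_better',
    optimalRange: [70, 180],
  },
};

interface ClassifiedMetric extends Partial<MetricTypeConfig> {
  type: string;
}

// Walk the catalog in declaration order; first pattern match wins.
// Unmatched keys fall through to conservative defaults.
function classifyMetric(key: string): ClassifiedMetric {
  for (const [type, config] of Object.entries(CATALOG)) {
    if (config.patterns.some(p => p.test(key))) {
      return { type, ...config };
    }
  }
  return { type: 'unknown', direction: 'neutral' }; // no bounds enforcement, no optimal range
}
```

The override table for edge cases would be consulted before this loop, so a misnamed metric can be pinned to the correct type without changing the regexes.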

Label generation:

  • Maintain a METRIC_LABELS lookup table (simple Record<string, string>)
  • For unknown metrics, auto-generate from key: sleep_deep_pct → "Sleep Deep Pct"
  • Labels are cosmetic — the analysis layer never uses them for logic
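A minimal sketch of the lookup-then-auto-generate fallback (the resting_hr label is an invented example; only the sleep_deep_pct entry comes from the plan above):

```typescript
// Curated labels for known metrics; anything not listed is title-cased from its key.
const METRIC_LABELS: Record<string, string> = {
  sleep_deep_pct: 'Deep Sleep %',
  resting_hr: 'Resting Heart Rate', // invented example entry
};

function labelFor(key: string): string {
  return (
    METRIC_LABELS[key] ??
    key.split('_').map(w => w.charAt(0).toUpperCase() + w.slice(1)).join(' ')
  );
}
```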

4.4 Automatic Segmentation Eligibility

Replace canSegment: true/false with data-driven eligibility:

function isEligibleForSegmentation(values: number[]): boolean {
  // Need enough data points to create meaningful groups
  if (values.length < 20) return false;

  // Need sufficient variance — a metric where everyone is the same
  // doesn't create useful segments
  const m = mean(values);
  if (m === 0) return false;
  const coefficientOfVariation = (stddev(values) / Math.abs(m)) * 100;
  if (coefficientOfVariation < 10) return false;

  // Need a reasonable spread between P25 and P75
  // (if P25 ≈ P75, the binary splits won't create different groups)
  const p25 = percentile(values, 25);
  const p75 = percentile(values, 75);
  if (p75 - p25 < 0.01 * Math.abs(m)) return false;

  return true;
}

Boolean metrics (values are exclusively 0 and 1) get boolean segmentation if both groups have ≥ MIN_GROUP_SIZE members. Detected automatically:

function isBooleanMetric(values: number[]): boolean {
  return values.every(v => v === 0 || v === 1);
}
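Combining the two checks, boolean segmentation eligibility might look like this sketch (the MIN_GROUP_SIZE value of 10 is an assumption; the real constant lives in the pipeline configuration):

```typescript
const MIN_GROUP_SIZE = 10; // assumed value

function isBooleanSegmentEligible(values: number[]): boolean {
  // Must be exclusively 0/1 (same check as isBooleanMetric)
  if (!values.every(v => v === 0 || v === 1)) return false;
  // Both the 0-group and the 1-group need enough members to compare
  const ones = values.filter(v => v === 1).length;
  const zeros = values.length - ones;
  return ones >= MIN_GROUP_SIZE && zeros >= MIN_GROUP_SIZE;
}
```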

4.5 Data Density Classification

Each metric needs a dataDensity to determine which analyzers apply. This is derived from the source table:

| Source table | Data density | Reasoning |
| --- | --- | --- |
| glucose_data | intraday | ~288 readings/day (every 5 min) |
| intraday_data | intraday | HR and steps at 5-min intervals |
| daily_summary | daily | One row per day |
| sleep_sessions | daily | Aggregated to one row per wake-date |
| activities | daily | Aggregated to one row per day |
| glucose_alarms | daily | Count per day |

For intraday sources, the raw (pre-aggregation) readings are passed to the intraday analyzers in addition to the daily aggregates.
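Since the density is fixed per source table, it can be transcribed directly into a lookup, sketched here:

```typescript
type DataDensity = 'intraday' | 'daily';

// Source-table → density mapping, transcribed from the table above.
const SOURCE_TABLE_DENSITY: Record<string, DataDensity> = {
  glucose_data: 'intraday',  // ~288 readings/day
  intraday_data: 'intraday', // HR and steps at 5-min intervals
  daily_summary: 'daily',
  sleep_sessions: 'daily',
  activities: 'daily',
  glucose_alarms: 'daily',
};
```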


5. Blacklisting Trivial Correlations

5.1 The Problem

Users don't want to see: "On high daily steps days, your distance is 34% higher (p=0.0001)." This is tautological — steps and distance measure the same thing. The system should never surface this.

Current v2 approach: Manual excludeOutcomes per metric. Incomplete and fragile.

v3 draft approach (Spearman ρ at runtime): Computed per-user per-run. Problems:

  • Non-deterministic: Same pair blocked for one user (ρ=0.87) but allowed for another (ρ=0.83)
  • Threshold fragility: workout_duration↔workout_calories might be ρ=0.78 for yoga, ρ=0.95 for running
  • Wasted computation: ~1800 pairwise correlations every run, results largely stable
  • Not auditable: Blacklist computed and discarded — no way to review or override

5.2 Revised Approach: DB-Backed Blacklist

Replace both the hardcoded excludeOutcomes AND the runtime correlation computation with a single insight_engine_blacklist database table.

Principles:

  1. Empirically seeded — populated by analyzing what actually survives BH correction, not by guessing
  2. Stored in the database — editable at runtime, no code deployment needed
  3. Transparent — every entry has a reason and source, auditable via admin dashboard
  4. Deterministic — same blacklist for all users (consistent behavior)
  5. Evolvable — as new metrics appear, new tautological pairs surface in survivors → review → add

5.3 Database Schema

CREATE TABLE insight_engine_blacklist (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Pair stored in alphabetical order to prevent duplicates
  -- e.g., ("distance_m", "steps") not ("steps", "distance_m")
  metric_a TEXT NOT NULL,
  metric_b TEXT NOT NULL,

  -- Why this pair is blacklisted
  reason TEXT NOT NULL CHECK (reason IN (
    'same_source',            -- derived from the same raw data
    'definitional',           -- one is a mathematical function of the other
    'trivial_correlation',    -- empirically tautological (e.g., steps ↔ distance)
    'admin'                   -- manually added by admin for other reasons
  )),

  -- Human-readable explanation
  notes TEXT,

  -- How this entry was created
  source TEXT NOT NULL CHECK (source IN (
    'seed',                   -- initial seeding from known source groups
    'empirical_review',       -- added after reviewing BH survivors
    'correlation_analysis',   -- flagged by one-time correlation scan, confirmed by human
    'admin'                   -- manually added via admin dashboard
  )),

  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),

  UNIQUE(metric_a, metric_b)
);

-- Fast lookup for active entries
CREATE INDEX idx_insight_engine_blacklist_active
  ON insight_engine_blacklist(metric_a, metric_b) WHERE is_active = true;

Constraint: metric_a < metric_b (alphabetical ordering) enforced at insert time to prevent duplicate pairs stored in different orders.
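The ordering constraint can be enforced with a small normalization helper applied before every insert (a sketch; column names follow the schema above):

```typescript
// Returns the pair in alphabetical order so ("steps", "distance_m") and
// ("distance_m", "steps") always map to the same row (metric_a < metric_b).
function normalizePair(m1: string, m2: string): { metric_a: string; metric_b: string } {
  const [metric_a, metric_b] = m1 < m2 ? [m1, m2] : [m2, m1];
  return { metric_a, metric_b };
}
```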

5.4 Seeding Strategy

Step 1: Seed with known source-group pairs

These are relationships that are tautological by construction — derived from the same raw data or definitionally coupled. Inserted with source: 'seed':

| metric_a | metric_b | reason | notes |
| --- | --- | --- | --- |
| avg_glucose | glucose_cv | same_source | Both computed from glucose_data readings |
| avg_glucose | glucose_max | same_source | Both computed from glucose_data readings |
| avg_glucose | glucose_min | same_source | Both computed from glucose_data readings |
| avg_glucose | glucose_range | same_source | Both computed from glucose_data readings |
| avg_glucose | overnight_glucose | same_source | Both computed from glucose_data readings |
| avg_glucose | daytime_glucose | same_source | Both computed from glucose_data readings |
| avg_glucose | time_in_range | same_source | Both computed from glucose_data readings |
| avg_glucose | time_in_tight_range | same_source | Both computed from glucose_data readings |
| glucose_cv | time_in_range | same_source | Both computed from glucose_data readings |
| glucose_cv | time_in_tight_range | same_source | Both computed from glucose_data readings |
| glucose_max | glucose_min | same_source | Both computed from glucose_data readings |
| glucose_max | glucose_range | same_source | glucose_range = glucose_max - glucose_min |
| glucose_min | glucose_range | same_source | glucose_range = glucose_max - glucose_min |
| overnight_glucose | daytime_glucose | same_source | Both computed from glucose_data readings |
| sleep_deep_pct | sleep_light_pct | definitional | Sleep stage percentages sum to ~100% |
| sleep_deep_pct | sleep_rem_pct | definitional | Sleep stage percentages sum to ~100% |
| sleep_light_pct | sleep_rem_pct | definitional | Sleep stage percentages sum to ~100% |
| sleep_duration | sleep_efficiency | definitional | efficiency = duration / time_in_bed |
| sleep_duration | time_in_bed | definitional | duration ≤ time_in_bed always |
| sleep_efficiency | time_in_bed | definitional | efficiency = duration / time_in_bed |
| active_minutes_intraday | intraday_steps | same_source | Both from steps_delta in intraday_data |
| distance_m | steps | trivial_correlation | distance ≈ steps × stride length |
| intraday_steps | steps | same_source | Daily steps is sum of intraday steps |

(~23 seed entries covering the known structural relationships)

Step 2: One-time correlation analysis (seeding tool, not runtime)

Run a one-time script that:

  1. Fetches data for all active users with ≥30 days of data
  2. Computes Spearman ρ for all metric pairs per user
  3. Identifies pairs where median ρ across users > 0.80 (using median prevents one outlier user from skewing)
  4. Produces a review report: pair, median ρ, user count, sample of top correlations
  5. Human reviews the report → confirms which pairs are tautological
  6. Confirmed pairs inserted into insight_engine_blacklist with source: 'correlation_analysis'

This captures cross-table relationships that aren't obvious from source groups:

  • active_calories ↔ calories_total (one is a component of the other)
  • workout_duration ↔ workout_calories (highly correlated volume metrics)
  • steps ↔ active_calories (both measure physical activity)

Step 3: Ongoing maintenance via BH survivor review

After each pipeline run, the diagnostics log includes a novel_survivors list — BH survivors whose (segment, outcome) pair is NOT in the blacklist and has NOT been previously reviewed. The admin dashboard surfaces these for periodic review.

5.5 Pipeline Integration

At the start of step 6 (SCAN), the pipeline loads the active blacklist:

// Load once per run, cache as Set for O(1) lookup
const { data: blacklistRows } = await supabase
  .from('insight_engine_blacklist')
  .select('metric_a, metric_b')
  .eq('is_active', true);

const blacklist = new Set<string>(
  (blacklistRows ?? []).map(r => `${r.metric_a}::${r.metric_b}`)
);

// Check function used by all test types
function isBlacklisted(metricA: string, metricB: string): boolean {
  const pair = [metricA, metricB].sort().join('::');
  return blacklist.has(pair);
}

Before every statistical test:

For each (segment_metric, outcome_metric):
  if isBlacklisted(segment_metric, outcome_metric): SKIP, log skip reason

This applies uniformly to:

  • Same-day segment comparisons (step 6a)
  • Lagged comparisons (step 6b)
  • Any future test types

5.6 Diagnostics

{
  "blacklist": {
    "entries_loaded": 31,
    "tests_skipped": 147,
    "skipped_pairs_sample": [
      { "segment": "steps", "outcome": "distance_m", "reason": "trivial_correlation" },
      { "segment": "avg_glucose", "outcome": "time_in_range", "reason": "same_source" }
    ],
    "novel_survivors": [
      {
        "segment": "sedentary_min",
        "outcome": "active_calories",
        "change_pct": 18.3,
        "p_value": 0.002,
        "effect_size": 0.61,
        "note": "Not in blacklist — review recommended"
      }
    ]
  }
}

The novel_survivors list is the key feedback loop. It answers: "What new pair combinations survived BH that we haven't reviewed yet?" This feeds the admin review queue.

5.7 Admin Dashboard Integration

The admin insights dashboard gets a new "Blacklist" tab:

  1. Current blacklist: Table of all entries with reason, source, notes, created date, active status
  2. Review queue: Novel BH survivors from recent runs, grouped by pair, with user count and average statistics
  3. Actions: Add to blacklist (with reason/notes), dismiss (mark as reviewed but not blacklisted), deactivate existing entry
  4. Audit log: History of blacklist changes

5.8 Why This Is Better Than Runtime Correlation

| Aspect | Runtime Spearman (v3 draft) | DB blacklist (v3 revised) |
| --- | --- | --- |
| Determinism | Different per-user | Same for all users |
| Transparency | Computed and discarded | Fully auditable table |
| Editability | Change threshold in code, redeploy | Add/remove rows, instant |
| Cold start | Works immediately but may miss edge cases | Requires seeding, but seeding is systematic |
| Computation cost | ~1800 correlations per run | One DB query per run |
| False positives | ρ=0.86 blocks interesting pair | Human-reviewed, no false positives |
| False negatives | ρ=0.83 misses tautological pair | Caught in novel_survivor review |
| Ongoing maintenance | None (but also no learning) | Review queue surfaces new cases |

6. Intraday Analyzers

Four new analyzers that operate on raw intraday data (pre-aggregation). Each produces standard PatternCandidate[] that flow into the existing BH → rank → classify → persist pipeline unchanged.

6.1 Temporal Distribution Analyzer

Generic question: "Does this metric behave significantly differently at certain times of day or days of week?"

Applicable to: Any metric with dataDensity: 'intraday' and ≥200 readings in the lookback window.

Algorithm (time-of-day variant):

  1. Bucket all readings into N time-of-day segments (6 × 4-hour blocks by default)
  2. For each bucket, collect its values as Group A and all other values as Group B
  3. Mann-Whitney U test (already implemented) on A vs B
  4. Cohen's d for effect size
  5. Emit a candidate for each bucket that passes minimum thresholds

interface TemporalDistributionCandidate extends PatternCandidate {
  type: 'temporal_distribution';
  temporal_variant: 'time_of_day' | 'day_of_week';
  bucket_label: string;      // "Evening (17:00-21:00)" or "Saturday"
  bucket_mean: number;
  overall_mean: number;
  bucket_count: number;
  total_count: number;
}

What this discovers without hardcoding:

  • "Your glucose is 15% higher during Evening (17:00-21:00)" → the system doesn't know this is "post-dinner"; it just finds the statistical signal
  • "Your heart rate is 8% lower on weekends" → lifestyle differences
  • "Your glucose variability is highest during Morning (9:00-13:00)"

Day-of-week variant: Same algorithm, but bucket by day of week instead of time of day. Runs on the same intraday data.

Time bucket configuration: The bucket boundaries are NOT hardcoded to clinical periods. They're fixed-width divisions of the day (4-hour blocks). The statistics determine which blocks are significant, not domain knowledge about "dawn" or "post-prandial."
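The bucketing step can be sketched as below. Midnight-aligned 4-hour blocks are an assumption here (the plan only fixes the block width, not the boundaries), and Reading is a hypothetical shape for one intraday sample:

```typescript
interface Reading {
  timestamp: Date;
  value: number;
}

const BUCKET_HOURS = 4; // six fixed-width blocks per day

// Group readings by time-of-day bucket; each bucket is later tested
// (Mann-Whitney U) against the pooled readings of all other buckets.
function bucketByTimeOfDay(readings: Reading[]): Map<number, number[]> {
  const buckets = new Map<number, number[]>();
  for (const r of readings) {
    const b = Math.floor(r.timestamp.getHours() / BUCKET_HOURS);
    if (!buckets.has(b)) buckets.set(b, []);
    buckets.get(b)!.push(r.value);
  }
  return buckets;
}

function bucketLabel(b: number): string {
  const start = b * BUCKET_HOURS;
  return `${start}:00-${start + BUCKET_HOURS}:00`;
}
```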

6.2 Excursion Cluster Analyzer

Generic question: "When this metric leaves the user's normal range, do those events cluster at specific times?"

Applicable to: Any metric with dataDensity: 'intraday' and ≥50 excursion events.

Algorithm:

  1. Define excursion thresholds:
    • If metric has optimalRange: values outside [low, high]
    • If no optimalRange: values outside [P10, P90] of user's personal distribution
  2. Identify contiguous excursion events (≥3 consecutive readings outside range)
  3. Record the start-hour of each excursion
  4. Bucket excursion start times into N time-of-day segments
  5. Chi-squared test vs uniform distribution: are excursions evenly spread?
  6. If significantly clustered, emit a candidate for the peak bucket

interface ExcursionClusterCandidate extends PatternCandidate {
  type: 'excursion_cluster';
  excursion_direction: 'high' | 'low';
  peak_bucket_label: string;       // "Evening (17:00-21:00)"
  excursions_in_peak: number;
  total_excursions: number;
  expected_if_uniform: number;
  chi_squared: number;
}
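Step 2 (identifying contiguous excursion events) can be sketched as a single scan; excursionStarts is an illustrative name, and returning start indices (rather than timestamps) is a simplification for this sketch:

```typescript
// Scan an ordered series of readings for contiguous excursion events
// (≥3 consecutive readings outside [low, high]); returns each event's start index.
function excursionStarts(values: number[], low: number, high: number): number[] {
  const starts: number[] = [];
  let run = 0;
  for (let i = 0; i < values.length; i++) {
    const outside = values[i] < low || values[i] > high;
    if (outside) {
      run++;
      if (run === 3) starts.push(i - 2); // event confirmed on the 3rd reading
    } else {
      run = 0; // run broken — fewer than 3 readings is not an event
    }
  }
  return starts;
}
```

In the real pipeline the start index would be mapped back to a start-hour before bucketing (step 3).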

What this discovers:

  • "Your high glucose excursions (>180) cluster between 17:00-21:00 (42% of all excursions, expected 17%)"
  • "Your low heart rate events cluster between 0:00-4:00"

Significance: Chi-squared test with df = N_buckets - 1. To feed the common BH pipeline, the chi-squared statistic is converted to a p-value, with Cramér's V as the effect size measure.

6.3 Sequential Change Analyzer

Generic question: "Does this metric consistently change in a specific direction between two time points each day?"

This is the generalized dawn phenomenon detector — but it works for ANY intraday metric.

Applicable to: Any metric with dataDensity: 'intraday' and ≥20 days with data in both time buckets.

Algorithm:

  1. Divide the day into N buckets (e.g., 8 × 3-hour blocks)
  2. For each day, compute the mean metric value per bucket
  3. Compute the mean-per-bucket across all days (the "average daily profile")
  4. Find the steepest rise and steepest fall between adjacent buckets in the average profile (data-driven pair selection — NOT hardcoded to "3am vs 7am")
  5. For each candidate pair (bucket_A → bucket_B):
    a. For each day, compute delta = mean_in_B - mean_in_A
    b. Wilcoxon signed-rank test: is median delta significantly ≠ 0?
    c. Effect size: mean(deltas) / stddev(deltas)
    d. Consistency: % of days where delta has the same sign as the overall direction
  6. Emit a candidate if significant AND consistent on >50% of days

interface SequentialChangeCandidate extends PatternCandidate {
  type: 'sequential_change';
  from_bucket: string;    // "0:00-3:00"
  to_bucket: string;      // "6:00-9:00"
  mean_delta: number;
  median_delta: number;
  consistency_pct: number; // % of days showing this direction
  days_assessed: number;
}

Adaptive pair selection prevents combinatorial explosion. Instead of testing all N×(N-1)/2 bucket pairs, we only test:

  • The pair with the steepest positive gradient (catches dawn phenomenon)
  • The pair with the steepest negative gradient (catches evening decline)
  • At most 2-3 additional adjacent pairs near the gradient peaks

This keeps the test count low while finding the strongest signals.
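The gradient-based pair selection can be sketched as below (selectCandidatePairs is an illustrative name; the "2-3 additional adjacent pairs near the gradient peaks" are omitted for brevity):

```typescript
// Given the average daily profile (mean value per bucket, across days),
// return only the steepest-rise and steepest-fall adjacent bucket pairs.
function selectCandidatePairs(profile: number[]): Array<[number, number]> {
  let riseIdx = 0, fallIdx = 0;
  let maxRise = -Infinity, maxFall = Infinity;
  for (let i = 0; i < profile.length - 1; i++) {
    const delta = profile[i + 1] - profile[i];
    if (delta > maxRise) { maxRise = delta; riseIdx = i; }
    if (delta < maxFall) { maxFall = delta; fallIdx = i; }
  }
  const pairs: Array<[number, number]> = [[riseIdx, riseIdx + 1]];
  if (fallIdx !== riseIdx) pairs.push([fallIdx, fallIdx + 1]);
  return pairs;
}
```

With 8 buckets this tests at most a handful of pairs instead of all 28, which is what keeps the BH-corrected test count manageable.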

What this discovers:

  • "Your glucose consistently rises +24 mg/dL between 3:00-6:00 and 6:00-9:00 (72% of days)" — dawn phenomenon, discovered generically
  • "Your heart rate consistently drops -12 bpm between 21:00-0:00 and 0:00-3:00" — normal nocturnal HR decline
  • If the nocturnal HR decline is ABSENT, that's also interesting (but requires a different test — absence of expected pattern)

6.4 Stability Trend Analyzer

Generic question: "Is the variability of this metric changing over time?"

Applicable to: Any metric with ≥28 days of data. Works on both daily and intraday metrics.

Algorithm:

  1. For each day (or per-window for intraday), compute a variability score:
    • For intraday metrics: CV% of all readings that day, or per-window (e.g., overnight CV%)
    • For daily metrics: rolling 7-day CV% (computed from daily values)
  2. Collect all variability scores in date order
  3. Split into earlier-half vs recent-half
  4. Mann-Whitney U test: has variability significantly changed?
  5. Cohen's d for effect size

interface StabilityTrendCandidate extends PatternCandidate {
  type: 'stability_trend';
  variability_metric: string;     // "glucose_overnight_cv" or "sleep_duration_7d_cv"
  stability_direction: 'stabilizing' | 'destabilizing';
  earlier_variability: number;
  recent_variability: number;
}

Window definition for intraday metrics: The time windows are the same buckets used by the temporal distribution analyzer (e.g., Overnight 0:00-6:00, Morning 6:00-12:00, etc.). Stability is computed per window per day, then the trend is assessed across days for each window.

What this discovers:

  • "Your overnight glucose variability has decreased 28% over the past 8 weeks (stabilizing)"
  • "Your daytime heart rate variability has increased 15% (destabilizing)"
  • "Your sleep duration consistency has improved (7-day CV% decreased from 18% to 11%)"
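The rolling 7-day CV% for daily metrics can be sketched as below, assuming a date-ordered, gap-free series (gap handling and the intraday per-window variant are omitted):

```typescript
// Rolling 7-day coefficient of variation (%) over a series of daily values.
function rolling7dCV(daily: number[]): number[] {
  const out: number[] = [];
  for (let i = 6; i < daily.length; i++) {
    const win = daily.slice(i - 6, i + 1); // trailing 7-day window
    const m = win.reduce((a, b) => a + b, 0) / win.length;
    const sd = Math.sqrt(win.reduce((a, b) => a + (b - m) ** 2, 0) / win.length);
    out.push(m === 0 ? 0 : (sd / Math.abs(m)) * 100);
  }
  return out;
}
```

The resulting series is what gets split into earlier-half vs recent-half for the Mann-Whitney comparison in step 4.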

6.5 Wilcoxon Signed-Rank Test (new statistical test needed)

Required for the sequential change analyzer. The Mann-Whitney U test compares two independent samples; Wilcoxon signed-rank compares paired observations (same-day deltas).

interface WilcoxonResult {
  W: number;
  z: number;
  p: number;
}

function wilcoxonSignedRank(deltas: number[]): WilcoxonResult | null {
  // Remove zero deltas (standard practice for the signed-rank test)
  const nonZero = deltas.filter(d => d !== 0);
  if (nonZero.length < 7) return null; // too few pairs for the normal approximation

  // Rank absolute values (computeRanks: shared ranking helper), track signs
  const absRanked = computeRanks(nonZero.map(Math.abs));

  let wPlus = 0, wMinus = 0;
  for (let i = 0; i < nonZero.length; i++) {
    if (nonZero[i] > 0) wPlus += absRanked[i];
    else wMinus += absRanked[i];
  }

  const W = Math.min(wPlus, wMinus);
  const n = nonZero.length;
  const meanW = n * (n + 1) / 4;
  const varW = n * (n + 1) * (2 * n + 1) / 24;
  // Normal approximation with continuity correction (-0.5);
  // the tie correction to varW is omitted for simplicity
  const z = (Math.abs(W - meanW) - 0.5) / Math.sqrt(varW);
  const p = 2 * (1 - normalCDF(Math.abs(z)));

  return { W, z, p };
}

6.6 Chi-Squared Test (new statistical test needed)

Required for the excursion cluster analyzer.

interface ChiSquaredResult {
  chiSq: number;
  df: number;
  p: number;
  cramersV: number;
}

function chiSquaredUniformity(observed: number[]): ChiSquaredResult | null {
  const total = observed.reduce((a, b) => a + b, 0);
  if (total < 20) return null;

  const k = observed.length;
  const expected = total / k;

  let chiSq = 0;
  for (const o of observed) {
    chiSq += (o - expected) ** 2 / expected;
  }

  const df = k - 1;
  // Chi-squared to p-value via Wilson-Hilferty approximation
  const z = Math.pow(chiSq / df, 1/3) - (1 - 2 / (9 * df));
  const zNorm = z / Math.sqrt(2 / (9 * df));
  const p = 1 - normalCDF(zNorm);

  const cramersV = Math.sqrt(chiSq / (total * (k - 1)));

  return { chiSq, df, p, cramersV };
}

7. Pipeline Changes

7.1 Updated Pipeline

 1. FETCH       → UNCHANGED (already fetches all 6 tables + pagination)
 2. CLEANSE     → UNCHANGED (plausibility bounds validation)
 3. AGGREGATE   → UNCHANGED (daily rollups)
                   NEW: also pass raw intraday arrays forward
 4. MERGE       → UNCHANGED (build DailyMetricRow[])
 5. DISCOVER    → NEW STEP (replaces GENERATE)
                   a. Scan DailyMetricRow[] for all metric keys with data
                   b. Classify each metric by type (pattern matching)
                   c. Determine segmentation eligibility (data-driven)
                   d. Load blacklist from insight_blacklist table
                   e. Generate dynamic segments for eligible metrics
 6. SCAN        → EXPANDED (each analyzer wrapped in try/catch — one failure doesn't crash pipeline)
                   a. Same-day segment comparisons (existing, blacklist-checked)
                   b. Lagged comparisons — ALL cross-domain pairs (no hardcoded list, blacklist-checked)
                   c. Trend detection (existing)
                   d. Temporal distribution analysis (NEW — intraday)
                   e. Excursion clustering (NEW — intraday)
                   f. Sequential change detection (NEW — intraday)
                   g. Stability trend analysis (NEW — daily + intraday)
 7. CORRECT     → BH FDR correction PER FAMILY (see Section 7.5)
                   a. Same-day family: segment comparisons + intraday candidates (α=0.10)
                   b. Lagged family: lagged effect candidates (α=0.10)
                   c. Trend family: trend candidates (α=0.10)
                   Merge all survivors → single ranked list for steps 8-12
 8. RANK        → UNCHANGED (composite score: |d| × -log₁₀(p))
 9. FILTER      → UNCHANGED (dedup + classify)
10. NARRATE     → EXPANDED (new templates for 4 new candidate types)
11. PERSIST     → UNCHANGED (insert into user_discoveries)
12. LOG         → EXPANDED (see Section 7.6 for comprehensive diagnostics spec)


7.2 Step 5 Detail: DISCOVER

This replaces the current GENERATE step and subsumes the hardcoded metric registry:

interface DiscoveredMetric {
  key: string;
  label: string;
  unit: string;
  type: string;                    // from MetricTypeCatalog
  bounds: [number, number] | null;
  direction: 'higher_is_better' | 'lower_is_better' | 'neutral';
  optimalRange: [number, number] | null;
  dataDensity: 'daily' | 'intraday';
  sourceGroup: string;
  dataPoints: number;              // non-null count
  isEligibleForSegmentation: boolean;
  isBooleanMetric: boolean;
}

interface DiscoveryResult {
  metrics: DiscoveredMetric[];
  segments: DynamicSegment[];      // generated from eligible metrics
  blacklist: Set<string>;          // "metricA::metricB" pairs from DB table
  diagnostics: {
    total_metrics_found: number;
    metrics_with_sufficient_data: number;
    metrics_eligible_for_segmentation: number;
    boolean_metrics: number;
    metrics_by_type: Record<string, number>;      // e.g., { heart_rate: 4, glucose: 9, ... }
    metrics_by_source: Record<string, number>;    // e.g., { daily_summary: 12, glucose_data: 9, ... }
    segmentation_rejections: Array<{              // why metrics weren't eligible
      metric_key: string;
      reason: 'insufficient_data' | 'insufficient_variance' | 'no_iqr_spread';
    }>;
    blacklist_entries_loaded: number;
    segments_generated: number;
  };
}
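Step 5a (scanning DailyMetricRow[] for metric keys with data) can be sketched as a single pass over the merged rows. The DailyMetricRow shape and the 14-point minimum are illustrative assumptions, not the production types:

```typescript
// Sketch of step 5a: count non-null data points per metric key, then keep
// only keys with enough data to analyze. Shapes and thresholds are illustrative.
interface DailyMetricRow {
  date: string;                              // YYYY-MM-DD
  metrics: Record<string, number | null>;    // all metric keys found in the data
}

function discoverMetricKeys(
  rows: DailyMetricRow[],
  minDataPoints = 14,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    for (const [key, value] of Object.entries(row.metrics)) {
      if (value !== null && Number.isFinite(value)) {
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  // Keys below the minimum are reported in diagnostics rather than analyzed.
  return new Map([...counts].filter(([, n]) => n >= minDataPoints));
}
```

The surviving key→count map feeds classification (5b) and segmentation eligibility (5c), and its size becomes metrics_with_sufficient_data in the diagnostics.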

7.3 Step 6b: Removing Hardcoded Lagged Pairs

Currently LAGGED_SEGMENT_OUTCOMES lists exactly which pairs to test. Instead:

Cross-group lagged testing: Any segment metric from the "daytime" domain can be tested against any outcome metric from the "sleep/recovery" domain with a 1-day lag.

Domain classification (derived from source table and metric type):

  • Daytime domain: Metrics from daily_summary (steps, strain, active_calories, etc.), activities (workout_*), glucose_data daytime aggregates
  • Sleep domain: Metrics from sleep_sessions (sleep_duration, sleep_deep_pct, etc.)
  • Recovery domain: Recovery/readiness scores, resting HR, HRV

Lagged tests are only run for segment→outcome pairs where:

  • Segment is from the daytime domain
  • Outcome is from the sleep or recovery domain
  • The pair is not in the blacklist
  • Both groups meet MIN_GROUP_SIZE

Effect on test count: Currently ~42 lagged tests. Estimated increase to ~150-200 lagged tests (all eligible daytime segments × all sleep/recovery outcomes). BH correction handles this automatically — the per-test threshold tightens proportionally.

The minimum effect size for lagged tests remains stricter than same-day (currently 7% vs 5%) to account for the weaker expected signal.
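The eligibility rule above can be sketched as follows. The source-table names follow this document; the recovery-key patterns and the "segment::outcome" blacklist key format are assumptions for illustration:

```typescript
// Sketch of the 7.3 rule: daytime segment → sleep/recovery outcome, not
// blacklisted. Recovery keys are matched before table-based classification.
type Domain = 'daytime' | 'sleep' | 'recovery' | 'other';

function classifyDomain(sourceTable: string, key: string): Domain {
  // Hypothetical key patterns — recovery/readiness scores, resting HR, HRV
  if (/^(recovery|readiness)/.test(key) || key === 'resting_hr' || key === 'hrv') return 'recovery';
  if (sourceTable === 'sleep_sessions') return 'sleep';
  if (sourceTable === 'daily_summary' || sourceTable === 'activities') return 'daytime';
  if (sourceTable === 'glucose_data' && key.includes('daytime')) return 'daytime';
  return 'other';
}

function isLaggedPairEligible(
  segment: { key: string; sourceTable: string },
  outcome: { key: string; sourceTable: string },
  blacklist: Set<string>, // "metricA::metricB" pairs, as in DiscoveryResult
): boolean {
  const segDomain = classifyDomain(segment.sourceTable, segment.key);
  const outDomain = classifyDomain(outcome.sourceTable, outcome.key);
  return (
    segDomain === 'daytime' &&
    (outDomain === 'sleep' || outDomain === 'recovery') &&
    !blacklist.has(`${segment.key}::${outcome.key}`)
  );
}
```

MIN_GROUP_SIZE is checked separately at segmentation time, so it is omitted here.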

7.4 Passing Intraday Data Through the Pipeline

Currently, raw glucose and intraday data are consumed entirely by the aggregation step and discarded. For the new analyzers, we need to pass the raw arrays through:

// After step 3 (AGGREGATE), retain raw arrays for intraday analyzers:
interface IntradayDataBundle {
  /** User's IANA timezone (from user_profiles.timezone, may be null) */
  timezone: string | null;

  /** Raw glucose readings, sorted by timestamp.
   *  localTimestamp uses display_time (user's local wall-clock time)
   *  with fallback to glucose_timestamp if display_time is null.
   *  Per timezone-strategy.md: display_time is already local — no IANA
   *  conversion needed. Use getUTCHours() for local hour, substring(0,10) for local date. */
  glucoseReadings: Array<{
    localTimestamp: string;    // display_time preferred (already user-local)
    utcTimestamp: string;      // glucose_timestamp (factory/UTC, for reference)
    value: number;
    trend: string | null;
  }>;

  /** Raw intraday HR/steps, sorted by timestamp.
   *  Fitbit timestamps are already user-local (no Z suffix).
   *  Oura timestamps are UTC and require IANA conversion.
   *  The bundle normalizes all to local at construction time using
   *  extractLocalDate/extractLocalHour from timezone-strategy.md. */
  intradayReadings: Array<{
    localTimestamp: string;    // Normalized to user-local time
    heartRate: number | null;
    stepsDelta: number | null;
  }>;
}

Normalization at bundle construction: All intraday timestamps are converted to user-local time when the bundle is built (after step 3, before step 6). This means the intraday analyzers don't need to know about provider differences or timezone conversion — they always receive local timestamps. For Fitbit data (already local), this is a no-op. For Oura data (UTC), this applies extractLocalDate/extractLocalHour using the user's IANA timezone.

These arrays are passed to the intraday analyzers in step 6d-6g. They are NOT passed to the daily analyzers (6a-6c) which continue to use DailyMetricRow[].
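A sketch of the already-local convention (Fitbit intraday, glucose display_time). Oura's UTC rows would instead go through extractLocalDate/extractLocalHour with the user's IANA timezone; that path is not shown here, and the "no offset suffix" input format is an assumption:

```typescript
// Local components from an already-local wall-clock timestamp: force a "Z"
// so the string parses as UTC, then read the UTC fields — no real timezone
// conversion happens. Assumes "YYYY-MM-DDTHH:mm:ss" inputs without an offset.
function localHour(localTimestamp: string): number {
  const iso = localTimestamp.endsWith('Z') ? localTimestamp : localTimestamp + 'Z';
  return new Date(iso).getUTCHours();
}

function localDate(localTimestamp: string): string {
  return localTimestamp.substring(0, 10); // "YYYY-MM-DD"
}
```

Using getHours() instead would re-interpret the wall-clock value in the server's timezone, which is exactly the bug this convention avoids.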

7.5 BH Correction Families (Step 7)

7.5.1 The Problem: Hitchhiker Effect

v2 ran BH correction across ALL candidates in a single pool. This worked because the pool was dominated by tautological pairs (steps↔distance, workout_duration↔workout_calories) with astronomically low p-values (p ≈ 10⁻⁸). These trivially significant pairs anchored the top of the BH ranking, creating a high watermark that let weaker cross-domain signals (activity→sleep, glucose→behavior) pass at lower ranks.

When the blacklist removes the tautological pairs, the remaining candidates are all genuine cross-domain signals with higher p-values (p ≈ 0.001-0.05). In a single pool of 700-900 tests, BH's threshold at rank 1 is 0.10/900 ≈ 0.000111. If no candidate has p < 0.000111, nothing passes — even if there are 20 genuinely interesting findings with p < 0.01.

The cross-domain findings didn't get weaker. They lost the tautological bodyguards that were inflating the BH watermark. This is a known statistical phenomenon: removing true positives from a BH pool can cause previously-passing weaker signals to fail, because BH's thresholds are relative to the total test count.

7.5.2 The Solution: Three BH Families

Run BH correction independently on three candidate families:

| Family | What it tests | Typical size |
| --- | --- | --- |
| Same-day | Segment comparisons + intraday analyzers (steps 6a, 6d-6g) | 300-500 candidates |
| Lagged | Day N behavior → Day N+1 outcome (step 6b) | 150-300 candidates |
| Trends | Earlier-half vs recent-half directional shifts (step 6c) | 20-50 candidates |

Each family runs BH at α=0.10 independently. A cross-domain same-day finding with p=0.005 in a pool of 400 needs to beat threshold k/400 × 0.10 — 2-3x more lenient than k/900 × 0.10 in the combined pool. A trend with p=0.01 in a pool of 30 easily passes at k/30 × 0.10.

7.5.3 Why Three Families Is Principled (Not P-Hacking)

The families correspond to fundamentally different experimental designs:

  • Same-day: "On days when X is high, is Y also high?" — tests contemporaneous associations
  • Lagged: "After days when X is high, is Y different the next day?" — tests delayed effects across a day boundary
  • Trends: "Is X shifting over weeks?" — tests directional change over time, no segmentation involved

These are different kinds of hypotheses. The statistical strength of "your bedtime is trending earlier" should not depend on how many same-day activity-vs-sleep pairs were tested — they're unrelated questions. Forcing them into the same BH pool penalizes one for the other's noise.

The analogy: you wouldn't grade a math exam and an English essay on the same curve.

The line we don't cross: Splitting further (same-day-glucose, same-day-sleep, same-day-activity...) would create many tiny families where everything passes. That's gaming the math. Three families based on experimental design is defensible. Twenty families based on metric domain is not.

7.5.4 Scaling as Data Grows

When new data sources are added (Garmin, Apple Watch, Dexcom), each family grows. More segment metrics × more outcome metrics = more candidates per family. BH gets stricter within each family proportionally.

This is fine because:

  • More data per metric → stronger statistical power → lower p-values for genuine signals
  • BH strictness and signal strength scale together

When a single family gets too large (1000+ candidates), the correct response is:

  1. Blacklist maintenance — review the family's candidates for new tautological pairs that slipped through
  2. Effect size thresholds — raise MIN_EFFECT_SIZE for that family to filter weak candidates before BH
  3. NOT more family splits — splitting families is a one-time architectural decision, not an ongoing tuning knob

7.5.5 Implementation

Pipeline step 7 changes from:

7. CORRECT → BH FDR correction across ALL tests (single pool)

To:

7. CORRECT → BH FDR correction per family
   a. Same-day family: segment comparisons + intraday analyzer candidates (α=0.10)
   b. Lagged family: lagged effect candidates (α=0.10)
   c. Trend family: trend candidates (α=0.10)
   Merge all BH survivors into a single ranked list for steps 8-12.

Each candidate already carries a type field (segment_comparison, lagged_effect, trend, temporal_distribution, excursion_cluster, sequential_change, stability_trend). The family assignment:

| Candidate type | BH Family |
| --- | --- |
| segment_comparison | Same-day |
| temporal_distribution | Same-day |
| excursion_cluster | Same-day |
| sequential_change | Same-day |
| stability_trend | Same-day |
| lagged_effect | Lagged |
| trend | Trends |

After BH correction per family, survivors from all three families are merged, ranked by composite score, and flow through steps 8-12 (RANK → FILTER → NARRATE → PERSIST → LOG) unchanged.

Diagnostics update:

{
  "correct": {
    "alpha": 0.10,
    "families": {
      "same_day": { "candidates": 412, "passed_bh": 23 },
      "lagged": { "candidates": 287, "passed_bh": 8 },
      "trends": { "candidates": 31, "passed_bh": 5 }
    },
    "total_candidates": 730,
    "total_passed_bh": 36
  }
}

Code change scope: ~20 lines in pattern-spotter.ts step 7. Split allCandidates by type into three arrays, call benjaminiHochberg() three times, merge the survivors. No changes to any other module.
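That ~20-line change can be sketched as below. benjaminiHochberg's real signature in pattern-spotter.ts may differ; the grouping map mirrors the table in 7.5.5:

```typescript
// Sketch of the step-7 family split: bucket candidates by type, run BH
// independently per bucket, merge survivors for ranking.
type BHFamily = 'same_day' | 'lagged' | 'trends';

const FAMILY_BY_TYPE: Record<string, BHFamily> = {
  segment_comparison: 'same_day',
  temporal_distribution: 'same_day',
  excursion_cluster: 'same_day',
  sequential_change: 'same_day',
  stability_trend: 'same_day',
  lagged_effect: 'lagged',
  trend: 'trends',
};

function correctPerFamily<T extends { type: string; p_value: number }>(
  candidates: T[],
  benjaminiHochberg: (cands: T[], alpha: number) => T[], // injected for this sketch
  alpha = 0.10,
): T[] {
  const buckets: Record<BHFamily, T[]> = { same_day: [], lagged: [], trends: [] };
  for (const c of candidates) {
    const family = FAMILY_BY_TYPE[c.type];
    if (family) buckets[family].push(c); // unknown types are skipped
  }
  return (Object.keys(buckets) as BHFamily[]).flatMap(f =>
    benjaminiHochberg(buckets[f], alpha),
  );
}
```

Because each bucket is corrected in isolation, a small trends family no longer competes against hundreds of same-day tests for its BH threshold.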

7.6 Comprehensive Diagnostics & Failure Isolation

v2 logs diagnostics at each pipeline step, but v3 expands this significantly. The guiding principle: every decision the pipeline makes should be traceable — what was tested, what was skipped, and why.

7.6.1 Failure Isolation

Each analyzer in step 6 runs inside a try/catch. If one analyzer throws, the pipeline logs the error and continues with remaining analyzers. No single analyzer failure crashes the run.

// Step 6: SCAN — each analyzer isolated
const analyzerResults: Record<string, { candidates: PatternCandidate[]; error?: string }> = {};

for (const [name, analyzerFn] of Object.entries(analyzers)) {
  try {
    const candidates = analyzerFn(data);
    analyzerResults[name] = { candidates };
    allCandidates.push(...candidates);
  } catch (err) {
    // `err` is unknown in TS catch clauses — normalize before use
    const message = err instanceof Error ? err.message : String(err);
    console.error(`[PatternSpotter] Analyzer "${name}" failed: ${message}`);
    analyzerResults[name] = { candidates: [], error: message };
  }
}

The run continues with whatever candidates were produced by the surviving analyzers. The diagnostics record which analyzers succeeded, failed, or were skipped.

7.6.2 Per-Analyzer Diagnostics

Each analyzer reports its own diagnostics independently:

interface AnalyzerDiagnostics {
  status: 'success' | 'error' | 'skipped';
  error_message?: string;
  metrics_tested: number;
  data_points_processed: number;
  candidates_produced: number;
  skip_reasons: Record<string, number>;  // e.g., { insufficient_data: 3, no_significant_bucket: 2 }
  elapsed_ms: number;
}

Rolled up into the run diagnostics:

{
  "scan": {
    "same_day": { "status": "success", "metrics_tested": 342, "candidates_produced": 28, "elapsed_ms": 180 },
    "lagged": { "status": "success", "metrics_tested": 156, "candidates_produced": 8, "elapsed_ms": 95 },
    "trend": { "status": "success", "metrics_tested": 41, "candidates_produced": 3, "elapsed_ms": 22 },
    "temporal_distribution": { "status": "success", "metrics_tested": 2, "data_points_processed": 18420, "candidates_produced": 4, "elapsed_ms": 85 },
    "excursion_cluster": { "status": "success", "metrics_tested": 1, "data_points_processed": 17160, "candidates_produced": 2, "elapsed_ms": 45 },
    "sequential_change": { "status": "success", "metrics_tested": 2, "data_points_processed": 18420, "candidates_produced": 1, "elapsed_ms": 62 },
    "stability_trend": { "status": "error", "error_message": "Cannot read property 'length' of undefined", "candidates_produced": 0, "elapsed_ms": 3 },
    "total_tests": 544,
    "total_candidates": 46
  }
}

7.6.3 Blacklist Diagnostics

Logged during step 6 as tests are skipped:

{
  "blacklist": {
    "entries_loaded": 31,
    "tests_skipped": 147,
    "skipped_by_reason": {
      "same_source": 98,
      "definitional": 31,
      "trivial_correlation": 14,
      "admin": 4
    },
    "skipped_pairs_sample": [
      { "segment": "steps", "outcome": "distance_m", "test_type": "same_day" },
      { "segment": "avg_glucose", "outcome": "time_in_range", "test_type": "same_day" },
      { "segment": "avg_glucose", "outcome": "glucose_cv", "test_type": "lagged" }
    ]
  }
}

7.6.4 Novel Survivor Detection

After BH correction (step 7), compare surviving candidates against the blacklist to identify novel pair combinations — pairs that survived BH but aren't in the blacklist and haven't been seen in previous runs. These feed the admin review queue.

{
  "novel_survivors": [
    {
      "segment_metric": "sedentary_min",
      "outcome_metric": "active_calories",
      "test_type": "same_day",
      "change_pct": -18.3,
      "p_value": 0.002,
      "effect_size": 0.61,
      "classified_as": "discovery",
      "review_status": "not_in_blacklist"
    }
  ]
}

To determine if a survivor is truly "novel" (vs something seen before), the pipeline can check a blacklist_review_log or simply rely on the admin dashboard to deduplicate — survivors that already exist as active discoveries are not novel.

7.6.5 Discovery Step Diagnostics

Step 5 (DISCOVER) logs the full metric discovery process:

{
  "discover": {
    "total_metric_keys_found": 47,
    "metrics_with_sufficient_data": 38,
    "metrics_by_type": {
      "heart_rate": 6,
      "glucose": 9,
      "percentage": 8,
      "duration_min": 5,
      "steps": 2,
      "score_100": 3,
      "strain": 2,
      "unmatched": 3
    },
    "metrics_by_source_table": {
      "daily_summary": 18,
      "sleep_sessions": 14,
      "activities": 7,
      "glucose_data": 9,
      "intraday_data": 4,
      "derived": 2
    },
    "segmentation_eligible": 11,
    "segmentation_rejected": [
      { "key": "spo2", "reason": "insufficient_variance", "cv": 2.1 },
      { "key": "respiratory_rate", "reason": "insufficient_data", "count": 8 }
    ],
    "boolean_metrics": 2,
    "segments_generated": 29
  }
}

7.6.6 Full Run Diagnostics Schema (v3)

The complete insight_engine_runs.diagnostics JSONB for v3:

{
  "engine_version": "v3",
  "user_id": "...",
  "lookback_days": 90,
  "start_date": "2026-01-15",
  "end_date": "2026-03-16",

  "fetch": {
    "daily_summary": 62, "sleep_sessions": 58, "activities": 34,
    "glucose_data": 17160, "glucose_alarms": 12, "intraday_data": 8640,
    "glucose_data_pages": 18, "intraday_data_pages": 9
  },

  "cleanse": {
    "outliers_flagged": 3,
    "devices_affected": ["libre"],
    "details": [{ "date": "2026-02-14", "metric_key": "glucose_value", "value": 18, "valid_range": "20-500" }]
  },

  "aggregate": {
    "daily_summary_dates": 62, "sleep_dates": 58, "activities_dates": 28,
    "glucose_dates": 59, "intraday_dates": 60
  },

  "merge": {
    "dates_with_data": 62,
    "metric_availability": { "steps": "62/62", "avg_glucose": "59/62", "sleep_duration": "58/62" }
  },

  "discover": {
    "total_metric_keys_found": 47,
    "metrics_with_sufficient_data": 38,
    "metrics_by_type": { "heart_rate": 6, "glucose": 9, "percentage": 8 },
    "metrics_by_source_table": { "daily_summary": 18, "sleep_sessions": 14 },
    "segmentation_eligible": 11,
    "segmentation_rejected": [{ "key": "spo2", "reason": "insufficient_variance", "cv": 2.1 }],
    "boolean_metrics": 2,
    "segments_generated": 29,
    "blacklist_entries_loaded": 31
  },

  "scan": {
    "same_day": { "status": "success", "metrics_tested": 342, "candidates_produced": 28, "elapsed_ms": 180 },
    "lagged": { "status": "success", "metrics_tested": 156, "candidates_produced": 8, "elapsed_ms": 95 },
    "trend": { "status": "success", "metrics_tested": 41, "candidates_produced": 3, "elapsed_ms": 22 },
    "temporal_distribution": { "status": "success", "metrics_tested": 2, "candidates_produced": 4, "elapsed_ms": 85 },
    "excursion_cluster": { "status": "success", "metrics_tested": 1, "candidates_produced": 2, "elapsed_ms": 45 },
    "sequential_change": { "status": "success", "metrics_tested": 2, "candidates_produced": 1, "elapsed_ms": 62 },
    "stability_trend": { "status": "success", "metrics_tested": 3, "candidates_produced": 1, "elapsed_ms": 35 },
    "total_tests": 544,
    "total_candidates": 47
  },

  "blacklist": {
    "entries_loaded": 31,
    "tests_skipped": 147,
    "skipped_by_reason": { "same_source": 98, "definitional": 31, "trivial_correlation": 14, "admin": 4 }
  },

  "correct": {
    "alpha": 0.10,
    "total_candidates": 47,
    "passed_bh": 19,
    "false_positives_removed": 28
  },

  "rank": {
    "top_5": [
      { "rank": 1, "type": "sequential_change", "metric": "avg_glucose", "segment": "Glucose rise 0:00→6:00", "change_pct": 25.3, "p_value": "0.0001", "score": "4.12" },
      { "rank": 2, "type": "temporal_distribution", "metric": "avg_glucose", "segment": "Evening (17:00-21:00)", "change_pct": 20.3, "p_value": "0.0003", "score": "3.45" }
    ]
  },

  "filter": {
    "duplicates_removed": 4,
    "after_dedup": 15,
    "classified_as_discovery": 8,
    "classified_as_observation": 7,
    "all_candidates": ["...array of all candidates with fate..."]
  },

  "novel_survivors": [
    { "segment_metric": "sedentary_min", "outcome_metric": "active_calories", "change_pct": -18.3, "p_value": 0.002 }
  ],

  "narrate": { "ai_calls": 0, "template_narratives": 15 },

  "persist": {
    "discoveries_inserted": 8,
    "observations_inserted": 7,
    "data_quality_alerts_upserted": 1
  },

  "elapsed_ms": 3847,
  "exit_reason": "success"
}

7.6.7 Exit Reasons (expanded for v3)

| Exit reason | When | Diagnostics available |
| --- | --- | --- |
| success | ≥1 candidate passed BH | Full diagnostics |
| insufficient_data | < 14 total rows fetched | fetch only |
| insufficient_aggregated_data | < 14 daily rows after aggregation | fetch + cleanse + aggregate |
| no_candidates | Tests ran but no candidates met thresholds | Full through scan |
| no_bh_survivors | Candidates generated but all removed by BH | Full through correct |
| partial_analyzer_failure | ≥1 analyzer failed but others succeeded, ≥1 BH survivor | Full (with error details per analyzer) |
| all_analyzers_failed | All analyzers threw errors | Full through discover + error details |
| error | Unhandled exception in pipeline | Partial (up to failure point) |

The partial_analyzer_failure exit reason is new and important — it signals that results are valid but incomplete. The admin dashboard should flag these runs for investigation.
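The precedence among these reasons can be sketched as a simple cascade. The RunState shape and field names are assumptions for illustration; the production pipeline tracks this state across steps:

```typescript
// Sketch: map end-of-run state to a v3 exit reason. Order matters — data
// sufficiency is checked first, then analyzer health, then candidate fate.
interface RunState {
  rowsFetched: number;      // total rows across all 6 tables
  dailyRows: number;        // rows after aggregation
  candidates: number;       // candidates produced by step 6
  bhSurvivors: number;      // candidates surviving step 7
  analyzersFailed: number;
  analyzersTotal: number;
}

function exitReason(s: RunState): string {
  if (s.rowsFetched < 14) return 'insufficient_data';
  if (s.dailyRows < 14) return 'insufficient_aggregated_data';
  if (s.analyzersTotal > 0 && s.analyzersFailed === s.analyzersTotal) return 'all_analyzers_failed';
  if (s.candidates === 0) return 'no_candidates';
  if (s.bhSurvivors === 0) return 'no_bh_survivors';
  if (s.analyzersFailed > 0) return 'partial_analyzer_failure';
  return 'success';
}
```

Putting partial_analyzer_failure after the survivor check encodes the table's "≥1 BH survivor" condition: a run with failed analyzers and zero survivors reports no_bh_survivors instead.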


8. Candidate Schema & Narration

8.1 Candidate Schema Changes

8.1.1 Date Membership (critical foundation for multi-factor)

The existing PatternCandidate stores group_a_values and group_b_values as raw number arrays. This loses which dates produced those values, making it impossible to further stratify groups by a second factor for multi-factor insights (see Section 11).

v3 adds date membership to all segment/lagged candidates:

interface PatternCandidate {
  // ... existing fields ...

  /** Dates in Group A — enables stratified multi-factor analysis.
   *  Stored as YYYY-MM-DD strings matching DailyMetricRow.date */
  group_a_dates?: string[];
  /** Dates in Group B */
  group_b_dates?: string[];

  // group_a_values and group_b_values are RETAINED for backward compatibility
  // but group_a_dates is the source of truth for group membership.
}

Why this matters: To test "high steps + early bedtime → best HRV", the stratified analyzer needs to take the high-steps Group A dates, look up each date's bedtime_hour in DailyMetricRow[], and further split. Without dates, this requires re-running the segmentation from scratch.

Cost: ~60 date strings per candidate (e.g., ["2026-01-15", "2026-01-18", ...]). Negligible vs the existing value arrays which store the same count of floats. In the persisted metrics_impact JSONB, dates are NOT stored — only the computed statistics. Dates are ephemeral within the pipeline run.
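The stratification that date membership enables can be sketched as below. The DailyMetricRow shape and the median split are illustrative, and bedtime_hour is a hypothetical second factor:

```typescript
// Sketch: take one group's dates, look up a second factor per date, and
// split the group at that factor's median — the Section 11 building block.
interface DailyMetricRow {
  date: string;                              // YYYY-MM-DD
  metrics: Record<string, number | null>;
}

function stratifyByMedian(
  groupDates: string[],
  rows: DailyMetricRow[],
  secondFactor: string,                      // e.g. 'bedtime_hour' (hypothetical)
): { low: string[]; high: string[] } {
  const byDate = new Map(rows.map(r => [r.date, r]));
  const dated = groupDates
    .map(d => ({ date: d, v: byDate.get(d)?.metrics[secondFactor] ?? null }))
    .filter((x): x is { date: string; v: number } => x.v !== null);
  if (dated.length < 2) return { low: [], high: [] };
  const sorted = [...dated].sort((a, b) => a.v - b.v);
  const median = sorted[Math.floor(sorted.length / 2)].v;
  return {
    low: dated.filter(x => x.v < median).map(x => x.date),
    high: dated.filter(x => x.v >= median).map(x => x.date),
  };
}
```

Without group_a_dates, this lookup is impossible: the raw value arrays alone cannot say which day contributed which value.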

8.1.2 New Candidate Types

The four new analyzers produce candidates that extend the existing PatternCandidate interface. The key fields required by the downstream pipeline are already present in all candidates:

// Fields required by BH correction, ranking, and classification:
{
  type: string;          // 'temporal_distribution' | 'excursion_cluster' | etc.
  metric_key: string;
  metric_label: string;
  unit: string;
  p_value: number;       // from MW-U, Wilcoxon, or chi-squared
  effect_size: number;   // Cohen's d, signed-rank r, or Cramér's V
  change_pct: number;    // relative magnitude
  weeks_observed: number;
  segment_label: string; // human-readable description for narration
}

Each analyzer normalizes its test statistic into these common fields so that BH correction, composite scoring, and classification work identically across all candidate types.

8.2 Narrative Templates

const NARRATIVE_TEMPLATES = {
  temporal_distribution: {
    discovery: {
      time_of_day: {
        title: (c) =>
          `Your ${c.metric_label} is consistently ${c.change_pct > 0 ? 'higher' : 'lower'} in the ${c.bucket_label}`,
        summary: (c) =>
          `Your ${c.metric_label} during ${c.bucket_label} averages ${fmt(c.bucket_mean)} ${c.unit}, ` +
          `which is ${fmt(Math.abs(c.change_pct))}% ${c.change_pct > 0 ? 'higher' : 'lower'} ` +
          `than your overall average of ${fmt(c.overall_mean)} ${c.unit}.`,
      },
      day_of_week: {
        title: (c) =>
          `Your ${c.metric_label} tends to be ${c.change_pct > 0 ? 'higher' : 'lower'} on ${c.bucket_label}s`,
        summary: (c) =>
          `On ${c.bucket_label}s, your ${c.metric_label} averages ${fmt(c.bucket_mean)} ${c.unit} — ` +
          `${fmt(Math.abs(c.change_pct))}% ${c.change_pct > 0 ? 'higher' : 'lower'} than other days.`,
      },
    },
    observation: {
      // Same structure with softer language: "appears to be", "may be"
    },
  },

  excursion_cluster: {
    discovery: {
      title: (c) =>
        `Your ${c.excursion_direction === 'high' ? 'high' : 'low'} ${c.metric_label} events ` +
        `cluster in the ${c.peak_bucket_label}`,
      summary: (c) =>
        `${c.excursions_in_peak} of your ${c.total_excursions} ` +
        `${c.excursion_direction === 'high' ? 'high' : 'low'} ${c.metric_label} events ` +
        `(${fmt((c.excursions_in_peak / c.total_excursions) * 100)}%) occurred during ${c.peak_bucket_label}. ` +
        `If events were evenly distributed, you'd expect ~${fmt(c.expected_if_uniform)}.`,
    },
  },

  sequential_change: {
    discovery: {
      title: (c) =>
        `Your ${c.metric_label} consistently ${c.mean_delta > 0 ? 'rises' : 'falls'} ` +
        `between ${c.from_bucket} and ${c.to_bucket}`,
      summary: (c) =>
        `Your ${c.metric_label} ${c.mean_delta > 0 ? 'rises' : 'falls'} by an average of ` +
        `${fmt(Math.abs(c.mean_delta))} ${c.unit} between ${c.from_bucket} and ${c.to_bucket}, ` +
        `observed on ${fmt(c.consistency_pct)}% of the ${c.days_assessed} days assessed.`,
    },
  },

  stability_trend: {
    discovery: {
      title: (c) =>
        `Your ${c.variability_metric} is ${c.stability_direction}`,
      summary: (c) =>
        `The variability of your ${c.metric_label} has ` +
        `${c.stability_direction === 'stabilizing' ? 'decreased' : 'increased'} from ` +
        `${fmt(c.earlier_variability)}% to ${fmt(c.recent_variability)}% over the analysis period.`,
    },
  },
};

8.3 Metrics Impact Schema

The metrics_impact JSONB field in user_discoveries already supports arbitrary structure. New candidate types add their specific proof:

// Temporal distribution
metrics_impact: [{
  metric_key: 'glucose_reading',
  metric_label: 'Glucose',
  baseline_value: 118,          // overall mean
  observed_value: 142,          // bucket mean
  change_pct: 20.3,
  magnitude: 'high',
  unit: 'mg/dL',
  pattern_type: 'temporal_distribution',
  composite_score: 3.45,
  // NEW fields for this type:
  temporal_variant: 'time_of_day',
  bucket_label: 'Evening (17:00-21:00)',
  bucket_count: 4280,
  total_count: 17160,
}]

// Sequential change
metrics_impact: [{
  metric_key: 'glucose_reading',
  metric_label: 'Glucose',
  baseline_value: 95,           // from_bucket mean
  observed_value: 119,          // to_bucket mean
  change_pct: 25.3,
  magnitude: 'high',
  unit: 'mg/dL',
  pattern_type: 'sequential_change',
  composite_score: 4.12,
  from_bucket: '0:00-3:00',
  to_bucket: '6:00-9:00',
  consistency_pct: 72,
  days_assessed: 54,
}]

8.4 AI Narration (Step 10: NARRATE)

8.4.1 The Problem

The current template-based narration produces titles like:

"After Low Time in Bed days (below your P25: <6.6 hours) linked to next-day Sleep Debt +52.7%" "Declining Min Glucose trend -8.3%"

These are accurate but written for a statistician. Users see percentile labels (P25), metric key names (Min Glucose), change percentages with no context, and pattern types (lagged_effect) that mean nothing to them. The result: discoveries that could be actionable are ignored because users can't understand what they mean.

What users need to see:

"When you spend less than about 6½ hours in bed, your sleep debt the next day tends to be significantly higher — about 53% more than usual. Protecting your bedtime window could help."

"Your lowest daily glucose readings have been gradually dropping over the past several weeks. This may reflect changes in your eating schedule or activity habits."

8.4.2 Design

All discoveries and observations get AI narration. The typical run produces 10-30 discoveries after BH family correction and deduplication. At ~2 seconds per AI call, that's 20-60 seconds — well within the 150-second Supabase edge function limit.

The AI produces four fields per discovery:

| Field | Purpose | Stored in |
| --- | --- | --- |
| title | 8-12 word plain-language headline | user_discoveries.title |
| summary | 1-2 sentence explanation a non-expert can understand | user_discoveries.summary |
| detailed_analysis | 3-5 sentence deeper explanation with context | user_discoveries.detailed_analysis |
| suggested_experiment | Concrete action: what to try, for how long, what to watch | user_discoveries.metrics_impact (or dedicated column) |

Template fallback: If an AI call fails (timeout, rate limit, JSON parse error), the existing template narration is used as a fallback. The discovery is still persisted — it just has less polished language. The failure is logged in diagnostics.

Compliance: All AI output passes through autoCorrectCompliance() which enforces wellness-only language (no diagnostic/treatment claims, correlational framing, required disclaimers). This is already implemented in _shared/compliance/output-validator.ts.

8.4.3 Prompt Design

The system prompt must:

  1. Explain that the user is NOT a statistician
  2. Ban jargon: no "P25", "percentile", "effect size", "p-value", "BH correction"
  3. Translate metric names: "Min Glucose" → "your lowest glucose reading", "Sleep Debt" → "accumulated sleep deficit"
  4. Require plain numbers: "less than about 6½ hours" not "below your P25: <6.6 hours"
  5. Require context: explain WHY the pattern matters, not just WHAT it is
  6. Cover all 7 pattern types with appropriate framing
  7. Produce a suggested experiment that's concrete and time-bounded
  8. Follow compliance guidelines (correlational language, no medical claims)

The user prompt provides the raw statistical data (metric key, segment label, group means, change %, p-value, effect size, weeks observed, sample sizes, pattern type) — the AI's job is to translate this into human language.

8.4.4 Batched AI Call (Not Per-Discovery)

Instead of calling the AI once per discovery (20 calls × 2s = 40s), we send all discoveries in a single batched call (~3-4 seconds total).

How it works:

  1. Build one prompt containing all discoveries as a numbered list, each with its statistical data
  2. The AI returns a JSON array with the same ordering — item 1 narrates discovery 1, etc.
  3. Parse the response and map each narration back to its discovery by array index

Why this doesn't cross-contaminate:

  • Each discovery is a clearly delimited numbered block in the prompt
  • The response format requires a JSON array with one entry per input discovery
  • Each item is narrated from its own delimited data block, so one discovery's content rarely bleeds into another's narration
  • If a single item's narration is weak, it doesn't affect the others

Token budget: 20 discoveries × ~100 tokens input each = ~2000 tokens input. Response: ~150 tokens per discovery × 20 = ~3000 tokens output. Total: ~5000 tokens — well within GPT-4o-mini's context window.

Failure handling: If the single batch call fails (timeout, rate limit, malformed JSON), ALL discoveries fall back to template narration. This is better than 20 individual calls where each can fail independently, because:

  • One failure mode to handle, not 20
  • 3 seconds of latency, not 40
  • One retry opportunity if needed
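The batch-and-map-back flow can be sketched as below. The prompt format, Narration shape, and the count-mismatch rule are assumptions; the production prompt carries far more statistical context per item:

```typescript
// Sketch: one numbered prompt in, one JSON array out, index-aligned mapping.
// Any parse or count failure returns null → template fallback for ALL items.
interface Narration { title: string; summary: string }

function buildBatchPrompt(
  discoveries: Array<{ metric_label: string; change_pct: number }>,
): string {
  return discoveries
    .map((d, i) => `${i + 1}. metric=${d.metric_label} change=${d.change_pct.toFixed(1)}%`)
    .join('\n');
}

function mapNarrations(raw: string, expected: number): Narration[] | null {
  try {
    const parsed = JSON.parse(raw);
    // A count mismatch means we can't trust index alignment — fall back wholesale
    if (!Array.isArray(parsed) || parsed.length !== expected) return null;
    return parsed as Narration[];
  } catch {
    return null; // malformed JSON → template fallback
  }
}
```

Rejecting the whole response on a count mismatch is deliberate: mapping by array index is only safe when the AI returned exactly one entry per input discovery.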

8.4.5 Experiment Recommendations

Current state: The user_discoveries table has a suggested_experiment_id column (FK to experiment_catalog) that is always null. The pattern spotter never queries the experiment catalog or the user's experiment history.

The opportunity: The experiment_catalog table has 8 curated experiments, each with primary_metrics — a JSON array of metric keys the experiment measures. We can match discoveries to relevant experiments:

| Discovery metric | Matching experiments |
| --- | --- |
| sleep_deep_pct | early-bedtime, alcohol-elimination, magnesium-before-bed, digital-sunset |
| hrv | alcohol-elimination, caffeine-curfew, morning-sunlight |
| resting_hr | alcohol-elimination, caffeine-curfew, morning-sunlight |
| sleep_duration | early-bedtime, consistent-wake-time, magnesium-before-bed |
| avg_glucose | post-meal-walk |

How experiment matching works in the pipeline:

  1. At pipeline start, fetch the experiment catalog (experiment_catalog where is_active = true)
  2. Fetch the user's experiment history (experiments where user_id = X and status IN ('active', 'completed'))
  3. For each discovery, find catalog experiments whose primary_metrics overlap with the discovery's metric_key
  4. Exclude experiments the user has already completed or is currently running
  5. Pass the matching experiment names + descriptions to the AI as part of the narration prompt
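Steps 3-4 amount to an overlap check plus an exclusion filter. A minimal sketch, assuming simplified row shapes (the real experiment_catalog rows carry more fields):

```typescript
// Illustrative shape — assumed, not the actual experiment_catalog row type.
interface CatalogExperiment {
  id: string;
  name: string;
  primary_metrics: string[]; // JSON array of metric keys
}

/** Catalog experiments whose primary_metrics include the discovery's metric,
 *  minus experiments the user has already run or completed. */
function matchExperiments(
  discoveryMetricKey: string,
  catalog: CatalogExperiment[],
  userExperimentIds: Set<string>, // ids of active + completed experiments
): CatalogExperiment[] {
  return catalog
    .filter((e) => e.primary_metrics.includes(discoveryMetricKey))
    .filter((e) => !userExperimentIds.has(e.id));
}
```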

The AI's role in experiment recommendation:

  • The AI sees: "Here is a discovery about sleep_deep_pct. Here are relevant experiments the user hasn't tried: [early-bedtime, magnesium-before-bed]."
  • The AI picks the most relevant one and incorporates it naturally: "Consider trying the Early Bedtime experiment — moving your bedtime 45 minutes earlier for 2 weeks to see if it improves your deep sleep."
  • The pipeline sets suggested_experiment_id to the matching catalog experiment ID

If no catalog experiment matches: The AI generates a free-form suggestion based on the discovery's content (same as current template behavior, but in natural language).

8.4.6 Latency and Cost

| Metric | Value |
|---|---|
| AI calls per run | 1 (batched — all discoveries in one call) |
| Time per call | ~3-4 seconds (GPT-4o-mini via OpenAI proxy) |
| Added latency | 3-4 seconds per user |
| Pipeline base time | 3-6 seconds |
| Total with narration | 6-10 seconds (vs 25-65s with per-discovery calls) |
| Cost per call | ~$0.005 (~2000 input + 3000 output tokens) |
| Cost per run | ~$0.005 per user per day |

8.4.7 Error Handling

Build batched prompt with all discoveries + experiment catalog context
try:
  Single AI call → parse JSON array → compliance check each item
  For each discovery: use AI text, set suggested_experiment_id if matched
  Record: ai_model, ai_prompt_version on all discovery rows
catch:
  Log warning: "[PatternSpotter] Batch AI narration failed: {error}"
  Fall back to template text for ALL discoveries
  Record: ai_model = null on all rows (indicates template was used)

8.4.8 Diagnostics

{
  "narrate": {
    "mode": "batch",
    "discoveries_narrated": 22,
    "ai_call_succeeded": true,
    "template_fallback": false,
    "experiments_matched": 8,
    "experiments_excluded_already_done": 2,
    "elapsed_ms": 3400
  }
}

8.4.9 Implementation Scope

Files to modify:

  • ai-engine/engines/pattern-spotter.ts — Step 10 (NARRATE): batched AI call with experiment matching + template fallback
  • ai-engine/prompts/pattern-detection.ts — Rewrite system prompt: plain-language rules, jargon ban, all 7 types, experiment recommendation instructions
  • ai-engine/engines/pattern-spotter.test.ts — Update narration test to verify batch call + experiment matching

New queries in pipeline (Step 1 FETCH or Step 10 NARRATE):

  • experiment_catalog — fetch active experiments with primary_metrics
  • experiments — fetch user's active + completed experiments

Files unchanged:

  • _shared/compliance/output-validator.ts — already handles wellness language correction
  • _shared/pattern-ranker.ts — no changes needed
  • All analyzer modules — no changes needed

Status: DONE (2026-03-19, batched narration live)

  • Batched AI narration live: single GPT-4o-mini call per pipeline run
  • All discoveries and observations get plain-language titles and summaries
  • OpenAI json_object wrapper format handled (extracts array from { "discoveries": [...] })
  • Template fallback on AI failure — 21 vitest passing
  • Admin dashboard updated to show v3 narration diagnostics (mode, patterns narrated, AI succeeded)
  • Experiment matching NOT YET IMPLEMENTED — suggested_experiment_id still null (deferred to next iteration)

8.5 Deduplication Improvements (Step 9: FILTER)

8.5.1 The Problem

Empirical review of user 3597587c's 67 active discoveries (2026-03-23) revealed ~34 are duplicates or near-duplicates that the current dedup fails to catch. Three root causes:

Problem 1: Trends re-surface daily with different change_pct values

The same trend (e.g., "workout frequency increasing") appears 4 times across 4 days because the lookback window shifts and the change_pct swings (79% → 2700% → 1300% → 833%). The dedup requires change_pct within ±5 percentage points — but these are 10-100x apart.

Examples from real data:

  • "Workout Frequency Has Increased" — 4 copies (March 19-22)
  • "HRV Decreasing" — 2 copies per metric variant
  • "Steps Increasing" — 4 copies across steps/intraday_steps
  • "High Activity Increasing" — 3 copies

Problem 2: Coupled segment metrics produce mirror discoveries

sleep_duration and time_in_bed segment days nearly identically (r≈0.95). Every discovery from one is duplicated by the other — same outcome, same change_pct, different segment name. The dedup requires matching segment_metric_key, so it treats them as distinct.

7 mirror pairs found:

  • "Wider Glucose Range on Low Sleep Days" / "...on Low Time in Bed Days" (+37%)
  • "Higher Max Glucose on Low Sleep Days" / "...on Low Time in Bed Days" (+20.8%)
  • "More Sedentary on Low Sleep Days" / "...on Low Time in Bed Days" (+16.3%)
  • etc.

Problem 3: Metric aliases produce identical discoveries

For WHOOP users, hrv is literally hrv_daily (the derived metric falls back to it). steps equals intraday_steps (the daily sum of the same readings). Each discovery appears twice — once per metric variant.

8 alias duplicates found:

  • "HRV Decreasing" / "Daily HRV Decreasing" (both -7.6%)
  • "Improved HRV After Low Activity" / "Higher Daily HRV After Low Activity" (both +15.3%)
  • "Steps Increasing" / "Intraday Steps Increasing" (both ~30-35%)

8.5.2 The Fixes

All three fixes are changes to areDuplicates() and deduplicatePatterns() in pattern-ranker.ts. No pipeline or analyzer changes needed.

Fix 1: Trend dedup by metric_key + direction (ignore change_pct)

For trend candidates, the current rule metric_key + segment_metric_key + change_pct ±5% is too narrow. Change_pct for trends varies wildly as the lookback window shifts.

New rule for trends: Two trend candidates are duplicates if:

  • Same metric_key (or aliases — see Fix 3)
  • Same direction (both increasing or both decreasing)
  • change_pct is ignored for trends

This also applies to existing discovery dedup: if an active discovery with the same metric_key and direction already exists, the new trend is a duplicate regardless of change_pct.

Fix 2: Outcome-based dedup for segment comparisons

Two segment comparison candidates are duplicates if:

  • Same metric_key (outcome)
  • change_pct within ±10% (wider tolerance for cross-segment dedup)
  • segment_metric_key can differ (this is the key relaxation)

This catches the sleep_duration/time_in_bed mirrors. When "Low Sleep Days → glucose_range +37%" and "Low Time in Bed Days → glucose_range +37%" both survive BH, the second is deduped because the outcome and change match.

The wider ±10% tolerance (vs the current ±5%) accounts for slight differences when two correlated segment metrics don't segment days exactly the same way.

Fix 3: Metric alias groups

Define alias sets where metrics produce identical or near-identical values for a given user:

const METRIC_ALIAS_GROUPS: string[][] = [
  ['hrv', 'hrv_daily'],            // hrv = hrv_daily ?? hrv_sleep (WHOOP users: always hrv_daily)
  ['steps', 'intraday_steps'],     // daily total = sum of intraday
  ['sleep_duration', 'time_in_bed'], // definitionally coupled (r≈0.95)
];

During dedup, two metric keys are considered equivalent if they belong to the same alias group. This means:

  • "HRV Decreasing" and "Daily HRV Decreasing" → same finding
  • "Steps Increasing" and "Intraday Steps Increasing" → same finding
  • Any discovery with outcome sleep_duration matches against one with time_in_bed
  • Any segment using sleep_duration is equivalent to one using time_in_bed

The alias check is used in BOTH within-batch dedup AND existing-discovery dedup.

8.5.3 Updated areDuplicates Logic

const METRIC_ALIAS_GROUPS: string[][] = [
  ['hrv', 'hrv_daily'],
  ['steps', 'intraday_steps'],
  ['sleep_duration', 'time_in_bed'],
];

// Pre-computed lookup: metric_key → canonical representative
const METRIC_CANONICAL: Map<string, string> = new Map();
for (const group of METRIC_ALIAS_GROUPS) {
  const canonical = group[0]; // first entry is the canonical
  for (const key of group) {
    METRIC_CANONICAL.set(key, canonical);
  }
}

function canonicalKey(metricKey: string): string {
  return METRIC_CANONICAL.get(metricKey) ?? metricKey;
}

function areDuplicates(a: PatternCandidate, b: PatternCandidate): boolean {
  const aOutcome = canonicalKey(a.metric_key);
  const bOutcome = canonicalKey(b.metric_key);

  // Rule 1: Trend dedup — same metric + same direction (ignore change_pct)
  if (a.type === 'trend' && b.type === 'trend') {
    return aOutcome === bOutcome && a.direction === b.direction;
  }

  // Rule 2: Outcome-based dedup — same outcome + similar change (segment can differ)
  if (aOutcome === bOutcome) {
    // Tighter ±5 when the canonical segments match (same key or aliases); wider ±10 when they differ
    const aSegment = canonicalKey(a.segment_metric_key ?? '');
    const bSegment = canonicalKey(b.segment_metric_key ?? '');
    const threshold = aSegment === bSegment ? 5 : 10;
    if (Math.abs(a.change_pct - b.change_pct) <= threshold) {
      return true;
    }
  }

  return false;
}

8.5.4 Existing Discovery Dedup Enhancement

The existing discovery dedup also needs to understand aliases and trend direction. The existingForDedup data currently only carries metric_key and change_pct. To support trend dedup, it needs the pattern_type and direction (if trend) from metrics_impact.

Update the existing discovery query to include pattern_type from metrics_impact:

const existingForDedup = allExisting.map(d => ({
  metrics_impact: d.metrics_impact as Array<{
    metric_key: string;
    change_pct: number;
    pattern_type?: string;
  }> | null,
  discovery_type: d.discovery_type,
  title: d.title, // title contains direction hint for trends
}));

For trend matching against existing: if the existing discovery's pattern_type === 'trend' and the canonical metric_key matches, treat as duplicate regardless of change_pct.

8.5.5 Impact Estimate

For user 3597587c (67 discoveries):

  • Fix 1 (trend dedup): eliminates ~12 repeat trends
  • Fix 2 (outcome-based dedup): eliminates ~14 mirror segment discoveries
  • Fix 3 (metric aliases): eliminates ~8 alias duplicates

Total: ~34 eliminations → 67 → ~33 unique discoveries

For user 73f1a17e (7 discoveries):

  • 1 duplicate removed (repeat lagged effect)
  • 7 → 6 unique discoveries

8.5.6 Data Cleanup

After deploying the fix, existing duplicate discoveries need to be cleaned. Two approaches:

Option A: Delete all and re-run

DELETE FROM user_discoveries
WHERE discovery_type IN ('unenrolled_pattern', 'observation')
  AND status IN ('new', 'viewed');
-- Then invoke spot-patterns-cron to regenerate

Option B: Keep highest-ranked of each duplicate set (preserves viewed status) More complex — requires a script to identify duplicate groups and delete all but the best.

Recommendation: Option A (delete + re-run). The AI narration will regenerate fresh text, and the new dedup logic will prevent duplicates from returning.

8.5.7 Implementation Scope

Files to modify:

  • _shared/pattern-ranker.ts — rewrite areDuplicates(), add alias groups, update existing dedup
  • _shared/pattern-ranker.test.ts — new tests for trend dedup, outcome-based dedup, alias groups
  • ai-engine/engines/pattern-spotter.ts — update existingForDedup to include pattern_type

Files unchanged:

  • All analyzers, metric-discovery, blacklist, BH families — no changes needed

Status: DONE (2026-03-23)

  • areDuplicates() rewritten with three rules: trend direction dedup, outcome-based cross-segment dedup, metric aliases
  • canonicalKey() exported for alias resolution (hrv↔hrv_daily, steps↔intraday_steps, sleep_duration↔time_in_bed)
  • deduplicatePatterns() updated: existing discovery matching uses aliases + trend direction awareness
  • 32 pattern-ranker tests passing (14 new), 21 pattern-spotter vitest passing
  • Expected reduction: ~67 → ~33 unique discoveries for user 3597587c after re-run

9. Migration Strategy

9.1 Phased Rollout

Phase 0: Timezone normalization (see timezone-strategy.md) — DONE

  • Phase A: DONE — Migration shipped, mobile app pushes device timezone on launch via timezoneRegistration.ts
  • Phase B: DONE — All 4 sync functions extract timezone (WHOOP offset in vendor_metadata + local activity_date, Fitbit IANA from profile API, Libre derived offset, Oura from Personal Info API)
  • Phase C: DONE — daily-aggregation.ts uses LocalTimeExtractors closure factory; glucose uses display_time
  • Phase D: DONE — UTC fallback, Pacific, DST, Tokyo, Kolkata tests (28 aggregation tests pass)
  • Phase E: UNNECESSARY — Mobile app auto-pushes timezone on launch; UTC fallback works for users who haven't opened the app

Phase 1: Foundation — DONE (41 + 39 = 80 tests)

  • statistical-tests.ts: Wilcoxon signed-rank, chi-squared uniformity, Spearman ρ, computeRanks (20 new tests, 41 total)
  • metric-type-catalog.ts: 17 type patterns, label generation, classifyMetric/getLabel/classifyWithLabel (39 tests)
  • Migration 20260320000000_create_insight_engine_blacklist.sql: table + 24 seed entries

Phase 2: Blacklist seeding — DONE (2026-03-18, 212 entries across 3 migrations)

  • Seed: 24 entries (same_source glucose pairs, definitional sleep pairs, steps↔distance)
  • Wave 1: 112 entries (activity cluster — movement, energy, intensity, workout volume cross-pairs)
  • Wave 2: 76 entries (active_minutes, low_active_min, intraday HR proxies, workout_count, HRV↔recovery, hrv↔hrv_daily)
  • BH correction split into 3 families (same-day, lagged, trends) to fix hitchhiker effect — see Section 7.5
  • Remaining follow-ups: run the one-time Spearman correlation analysis script against active users, review pairs with median ρ > 0.80, insert confirmed tautological pairs into insight_engine_blacklist, and build an admin dashboard "Blacklist" tab for ongoing management

Phase 3: Discovery layer — DONE (20 + 19 = 39 tests)

  • metric-discovery.ts: Data-driven metric discovery, type classification, eligibility, segment generation, blacklist checking (20 tests)
  • pattern-spotter.ts: Registry replaced with discovery, LAGGED_SEGMENT_OUTCOMES replaced with cross-domain regex, blacklist loaded from DB, group_a_dates/group_b_dates on candidates (19 vitest)
  • metric-registry.ts: Deprecated canSegment/canOutcome/excludeOutcomes helpers, updated doc header to reflect v3 label-lookup role

Phase 4: Intraday analyzers — DONE (51 tests)

  • intraday-bundle.ts: Normalize raw readings to local timestamps (9 tests)
  • temporal-distribution-analyzer.ts: Time-of-day MW-U per 4-hour bucket (9 tests)
  • excursion-cluster-analyzer.ts: Chi-squared clustering of out-of-range events (11 tests)
  • sequential-change-analyzer.ts: Wilcoxon signed-rank on daily deltas, adaptive pair selection (11 tests)
  • stability-trend-analyzer.ts: CV% trend via MW-U earlier vs recent (11 tests)
  • pattern-spotter.ts: All 4 analyzers wired in with try/catch isolation, narrative templates for 4 new candidate types (in vitest)

Phase 5: Cleanup — DONE

  • metric-registry.ts: getSegmentMetrics, getOutcomeMetrics, getExcludedOutcomes marked @deprecated
  • metric-registry.ts doc header updated to reflect v3 role as label/unit lookup only
  • MetricDef fields retained for segment-generator.ts backward compatibility (not removed — would break 21 segment-generator tests with no benefit)
  • Novel_survivor review and admin dashboard updates deferred to post-deployment

Phase 6: Multi-factor insights (future, see Section 11) — NOT STARTED

  • Foundation laid: group_a_dates/group_b_dates on PatternCandidate
  • Requires v3 single-factor to be validated in production first

Implementation Summary

242 tests passing, 0 failures across 14 test files:

| Module | File | Tests |
|---|---|---|
| LocalTimeExtractors | local-time.test.ts | 10 |
| Timezone utils | timezone-utils.test.ts | 8 |
| Statistical tests | statistical-tests.test.ts | 41 |
| Metric type catalog | metric-type-catalog.test.ts | 39 |
| Daily aggregation | daily-aggregation.test.ts | 28 |
| Metric discovery | metric-discovery.test.ts | 20 |
| Pattern ranker | pattern-ranker.test.ts | 18 |
| Intraday bundle | intraday-bundle.test.ts | 9 |
| Temporal distribution | temporal-distribution-analyzer.test.ts | 9 |
| Excursion clustering | excursion-cluster-analyzer.test.ts | 11 |
| Sequential change | sequential-change-analyzer.test.ts | 11 |
| Stability trend | stability-trend-analyzer.test.ts | 11 |
| Segment generator (v2, retained) | segment-generator.test.ts | 21 |
| Metric registry (v2, retained) | metric-registry.test.ts | 13 |
| Pattern spotter (vitest) | pattern-spotter.test.ts | 19 |

Remaining for deployment

  1. Apply DB migration: supabase db push or supabase migration up (creates insight_engine_blacklist table + seed data)
  2. Deploy edge functions: supabase functions deploy spot-patterns-cron (includes all new shared modules)
  3. Verify via logs: Confirm the 12-step pipeline runs end-to-end with discover step replacing generate, intraday analyzers executing, blacklist loaded
  4. Phase 2 (blacklist seeding): Run correlation analysis script after first production runs to identify additional tautological pairs from BH survivors
  5. Phase 0E (backfill): Run timezone backfill for existing users without timezone in profile

9.2 Backward Compatibility

  • user_discoveries table schema does not change (metrics_impact is JSONB, supports arbitrary structure)
  • insight_engine_runs diagnostics JSONB is additive (new fields, no removed fields)
  • Mobile discovery UI renders based on discovery_type + metrics_impact — new pattern types may need mobile UI updates to display optimally, but will degrade gracefully (title + summary always available)
  • Existing discoveries are not affected

9.3 Performance Considerations

| Concern | Impact | Mitigation |
|---|---|---|
| Blacklist table query | 1 query, ~212 rows (current seed) | < 10ms. Cached in Set for O(1) lookup. |
| More lagged tests (~150 vs ~42) | ~3.5× more MW-U tests | < 100ms. MW-U is fast. |
| Intraday analyzers on ~17,000 glucose readings | 4 analyzers × bucketing + statistics | < 500ms. Pre-bucketed, standard tests. |
| BH correction on more candidates | Linear in candidate count | Negligible. |
| Metric discovery + type classification | Scan ~60 metric keys, match patterns | < 5ms. Negligible. |
| Total estimated overhead | | < 1 second additional per run |

The v2 pipeline currently runs in 2-5 seconds. The v3 additions should bring it to 3-6 seconds — well within edge function limits. The removal of runtime Spearman correlation computation (~50ms) slightly offsets the new analyzer costs.


10. Open Questions

10.1 Resolved

Q: How to handle metrics that don't match any type pattern? A: Conservative defaults — no plausibility bounds (accept all values), neutral directionality, no optimal range. The metric still gets analyzed; we just can't interpret directionality or detect excursions against a clinical range. Personal P10/P90 serves as the excursion threshold.

Q: What about the existing metric-registry.ts file? A: It becomes a label/unit lookup table for display purposes and an extraction config for the aggregation layer. It is no longer consulted by the analysis layer for eligibility, segmentation, or exclusion decisions.

Q: Should blacklisting be computed at runtime (Spearman ρ) or stored in DB? A: DB table. See Section 5 for full rationale. Runtime correlation is non-deterministic, non-transparent, and non-editable. The DB approach is seeded empirically, auditable, and editable without code deployments. Spearman correlation is still useful as a one-time seeding tool, not a per-run computation.

10.2 Open

Q: Blacklist seeding — what correlation threshold for the one-time analysis? The seeding script computes Spearman ρ across users and flags pairs with median ρ > threshold. Using 0.80 as the median threshold (not 0.85 as previously proposed for per-user runtime) because the median smooths out user-specific noise. Pairs between 0.70-0.80 may warrant manual review. The key difference from the previous plan: this threshold only determines what gets FLAGGED for human review, not what gets automatically blacklisted. A human confirms each entry before it's inserted.

Q: How often should novel_survivors be reviewed? Proposed: weekly during initial rollout (Phase 5), then monthly once the blacklist stabilizes. The admin dashboard should surface the count of unreviewed novel survivors as a badge/alert. Once fewer than 3 novel pairs appear per week, the blacklist is likely mature.

Q: How many time-of-day buckets? Using 6 × 4-hour blocks for temporal distribution and 8 × 3-hour blocks for sequential change detection. The tradeoff: more buckets = more granularity but more tests (BH gets stricter). Fewer buckets = less granularity but stronger per-test power. Could be data-driven: use more buckets when data density is high (glucose with 288 readings/day) and fewer when sparse (intraday HR with 96 readings/day).
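Whatever bucket count is chosen, mapping a local hour to its bucket is a one-liner. This illustrative helper (not from the analyzer code) clamps the hour = 24 edge case:

```typescript
/** Map a local hour-of-day (0–23.99) to a bucket index for a given bucket
 *  count, e.g. 6 × 4-hour blocks or 8 × 3-hour blocks. */
function bucketIndex(localHour: number, numBuckets: number): number {
  const widthHours = 24 / numBuckets;
  // Clamp so an out-of-range hour (e.g. exactly 24) stays in the last bucket.
  return Math.min(Math.floor(localHour / widthHours), numBuckets - 1);
}
```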

Q: Should intraday analyzers use UTC or local time? A: RESOLVED — see timezone-strategy.md.

All intraday analysis uses user-local time. The IntradayDataBundle normalizes all timestamps to local at construction time:

  • Glucose: display_time is already user-local. Use directly (no IANA conversion needed).
  • Fitbit intraday: Already user-local (wall-clock time, no Z suffix). Use directly.
  • Oura intraday: UTC timestamps → convert via extractLocalDate/extractLocalHour with user's IANA timezone from user_profiles.timezone.

The existing v2 bug (glucose overnight/daytime using UTC hours) is fixed as part of timezone strategy Phase C.

Action items are tracked in timezone-strategy.md Phase B-C, not duplicated here.

Q: Excursion minimum duration — how many consecutive readings? Currently proposed: ≥3 consecutive readings outside range (15 min for CGM). For HR data at 5-min intervals, this is also 15 min. Could be made proportional to the metric's sampling frequency.

Q: Should stability trend operate on specific windows (overnight, daytime) or whole-day? Proposed: both. Whole-day CV% is one metric; per-window CV% (using the same temporal buckets) provides more granular stability tracking. The analyzer runs on all windows and lets BH correction determine which are significant.

Q: Clock-time metrics and midnight wrap? bedtime_hour is stored as 0-23.99 (e.g., 23.5 = 11:30 PM, 0.5 = 12:30 AM). Percentile-based segmentation breaks at the midnight boundary: someone at 23:30 and someone at 00:30 are 1 hour apart but 23 hours apart numerically. P25/P75 splits produce nonsensical groups when bedtimes span midnight. Fix: normalize clock-time metrics to a signed offset from midnight (e.g., 23:30 → -0.5, 00:30 → +0.5, 22:00 → -2.0) before computing percentiles and segments. This affects bedtime_hour primarily; wake_hour rarely crosses midnight so is less impacted. This bug exists in v2 today and should be fixed in v3.
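A minimal sketch of the proposed normalization. The 12:00 pivot is an assumption suited to bedtimes, which cluster around midnight; it would not be appropriate for metrics like wake_hour that cluster mid-morning:

```typescript
/** Normalize a clock hour (0–23.99) to a signed offset from midnight so that
 *  percentile splits behave across the midnight boundary:
 *  23.5 (11:30 PM) → -0.5, 0.5 (12:30 AM) → +0.5, 22.0 → -2.0. */
function signedOffsetFromMidnight(clockHour: number): number {
  // Hours from noon onward are treated as "before midnight" (negative offset).
  return clockHour >= 12 ? clockHour - 24 : clockHour;
}
```

With this applied, 23:30 (-0.5) and 00:30 (+0.5) are 1 apart rather than 23, so P25/P75 splits group them sensibly.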

Q: Mobile UI for new candidate types? The discovery detail screen (discovery/[id].tsx) renders based on metrics_impact. New pattern types (temporal_distribution, sequential_change, etc.) would benefit from custom visualizations (time-of-day charts, daily profile graphs). But they degrade gracefully to title + summary + metric cards. Mobile UI updates can follow as a separate workstream.


11. Multi-Factor Insights (Future: Phase 6)

11.1 The Vision

Single-factor: "High steps days → 12% better HRV" Multi-factor: "High steps + early bedtime → 20% better HRV (steps alone: 12%, bedtime alone: 8%)"

Users want to know which combinations of behaviors produce the best outcomes. This is interaction effect analysis — the compound effect of two factors is greater (or different) than either alone.

11.2 Why Not Now

Multi-factor requires single-factor to be working well first. The single-factor BH survivors become the input for multi-factor analysis. Building multi-factor before v3 single-factor is validated would be premature.

However, the architectural foundation must be laid now — specifically, storing date membership in candidates (Section 8.1.1). Retrofitting this later would require re-running all statistical tests to capture which dates belong to each group. This is the one change that must ship with v3.

11.3 Approaches Considered

Approach A: Combinatorial Segment Generation (rejected)

Generate all pairs of existing segments as compound segments and test every combination.

  • 11 segmentable metrics × 3 split types = 33 segments
  • Pairwise compounds: 33 × 32 / 2 = 528 compound segments
  • Each tested against ~40 outcomes = 21,120 new tests
  • BH correction over 21K tests: per-test threshold drops to ~0.000005
  • Most compound groups have tiny N (both conditions simultaneously true)

Verdict: Doesn't scale. The combinatorial explosion drowns real signals in BH correction, and most groups are too small for meaningful tests.

Approach B: Two-Way ANOVA / Aligned Rank Transform (deferred)

Use a non-parametric two-way ANOVA equivalent to test for interaction effects directly.

  • Statistically rigorous
  • Tests the interaction term (A×B) separately from main effects
  • But: complex to implement, hard to explain to users, doesn't integrate cleanly with the BH pipeline

Verdict: Potentially valuable long-term for users with very dense data (6+ months). Not practical for initial multi-factor support.

Approach C: Decision Tree Feature Importance (deferred)

Fit a decision tree per outcome metric. The tree naturally discovers interactions (split on A, then split on B within A).

  • Captures higher-order interactions automatically
  • But: different statistical paradigm, overfitting risk with 60-90 days of data, can't control false discovery rate via BH, requires new infrastructure

Verdict: Interesting for a future "AI-powered insight" layer but not appropriate for the statistical pipeline.

Approach D: Stratified Analysis (selected)

Take each BH-surviving single-factor discovery and ask: "Within this group, does a second factor add explanatory power?"

  1. Single-factor pipeline produces: "High steps → 12% better HRV" (Group A = high-steps days)
  2. Within Group A dates, split by a second segment (early vs late bedtime)
  3. MW-U test: within high-steps days, is HRV significantly different for early vs late bedtime?
  4. If yes, AND the compound effect > single-factor effect: emit compound candidate

Verdict: Best fit for v3. Small test count, natural narrative, reuses existing statistical tools, builds on already-validated single-factor results.

11.4 Stratified Analysis Design

How it works in the pipeline

Steps 1-9:   v3 pipeline (unchanged)
             Produces single-factor BH survivors with group_a_dates

Step 9.5:    INTERACT (new, Phase 6)
             For each single-factor BH survivor S (factor A → outcome Y):
               For each eligible segment metric B (B ≠ A, B ≠ Y, not blacklisted):
                 Filter DailyMetricRow[] to only S.group_a_dates
                 Split these dates by B's segment condition (high/low/boolean)
                 If both sub-groups ≥ MIN_COMPOUND_GROUP_SIZE:
                   MW-U test on Y between sub-groups
                   If significant AND compound effect > single-factor effect:
                     Emit compound candidate

Step 9.6:    CORRECT-COMPOUND (separate BH family, stricter alpha)
             BH correction on compound candidates only (α = 0.05)
             Separate from single-factor BH (already done in step 7)

Steps 10-12: NARRATE, PERSIST, LOG (extended for compound type)
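The stratification at the heart of Step 9.5 can be sketched as follows. DayRow and the median split are illustrative; the real pipeline would reuse the existing segment conditions and the mannWhitneyU implementation from statistical-tests.ts:

```typescript
const MIN_COMPOUND_GROUP_SIZE = 7; // matches the plan's proposed starting value

// Illustrative row shape — the real DailyMetricRow carries more structure.
interface DayRow {
  date: string;
  metrics: Record<string, number | undefined>;
}

/** Within the base discovery's group-A dates, split by a second factor's
 *  median and return the two outcome samples, ready for a MW-U test.
 *  Returns null when either sub-group is too small. */
function stratify(
  rows: DayRow[],
  groupADates: Set<string>,
  secondFactorKey: string,
  outcomeKey: string,
): { high: number[]; low: number[] } | null {
  const inGroup = rows.filter(
    (r) =>
      groupADates.has(r.date) &&
      r.metrics[secondFactorKey] !== undefined &&
      r.metrics[outcomeKey] !== undefined,
  );
  const values = inGroup
    .map((r) => r.metrics[secondFactorKey]!)
    .sort((a, b) => a - b);
  if (values.length < 2 * MIN_COMPOUND_GROUP_SIZE) return null;
  const median = values[Math.floor(values.length / 2)];
  const high = inGroup.filter((r) => r.metrics[secondFactorKey]! > median);
  const low = inGroup.filter((r) => r.metrics[secondFactorKey]! <= median);
  if (high.length < MIN_COMPOUND_GROUP_SIZE || low.length < MIN_COMPOUND_GROUP_SIZE) {
    return null;
  }
  return {
    high: high.map((r) => r.metrics[outcomeKey]!),
    low: low.map((r) => r.metrics[outcomeKey]!),
  };
}
```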

Why separate BH families?

Single-factor and multi-factor candidates must be corrected in separate BH families because:

  1. Different base rates: Compound candidates are derived from already-significant single-factor results (higher prior probability). Mixing them with single-factor candidates dilutes both.

  2. Stricter threshold for compound claims: "X + Y → Z" is a stronger claim than "X → Z". Using α=0.05 (vs 0.10 for single-factor) reflects this.

  3. Test count independence: Adding 100 compound tests shouldn't make single-factor tests harder to pass. Separate families ensure this.

Test count analysis

If 10 single-factor discoveries survive BH, and there are 10 eligible segment metrics:

  • 10 discoveries × 10 second factors = 100 compound tests
  • Minus blacklisted pairs and insufficient group sizes → ~50-70 actual tests
  • BH at α=0.05 over 70 tests: per-test threshold ~0.0007-0.001
  • Only genuinely strong interactions survive

This is tractable — no combinatorial explosion.
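The ~0.0007 figure is the step-up threshold the smallest p-value must clear (α/m = 0.05/70 ≈ 0.000714). For illustration only, a minimal Benjamini-Hochberg sketch; the pipeline itself uses the existing benjaminiHochberg() in statistical-tests.ts:

```typescript
/** Minimal BH step-up: returns the indices (into the input array) of the
 *  hypotheses that survive FDR control at level alpha. Illustrative only. */
function bhSurvivors(pValues: number[], alpha: number): number[] {
  const order = pValues
    .map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p);
  const m = pValues.length;
  // Largest rank k (0-based) with p_(k+1) <= ((k+1)/m) * alpha.
  let cutoff = -1;
  order.forEach(({ p }, k) => {
    if (p <= ((k + 1) / m) * alpha) cutoff = k;
  });
  // Everything at or below the cutoff rank survives (step-up property).
  return order.slice(0, cutoff + 1).map(({ i }) => i);
}
```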

Compound candidate schema

interface CompoundCandidate extends PatternCandidate {
  type: 'compound_interaction';

  /** The single-factor discovery this builds on */
  base_factor: {
    segment_metric_key: string;
    segment_label: string;
    outcome_metric_key: string;
    single_factor_change_pct: number;
    single_factor_p_value: number;
  };

  /** The second factor tested within the first factor's group */
  second_factor: {
    segment_metric_key: string;
    segment_label: string;
    split_type: string;
  };

  /** Effect of factor B within factor A's group */
  conditional_change_pct: number;

  /** Total compound effect (A + B combined vs neither) */
  compound_change_pct: number;

  /** Is the compound effect greater than the sum of individual effects?
   *  (synergistic = true interaction, not just additive) */
  is_synergistic: boolean;
}

Narrative templates

// Additive (compound > single, but not synergistic)
title: "High steps + early bedtime → best HRV days"
summary: "On high-steps days, going to bed early is associated with an additional
  8% improvement in HRV (total compound effect: 20%, vs 12% from high steps alone)."

// Synergistic (compound > sum of individual effects)
title: "Workouts + low glucose CV → exceptional deep sleep"
summary: "On workout days, having stable glucose is associated with 25% more deep sleep —
  more than the 9% from workouts alone plus the 11% from stable glucose alone.
  The combination appears to have a synergistic effect."

// Observation (weaker signal)
title: "Weekend + high strain may be linked to better recovery"
summary: "On weekends, high-strain days appear to be associated with 7% better recovery
  scores. We're watching this pattern as more data comes in."

Blacklist implications

The existing insight_engine_blacklist table naturally handles compound blacklisting. For a compound test (A, B → Y), three pair checks are needed:

  1. isBlacklisted(A, Y) — already checked (the single-factor passed this)
  2. isBlacklisted(B, Y) — must check (B might be tautological with Y)
  3. isBlacklisted(A, B) — must check (if A and B are tautological, the compound is meaningless)

All three checks use the same blacklist table. No schema change needed.
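All three checks reduce to unordered pair lookups. A sketch, assuming the blacklist is cached as a Set of sorted "a|b" keys (the actual cache shape in the pipeline may differ):

```typescript
/** Order-insensitive lookup key for a metric pair (illustrative). */
function pairKey(x: string, y: string): string {
  return [x, y].sort().join('|');
}

/** Guard for a compound test (A, B → Y): any blacklisted pair among
 *  (A,Y), (B,Y), (A,B) disqualifies the compound. */
function isCompoundBlacklisted(
  blacklist: Set<string>,
  factorA: string,
  factorB: string,
  outcome: string,
): boolean {
  return (
    blacklist.has(pairKey(factorA, outcome)) || // (1) already passed for single-factor
    blacklist.has(pairKey(factorB, outcome)) || // (2) B tautological with Y
    blacklist.has(pairKey(factorA, factorB))    // (3) A tautological with B
  );
}
```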

Minimum group sizes

Stratification shrinks groups. If high-steps Group A has 15 days, and early bedtime within that has 7 days, we're at the current MIN_GROUP_SIZE = 7 limit.

Options:

  • Accept it: Multi-factor insights require dense data. Users with <90 days won't see many compound results. This is fine — it's a premium insight for engaged users.
  • Reduce minimum for compound tests: MIN_COMPOUND_GROUP_SIZE = 5. Riskier but produces more results. Mitigated by the stricter BH alpha (0.05).

Proposed: Start with MIN_COMPOUND_GROUP_SIZE = 7 (same as single-factor). If too few compounds survive, relax to 5.

11.5 What v3 Must Ship to Enable This

| Change | Where | Why |
|---|---|---|
| Add group_a_dates, group_b_dates to PatternCandidate | pattern-ranker.ts | Enables date-based stratification lookup |
| Populate dates during SCAN step | pattern-spotter.ts | Store which dates are in each group |
| Blacklist supports 3-way checks | Already works | Pair-based checks cover all three combinations |
| BH function callable on subsets | Already works | benjaminiHochberg() takes any candidate array |
| DailyMetricRow[] accessible after step 9 | pattern-spotter.ts | Compound analyzer needs to look up second-factor values by date |

The first two items are the only code changes needed in v3. Everything else is already in place.

11.6 What Phase 6 Adds (NOT in v3 scope)

  • Stratified analysis step (9.5: INTERACT)
  • Separate BH correction for compound family (9.6: CORRECT-COMPOUND)
  • Compound candidate type and narrative templates
  • discovery_type: 'compound_pattern' in user_discoveries (new migration)
  • Admin dashboard visualization for compound patterns
  • Mobile UI for compound discovery cards (showing both factors)

12. Plan Review: Gaps, Risks, and Assumptions

12.0 CRITICAL: Systemic Timezone Misalignment

This is the most significant finding from the plan review. It affects v2 today and must be addressed in v3.

The Problem

The insight engine merges data from 6 tables into DailyMetricRow[] keyed by date string (e.g., "2026-03-15"). But that date means different things depending on the data source:

| Table | Provider | What "March 15" means | How date is extracted |
|---|---|---|---|
| daily_summary | Oura | User's local March 15 | Oura API returns day in user-local time |
| daily_summary | WHOOP | UTC March 15 | record.start.split('T')[0] on UTC ISO |
| daily_summary | Fitbit | UTC March 15 | Date ranges constructed as UTC |
| sleep_sessions | All | UTC wake date | toISOString().substring(0,10) on calculated wake time |
| activities | All | UTC date | extractDate(start_time) on UTC ISO |
| glucose_data | Libre | UTC date (factory time) | extractDate(glucose_timestamp) |
| intraday_data | Fitbit | User's local date | ${date}T${time} with no Z suffix |
| intraday_data | Oura | UTC date | Oura returns UTC ISO timestamps |

Impact example: A US Pacific (UTC-8) user works out at 9 PM local on March 15. That's 2026-03-16T05:00:00Z. The activity gets grouped into March 16. But Oura's daily_summary calls that day March 15 (user-local). Their sleep starting at 11 PM local (7 AM UTC March 16) gets a wake date of March 16 UTC. When the merge joins by date key, the workout is on March 16, Oura's steps are on March 15, and sleep is on March 16. The correlation "workout days → better sleep" is now testing misaligned day pairs.

The further from UTC the user is, the worse this gets. US timezones (UTC-5 to UTC-8), Australia (UTC+10), and Asia (UTC+8/+9) are all significantly affected.

What's Available But Not Used

WHOOP sends timezone_offset (e.g., "-08:00") on every cycle, sleep, and workout record. It is defined in the TypeScript interfaces (transformers.ts:49,90,125) but discarded during transformation — never stored in the database.

Oura returns user-local dates for daily data but UTC timestamps for sleep/activity. The local→UTC offset is implicit but not captured.

Fitbit intraday data arrives as local wall-clock time. Daily summary dates are from UTC ranges.

Libre provides both FactoryTimestamp (system/UTC) and Timestamp (user-local), stored as glucose_timestamp and display_time respectively.

No user timezone exists in the database. user_profiles has name, email, date_of_birth, biological_sex, preferences (unused JSONB) — no timezone field.

Recommended Fix

See timezone-strategy.md for the complete plan. Summary of key decisions:

  • IANA timezone is the source of truth — stored on user_profiles.timezone (e.g., "America/Los_Angeles"). No per-record offset columns. IANA handles DST automatically via Intl.DateTimeFormat.
  • Auto-detect, don't ask — Mobile device pushes timezone on every app launch (Expo Localization). Fitbit provides IANA directly. WHOOP/Libre offsets validate against existing IANA. User never needs to manually configure.
  • Travel is a v2 concern — v1 uses a single IANA timezone per user, auto-updated. Minor misalignment during travel (~5 records out of 60-90 days) is within statistical noise.
  • Glucose uses display_time directly — already user-local, no IANA conversion needed. Simplest and most reliable path.
  • Aggregation normalizes at read time — Raw UTC timestamps are never modified. extractLocalDate() and extractLocalHour() use Intl.DateTimeFormat with the user's IANA timezone.
  • Graceful degradation — If no timezone is known, fall back to UTC (current behavior). Pipeline diagnostics log the fallback.

Impact on v3 Plan

This is a prerequisite for the intraday analyzers. If time-of-day bucketing uses a mix of UTC and local timestamps, the temporal distribution analyzer will produce incorrect results.

Revised phase ordering:

  • Phase 0: Timezone normalization (Phases A-E of timezone-strategy.md)
  • Phase 1-3: Can proceed in parallel (blacklist, discovery, metric catalog don't depend on timezone for core logic)
  • Phase 4: Intraday analyzers strictly require Phase 0 to be complete

Phase 0 also fixes pre-existing v2 bugs that affect ALL daily correlations for non-UTC users.

12.05 Concerns with Timezone Strategy (cross-reference)

These are concerns about the timezone strategy that affect v3. They should be addressed in timezone-strategy.md but are noted here for completeness.

TZ-CONCERN-1: Intl.DateTimeFormat performance (RESOLVED)

Solved by the createLocalTimeExtractors() closure factory adopted in timezone-strategy.md Section 4.3. Two Intl.DateTimeFormat instances are created once per pipeline run and captured by the closure. All aggregation functions receive a LocalTimeExtractors object, not a timezone string. ~54 ms total vs ~1.8 s naive (roughly 33x faster).

TZ-CONCERN-2: Oura intraday provider-awareness in aggregation

The timezone strategy says aggregateIntradayByDay should treat Fitbit timestamps as local and Oura timestamps as UTC. But aggregateIntradayByDay currently receives a flat array of rows with no provider indicator, so it cannot tell which rows came from Fitbit vs Oura. Options:

  • (a) Normalize all intraday timestamps to local during bundle construction (before aggregation) — this is what the IntradayDataBundle now does
  • (b) Add a provider or is_local_timestamp flag to each intraday row
  • (c) Normalize Oura timestamps at sync/ingestion time (like the WHOOP activity_date fix)

The v3 plan chose (a) — normalize at bundle construction. The timezone strategy Phase C should align with this approach.

TZ-CONCERN-3: WHOOP vendor_metadata historical backfill

The timezone strategy Phase B preserves WHOOP timezone_offset in vendor_metadata going forward. Phase E attempts to backfill from vendor_metadata. But timezone_offset is NOT currently stored in vendor_metadata — it's discarded. So historical WHOOP records have no timezone data to backfill from. The backfill script for WHOOP users must fall back to the user's profile timezone (once set by the mobile device) or leave historical records at UTC.

TZ-CONCERN-4: display_time fetch not in pattern-spotter query

The pattern-spotter.ts fetchAllPaginated call for glucose_data selects glucose_timestamp, glucose_value, trend but NOT display_time. This must be added in Phase 0/C. Without it, the intraday bundle can't construct localTimestamp for glucose readings.

12.1 Gaps (things the plan doesn't address yet)

GAP-1: display_time not in glucose fetch query

See TZ-CONCERN-4 above. The pattern-spotter.ts fetchAllPaginated call for glucose_data selects glucose_timestamp, glucose_value, trend but NOT display_time. Must be added in timezone strategy Phase C to enable correct local-time bucketing for both daily aggregation and intraday analyzers.

GAP-2: Metric type classification for raw intraday readings

The Metric Type Catalog (Section 4.3) classifies daily metric keys (e.g., avg_glucose). But the intraday analyzers operate on raw readings (glucose_value, heart_rate) that haven't gone through type classification. The analyzers need to know: what's the optimalRange? What are the plausibility bounds? The plan should specify how intraday metrics get typed. Proposed: the analyzer receives type metadata from the parent metric (e.g., glucose_data readings inherit the glucose type config; intraday_data heart_rate readings inherit the heart_rate type config). This is a small lookup, not a new system.

GAP-3: Blacklist seeding script not specified

The plan references a "one-time Spearman correlation analysis script" in Phase 2 but doesn't specify where it lives, how it's run, or its output format. It should be a standalone script in scripts/ (like glucose-trend-analysis.mjs), producing a CSV/JSON report for human review, with a companion SQL file to insert confirmed entries.

GAP-4: Blacklist seeding as a migration

The ~23 seed entries (source-group pairs) should be a Supabase migration (INSERT INTO insight_engine_blacklist ...) so they're reproducible and version-controlled. The plan mentions creating the table migration but doesn't mention seeding it in the same migration.

GAP-5: Cross-type deduplication

The current dedup logic (pattern-ranker.ts) checks metric_key + segment_metric_key + change_pct. But a temporal_distribution candidate ("Evening glucose is 15% higher") and a segment_comparison candidate ("High daytime glucose → higher evening glucose") could report the same underlying phenomenon. Need dedup rules that work across candidate types. Proposed: dedup on metric_key + change_pct only (ignore segment_metric_key for cross-type comparison), with a wider tolerance (±10% instead of ±5%) for cross-type pairs.
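A hypothetical sketch of this proposed rule (the `Candidate` shape and the interpretation of tolerance as percentage points are assumptions, not the production dedup implementation):

```typescript
// Sketch of the proposed cross-type dedup rule: two candidates collide when
// they share metric_key and their change_pct values are within a tolerance —
// assumed here to be ±5 points within a type and ±10 points across types.
interface Candidate {
  type: string;       // e.g. "segment_comparison" | "temporal_distribution"
  metric_key: string;
  change_pct: number;
}

function isDuplicate(a: Candidate, b: Candidate): boolean {
  if (a.metric_key !== b.metric_key) return false;
  // segment_metric_key is deliberately ignored for cross-type comparison
  const tolerance = a.type === b.type ? 5 : 10;
  return Math.abs(a.change_pct - b.change_pct) <= tolerance;
}
```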

GAP-6: Excursion cluster minimum may be too high

The plan requires ≥50 excursion events for the chi-squared test. A well-controlled glucose user with TIR >80% might have only 20-30 excursions in 60 days. The chi-squared test requires ≥5 expected per bucket; with 6 buckets and 30 total excursions, expected per bucket = 5 — exactly at the minimum. Proposed: lower to ≥30 total excursions (5 per bucket × 6 buckets).

GAP-7: display_time nullability

display_time is nullable in the schema. If a glucose reading has no display_time (e.g., from a future Dexcom integration that doesn't provide it), the intraday analyzers can't bucket it. Need a fallback: use glucose_timestamp and log a warning. The aggregation fix should also handle this gracefully.

12.2 Risks we are accepting

RISK-1: BH correction gets stricter with more tests

v2 runs ~400 tests. v3 will run ~700+ (more lagged pairs + intraday analyzers). BH at α=0.10 will tighten per-test thresholds by ~40%. Some marginal v2 discoveries may stop being significant. This is statistically correct (they were likely false positives) but could confuse users who previously saw a finding that vanishes.

Mitigation: The diagnostics track total test count. If BH becomes too strict, we can consider separate BH families for daily vs intraday analyses (like we do for single-factor vs compound in Phase 6). However, separating BH families is a double-edged sword — it increases discovery rate but also increases false positive rate.

Recommendation: Accept this risk. Start with a single BH family. Monitor the pass rate in diagnostics. If BH survivors drop dramatically vs v2, consider family separation.

RISK-2: Blacklist cold start for new providers

When a new provider (Garmin, Apple Health) is added, there are no blacklist entries for its metrics. Tautological pairs will surface in the first runs until someone reviews novel_survivors and adds them. Users might see one or two obvious correlations before the blacklist catches up.

Mitigation: When adding a new provider, pre-seed likely tautological pairs based on the metric naming (e.g., if Garmin has garmin_steps and garmin_distance, add them proactively). The novel_survivor detection will catch anything missed within one run cycle.

RISK-3: Label quality for auto-discovered metrics

If a new metric appears that isn't in the label lookup table, the auto-generated label (sleep_deep_pct → "Sleep Deep Pct") is user-facing in discovery titles. Poor labels degrade trust.

Mitigation: The label lookup table should be comprehensive for all currently known metrics (~62 entries from v2 registry — keep these labels). Auto-generation only applies to genuinely novel metrics. The admin dashboard should flag discoveries with auto-generated labels for review.

RISK-4: Intraday analyzer assumptions about data density

The temporal distribution analyzer assumes roughly uniform data density across time-of-day buckets. But CGM sensors have gaps (sensor warmup, falloffs, replacements). If a user's overnight readings are sparse (sensor fell off while sleeping), the overnight bucket will have lower N, reducing statistical power and potentially creating false signals.

Mitigation: Each bucket reports its count in diagnostics. The analyzer should require minimum N per bucket (e.g., ≥30 readings) to include a bucket in the analysis. Buckets below the threshold are excluded (not compared).

RISK-5: Edge function timeout

v2 runs 2-5 seconds. The plan estimates v3 adds <1 second. But this hasn't been benchmarked with real data. 17,000 glucose readings × 4 analyzers could take longer than estimated, especially on cold starts. Supabase edge function default timeout is 60 seconds, but we're also constrained by the cron orchestrator's rate limiting (1-second delay between user runs).

Mitigation: Each analyzer is isolated in try/catch with elapsed_ms tracking. If total elapsed approaches 30 seconds, remaining analyzers can be skipped (fail-open with partial_analyzer_failure exit reason). Monitor elapsed_ms in production diagnostics.

RISK-6: v2 overnight/daytime glucose bug is pre-existing

The aggregateGlucoseByDay function uses glucose_timestamp (UTC) instead of display_time (local) for overnight/daytime classification. This means overnight_glucose and daytime_glucose metrics are wrong for all non-UTC users in v2 today. Fixing this in v3 will change these metrics' values, potentially flipping existing discoveries (e.g., "your overnight glucose is trending down" might become "trending up" when correctly computed).

Mitigation: This is a bug fix, not a behavioral change. The correct values should be used. Existing discoveries based on incorrect overnight/daytime values will naturally be superseded by new discoveries based on correct values. Document this as a known breaking change in the v3 release notes.

RISK-7: Sequential change analyzer finds obvious physiological patterns

The analyzer might surface "Your heart rate drops between 21:00 and 3:00" — which is just... normal nocturnal HR decline. Everyone has this. It's not actionable.

Mitigation: The classifier already distinguishes discoveries (strong signal, actionable) from observations (weaker signal). Universal physiological patterns will have low effect size relative to the user's own variance (everyone has the pattern, so it's not discriminating). Additionally, consider a "minimum prevalence surprise" threshold — if consistency is >90% of days, the pattern is likely physiological and not behavioral. Could add a flag: is_universal_pattern: true to downrank these, or let BH correction handle it (if the signal is universal, the p-value will be very low but the change_pct might be large — monitor in practice).

12.3 Assumptions we are making

  1. Supabase edge function memory is sufficient for holding ~17,000 glucose readings + ~8,600 intraday readings + daily metric rows simultaneously in memory. This is ~25,000 objects × ~100 bytes each ≈ 2.5 MB. Well within the 256 MB memory limit.

  2. display_time for Libre and raw timestamp for Fitbit/WHOOP/Oura intraday data both represent user-local wall-clock time encoded as UTC ISO strings. If a future provider stores true UTC timestamps in these fields, time-of-day bucketing will be wrong for that provider.

  3. The current ~62 metrics are comprehensive enough that few genuinely novel metrics will appear from existing providers. Auto-discovery is primarily for new providers, not for finding metrics we've been ignoring from existing ones.

  4. Users care about time-of-day patterns in their health data. The intraday analyzers are the biggest new investment. If users don't find "your glucose is higher in the evening" actionable, these analyzers add complexity without value. The glucose trend analysis script suggests these findings ARE valuable, but production user feedback will be the real test.

  5. The blacklist review cadence is sustainable. The plan assumes someone (admin) will periodically review novel_survivors. If this doesn't happen, the blacklist becomes stale and tautological pairs slip through.


13. TDD Implementation Plan

Tests are written first, then implementation makes them pass. Each step produces a green test suite before moving to the next. Steps within a phase can be parallelized where noted.

Conventions

  • Edge function tests use Deno test (Deno.test + assertEquals)
  • Shared module tests live alongside source: _shared/foo.test.ts
  • Pattern spotter integration tests use vitest (existing setup)
  • All tests run in CI via pre-commit hook (existing)

Phase 0: Timezone (Phases B-E of timezone-strategy.md)

Phase A is DONE. Remaining work:

Step 0.1: _shared/local-time.ts — LocalTimeExtractors factory

New file. Tests first:

TEST: createLocalTimeExtractors(null).extractDate("2026-03-16T06:30:00Z") → "2026-03-16"
TEST: createLocalTimeExtractors(null).extractHour("2026-03-16T06:30:00Z") → 6
TEST: createLocalTimeExtractors("America/Los_Angeles").extractDate("2026-01-16T06:30:00Z") → "2026-01-15"
TEST: createLocalTimeExtractors("America/Los_Angeles").extractHour("2026-01-16T06:30:00Z") → 22.5
TEST: createLocalTimeExtractors("Asia/Tokyo").extractDate("2026-03-15T20:00:00Z") → "2026-03-16"
TEST: createLocalTimeExtractors("Asia/Tokyo").extractHour("2026-03-15T20:00:00Z") → 5
TEST: createLocalTimeExtractors("Asia/Kolkata").extractHour("2026-03-16T00:00:00Z") → 5.5
TEST: DST transition — LA, March 8 spring forward: UTC 09:00 → hour 1, UTC 11:00 → hour 4
TEST: invalid timestamp → extractDate returns substring fallback, extractHour returns null
TEST: two extractors created (not 18,000) — verify via formatter mock or timing check

Implementation: createLocalTimeExtractors() with closure over Intl.DateTimeFormat.
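A minimal sketch of that closure factory, assuming the `en-CA` (YYYY-MM-DD) and `en-GB` (24-hour) locale formats for output parsing; the production module may handle options and the null-timezone fallback differently:

```typescript
// Sketch: two Intl.DateTimeFormat instances created once and reused for every
// timestamp (the performance point from TZ-CONCERN-1).
interface LocalTimeExtractors {
  extractDate(isoUtc: string): string;        // "YYYY-MM-DD" in the user's zone
  extractHour(isoUtc: string): number | null; // fractional local hour, e.g. 22.5
}

function createLocalTimeExtractors(iana: string | null): LocalTimeExtractors {
  if (!iana) {
    // Fallback preserves legacy v2 behavior: UTC substring date, whole UTC hour
    return {
      extractDate: (ts) => ts.substring(0, 10),
      extractHour: (ts) => {
        const d = new Date(ts);
        return isNaN(d.getTime()) ? null : d.getUTCHours();
      },
    };
  }
  // en-CA formats dates as YYYY-MM-DD; en-GB with hour12:false gives HH:mm
  const dateFmt = new Intl.DateTimeFormat("en-CA", {
    timeZone: iana, year: "numeric", month: "2-digit", day: "2-digit",
  });
  const timeFmt = new Intl.DateTimeFormat("en-GB", {
    timeZone: iana, hour: "2-digit", minute: "2-digit", hour12: false,
  });
  return {
    extractDate: (ts) => {
      const d = new Date(ts);
      return isNaN(d.getTime()) ? ts.substring(0, 10) : dateFmt.format(d);
    },
    extractHour: (ts) => {
      const d = new Date(ts);
      if (isNaN(d.getTime())) return null;
      const [h, m] = timeFmt.format(d).split(":").map(Number);
      return h + m / 60;
    },
  };
}
```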

Step 0.2: _shared/timezone-utils.ts — Offset parsing and IANA resolution

New file.

TEST: parseOffsetToMinutes("-08:00") → -480
TEST: parseOffsetToMinutes("+05:30") → 330
TEST: parseOffsetToMinutes("+00:00") → 0
TEST: parseOffsetToMinutes("invalid") → 0
TEST: getIanaOffset("America/Los_Angeles", Jan 15 2026) → -480
TEST: getIanaOffset("America/Los_Angeles", Jul 15 2026) → -420
TEST: resolveIanaFromOffset(-480, "America/Los_Angeles", Jan 15) → "America/Los_Angeles"
TEST: resolveIanaFromOffset(-420, "America/Los_Angeles", Jul 15) → "America/Los_Angeles" (DST)
TEST: resolveIanaFromOffset(-300, "America/Los_Angeles", Jan 15) → null (travel detected)
TEST: resolveIanaFromOffset(-480, null, Jan 15) → null (no existing IANA)
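A sketch of the offset parser those tests describe (assumed signature; the real timezone-utils.ts may validate more strictly):

```typescript
// Parse "+HH:MM" / "-HH:MM" offsets into signed minutes; invalid input
// falls back to 0 (UTC), matching the test expectations above.
function parseOffsetToMinutes(offset: string): number {
  const m = /^([+-])(\d{2}):(\d{2})$/.exec(offset);
  if (!m) return 0;
  const sign = m[1] === "-" ? -1 : 1;
  return sign * (Number(m[2]) * 60 + Number(m[3]));
}
```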

Step 0.3: Aggregation layer — timezone-aware functions

Modify _shared/daily-aggregation.ts. Tests written against existing test file, new timezone variants:

TEST: aggregateGlucoseByDay groups by display_time, not glucose_timestamp
TEST: aggregateGlucoseByDay — reading at display_time "2026-03-15T22:00:00Z" → daytime (hour 22)
TEST: aggregateGlucoseByDay — null display_time falls back to glucose_timestamp
TEST: groupSleepByWakeDate(sessions, LA_extractors) — sleep ending 7 AM Pacific → correct local date
TEST: groupSleepByWakeDate(sessions, null_extractors) — preserves current UTC behavior
TEST: aggregateActivitiesByDay(activities, LA_extractors) — 11 PM Pacific activity → correct local date
TEST: countAlarmsByDay(alarms, LA_extractors) — alarm at 1 AM UTC → previous local date for LA
TEST: bedtime_hour with LA timezone → local hour, not UTC hour
TEST: wake_hour with LA timezone → local hour, not UTC hour
TEST: all functions with null_extractors → identical to current v2 output (regression guard)

Implementation: Add localTime: LocalTimeExtractors parameter to each function. Replace extractDate/extractHour calls.

Step 0.4: Pattern spotter — display_time fetch + lookback fix

Modify pattern-spotter.ts.

TEST: glucose_data query includes display_time in SELECT
TEST: endDate uses user's local "today" (mock timezone + current time)
TEST: startDate is lookback_days before local "today"
TEST: diagnostics include timezone.iana, timezone.source, timezone.fallback

Step 0.5: Sync function timezone extraction

Modify sync functions. Can be parallelized across providers.

-- WHOOP --
TEST: transformWhoopCycle includes timezone_offset in vendor_metadata
TEST: transformWhoopSleep includes timezone_offset in vendor_metadata
TEST: transformWhoopWorkout includes timezone_offset in vendor_metadata
TEST: WHOOP activity_date uses timezone_offset for local date extraction
TEST: WHOOP cycle at 11 PM UTC-8 → activity_date is the local date (not the next UTC day)

-- Fitbit --
TEST: sync-fitbit fetches user profile timezone
TEST: Fitbit IANA timezone stored in user_profiles.timezone

-- Libre --
TEST: sync-libre derives offset from display_time - glucose_timestamp
TEST: derived offset consistent with existing IANA → tz_updated_at refreshed

-- Oura --
TEST: sync-oura fetches personal_info timezone (if available)

Phase 1: Foundation

Can proceed in parallel with Phase 0 Steps 0.2-0.5.

Step 1.1: _shared/statistical-tests.ts — New tests

Extend existing file.

-- Wilcoxon Signed-Rank --
TEST: wilcoxonSignedRank([5, 3, 8, 2, 7]) — known W, z, p values
TEST: wilcoxonSignedRank with all-positive deltas → very low p (consistent direction)
TEST: wilcoxonSignedRank with balanced ±deltas → high p (no direction)
TEST: wilcoxonSignedRank with zeros removed correctly
TEST: wilcoxonSignedRank with <7 non-zero deltas → returns null
TEST: wilcoxonSignedRank ties handled via average-rank method

-- Chi-Squared Uniformity --
TEST: chiSquaredUniformity([10, 10, 10, 10, 10, 10]) → high p (uniform)
TEST: chiSquaredUniformity([50, 5, 5, 5, 5, 5]) → low p (clustered)
TEST: chiSquaredUniformity with <20 total → returns null
TEST: cramersV correct for known data
TEST: Wilson-Hilferty p-value approximation within ±0.01 of scipy reference

Step 1.2: _shared/metric-type-catalog.ts — New module

New file.

TEST: classifyMetric("resting_hr") → type "heart_rate", bounds [25, 220]
TEST: classifyMetric("avg_glucose") → type "glucose", optimalRange [70, 180]
TEST: classifyMetric("sleep_deep_pct") → type "percentage", bounds [0, 100]
TEST: classifyMetric("bedtime_hour") → type "clock_time", bounds [0, 24]
TEST: classifyMetric("strain") → type "strain", bounds [0, 21]
TEST: classifyMetric("steps") → type "steps"
TEST: classifyMetric("workout_calories") → type "calories"
TEST: classifyMetric("totally_unknown_metric") → default type (no bounds, neutral direction)
TEST: first matching pattern wins (order matters)
TEST: getLabel("sleep_deep_pct") → "Deep Sleep %" (from lookup)
TEST: getLabel("unknown_metric") → "Unknown Metric" (auto-generated from key)
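A hypothetical sketch of first-match-wins classification and label fallback (the patterns and type configs here are illustrative, not the full catalog):

```typescript
// Pattern-ordered metric classification: the first matching regex wins,
// unknown keys fall through to a neutral default type.
interface MetricType {
  type: string;
  bounds?: [number, number];
  optimalRange?: [number, number];
}

const TYPE_PATTERNS: Array<[RegExp, MetricType]> = [
  [/glucose/, { type: "glucose", bounds: [40, 400], optimalRange: [70, 180] }],
  [/_pct$/, { type: "percentage", bounds: [0, 100] }],
  [/(^|_)hr($|_)|heart_rate/, { type: "heart_rate", bounds: [25, 220] }],
  [/_hour$/, { type: "clock_time", bounds: [0, 24] }],
];

function classifyMetric(key: string): MetricType {
  for (const [pattern, config] of TYPE_PATTERNS) {
    if (pattern.test(key)) return config; // order matters: first match wins
  }
  return { type: "default" }; // unknown metric: no bounds, neutral direction
}

function getLabel(key: string, labels: Record<string, string> = {}): string {
  // Known metrics come from the lookup table; novel ones are auto-generated.
  return labels[key] ??
    key.split("_").map((w) => w[0].toUpperCase() + w.slice(1)).join(" ");
}
```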

Step 1.3: insight_engine_blacklist migration

New migration file.

TEST: table insight_engine_blacklist exists after migration
TEST: columns: id, metric_a, metric_b, reason, notes, source, is_active, created_at, updated_at
TEST: UNIQUE constraint on (metric_a, metric_b)
TEST: ~23 seed entries present (same_source, definitional, trivial_correlation)
TEST: seed entry ("avg_glucose", "glucose_cv") exists with reason "same_source"
TEST: seed entry ("distance_m", "steps") exists with reason "trivial_correlation"
TEST: seed entry ("sleep_deep_pct", "sleep_light_pct") exists with reason "definitional"
TEST: index on (metric_a, metric_b) WHERE is_active = true exists

Phase 2: Blacklist Seeding

Step 2.1: Correlation analysis script

New file: scripts/blacklist-seeding-analysis.mjs (like glucose-trend-analysis.mjs)

TEST: spearmanRho on perfectly correlated data → 1.0
TEST: spearmanRho on perfectly anti-correlated data → -1.0
TEST: spearmanRho on uncorrelated data → ~0 (within ±0.15)
TEST: spearmanRho handles ties correctly
TEST: script outputs pairs with median ρ > 0.80 across users
TEST: output includes pair, median ρ, user count, sample correlations

This produces a review report. Human confirms entries → INSERT into blacklist table.
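A sketch of the Spearman helper the script needs: rho is the Pearson correlation of the rank vectors, with ties handled via average ranks (assumed implementation for the seeding script, not the production statistical-tests module):

```typescript
// Average-rank assignment (ties share the mean of their 1-based rank span).
function averageRanks(values: number[]): number[] {
  const idx = values.map((v, i) => [v, i] as [number, number]).sort((a, b) => a[0] - b[0]);
  const ranks = new Array<number>(values.length).fill(0);
  let i = 0;
  while (i < idx.length) {
    let j = i;
    while (j + 1 < idx.length && idx[j + 1][0] === idx[i][0]) j++;
    const avg = (i + j) / 2 + 1; // mean of 1-based ranks i+1..j+1
    for (let k = i; k <= j; k++) ranks[idx[k][1]] = avg;
    i = j + 1;
  }
  return ranks;
}

// Spearman's rho = Pearson correlation of the two rank vectors.
function spearmanRho(x: number[], y: number[]): number {
  const rx = averageRanks(x), ry = averageRanks(y);
  const n = x.length;
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(rx), my = mean(ry);
  let cov = 0, vx = 0, vy = 0;
  for (let k = 0; k < n; k++) {
    cov += (rx[k] - mx) * (ry[k] - my);
    vx += (rx[k] - mx) ** 2;
    vy += (ry[k] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}
```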


Phase 3: Discovery Layer

Depends on: Phase 1 (type catalog, blacklist table).

Step 3.1: _shared/metric-discovery.ts — New module

New file.

-- Metric scanning --
TEST: discovers all non-null metric keys from DailyMetricRow[]
TEST: counts non-null values per metric
TEST: classifies each metric via type catalog
TEST: reports metrics_by_type breakdown

-- Segmentation eligibility --
TEST: metric with ≥20 values and CV >10% → eligible
TEST: metric with 15 values → not eligible (insufficient data)
TEST: metric with CV 3% → not eligible (insufficient variance)
TEST: metric with P25 ≈ P75 → not eligible (no IQR spread)
TEST: boolean metric (all 0s and 1s) → detected as boolean, eligible if both groups ≥ MIN_GROUP_SIZE
TEST: reports segmentation_rejected with reason per rejected metric

-- Blacklist loading --
TEST: loads active entries from insight_engine_blacklist (mocked)
TEST: isBlacklisted("avg_glucose", "glucose_cv") → true
TEST: isBlacklisted("glucose_cv", "avg_glucose") → true (order-independent)
TEST: isBlacklisted("steps", "sleep_duration") → false (not in blacklist)
TEST: deactivated entries not loaded

-- Segment generation --
TEST: eligible continuous metric → binary_high + binary_low segments generated
TEST: eligible continuous metric with ≥30 values → tertile_middle also generated
TEST: boolean metric → boolean segment generated
TEST: outcome metrics exclude blacklisted pairs

Step 3.2: Pattern spotter integration — replace registry with discovery

Modify pattern-spotter.ts.

TEST: pipeline calls discoverMetrics() instead of getSegmentMetrics()/getOutcomeMetrics()
TEST: segments generated from discovery result, not hardcoded registry
TEST: blacklisted pairs skipped in same-day scan (count in diagnostics)
TEST: blacklisted pairs skipped in lagged scan
TEST: lagged tests run for all cross-domain pairs (not hardcoded LAGGED_SEGMENT_OUTCOMES)
TEST: diagnostics include discover.* fields (metrics_found, by_type, segmentation_rejected, etc.)
TEST: diagnostics include blacklist.* fields (entries_loaded, tests_skipped, skipped_by_reason)
TEST: novel_survivors detected (BH survivors not in blacklist)
TEST: pipeline with no data → exits with "insufficient_data", correct diagnostics
TEST: pipeline with data but no BH survivors → exits with "no_bh_survivors"

Step 3.3: PatternCandidate — add group_a_dates/group_b_dates

Modify _shared/pattern-ranker.ts and scan step.

TEST: segment comparison candidates include group_a_dates (array of YYYY-MM-DD strings)
TEST: segment comparison candidates include group_b_dates
TEST: group_a_dates.length matches group_a_values.length
TEST: lagged candidates include dates for the OUTCOME day (not the segment day)
TEST: trend candidates do not have dates (not applicable)
TEST: dates are not persisted in metrics_impact JSONB (ephemeral within pipeline)

Phase 4: Intraday Analyzers

Depends on: Phase 0 Step 0.3 (aggregation timezone fixes), Phase 1 (statistical tests, type catalog).

Step 4.1: IntradayDataBundle construction

New construction logic in pattern-spotter.ts or a helper module.

TEST: glucose readings use display_time as localTimestamp
TEST: glucose readings with null display_time fall back to glucose_timestamp
TEST: Oura intraday timestamps converted to local via LocalTimeExtractors
TEST: Fitbit intraday timestamps used as-is (already local)
TEST: bundle includes timezone from user profile
TEST: bundle readings sorted by localTimestamp

Step 4.2: _shared/temporal-distribution-analyzer.ts — New module

New file.

-- Time-of-day variant --
TEST: buckets readings into 6 × 4-hour segments
TEST: segment with significantly higher mean → candidate emitted
TEST: segment with significantly lower mean → candidate emitted (change_pct negative)
TEST: all segments similar → no candidates
TEST: candidate includes bucket_label, bucket_mean, overall_mean, bucket_count
TEST: bucket with <30 readings → excluded from analysis
TEST: <200 total readings → analyzer returns empty (skipped)
TEST: p_value and effect_size (Cohen's d) in standard PatternCandidate fields

-- Day-of-week variant --
TEST: buckets readings into 7 day-of-week segments
TEST: weekend significantly different → candidate with bucket_label "Saturday" or "Sunday"
TEST: candidate has temporal_variant: "day_of_week"
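A sketch of the 4-hour bucketing and minimum-N exclusion for the time-of-day variant (bucket labels and the 30-reading threshold are illustrative assumptions from the tests above, not the analyzer's exact output):

```typescript
// Bucket intraday readings into 6 × 4-hour segments; buckets below the
// per-bucket minimum are excluded from comparison (the RISK-4 mitigation).
interface Reading { localHour: number; value: number; }

const BUCKET_LABELS = ["overnight (0-4)", "early morning (4-8)", "morning (8-12)",
  "afternoon (12-16)", "evening (16-20)", "night (20-24)"];

function bucketReadings(readings: Reading[], minPerBucket = 30): Map<string, number[]> {
  const buckets = BUCKET_LABELS.map(() => [] as number[]);
  for (const r of readings) {
    const i = Math.min(Math.floor(r.localHour / 4), 5); // fractional hours OK
    buckets[i].push(r.value);
  }
  const out = new Map<string, number[]>();
  buckets.forEach((vals, i) => {
    if (vals.length >= minPerBucket) out.set(BUCKET_LABELS[i], vals);
  });
  return out;
}
```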

Step 4.3: _shared/excursion-cluster-analyzer.ts — New module

New file.

TEST: excursions detected outside optimalRange [70, 180] for glucose
TEST: excursions detected outside personal P10/P90 when no optimalRange
TEST: contiguous excursion = ≥3 consecutive readings outside range
TEST: excursion start hours bucketed into 6 segments
TEST: clustered excursions → chi-squared significant, candidate emitted
TEST: uniformly distributed excursions → no candidate
TEST: <30 total excursions → skipped
TEST: candidate includes peak_bucket_label, excursions_in_peak, total_excursions, expected_if_uniform
TEST: effect_size uses Cramér's V
TEST: high and low excursions analyzed separately

Step 4.4: _shared/sequential-change-analyzer.ts — New module

New file.

-- Adaptive pair selection --
TEST: average daily profile computed across all days (mean per bucket per day, then mean across days)
TEST: steepest positive gradient pair identified correctly
TEST: steepest negative gradient pair identified correctly
TEST: at most 4-5 pairs tested (not all N×(N-1)/2)

-- Wilcoxon test on daily deltas --
TEST: consistent daily rise (glucose 3am→7am) → significant Wilcoxon, candidate emitted
TEST: random daily changes → Wilcoxon not significant, no candidate
TEST: candidate includes from_bucket, to_bucket, mean_delta, median_delta, consistency_pct
TEST: consistency_pct = % of days where delta has same sign as overall direction
TEST: consistency <50% → no candidate emitted even if Wilcoxon significant
TEST: <20 days with data in both buckets → skipped
TEST: effect_size = mean(deltas) / stddev(deltas)
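A sketch of the consistency metric on daily deltas, under the assumed definition that the "overall direction" is the sign of the median delta (the production analyzer may define direction differently):

```typescript
// consistency_pct: share of non-zero daily deltas whose sign matches the
// overall direction (sign of the median delta).
function consistencyPct(deltas: number[]): number {
  const nonZero = deltas.filter((d) => d !== 0);
  if (nonZero.length === 0) return 0;
  const sorted = [...nonZero].sort((a, b) => a - b);
  const dir = Math.sign(sorted[Math.floor(sorted.length / 2)]);
  const matching = nonZero.filter((d) => Math.sign(d) === dir).length;
  return (matching * 100) / nonZero.length;
}
```

Per the rule above, a candidate is suppressed when this falls below 50% even if the Wilcoxon test is significant.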

Step 4.5: _shared/stability-trend-analyzer.ts — New module

New file.

TEST: per-day CV% computed for intraday metrics
TEST: per-window CV% computed (overnight, daytime windows)
TEST: earlier-half vs recent-half CV% compared via MW-U
TEST: decreasing CV% → "stabilizing" candidate
TEST: increasing CV% → "destabilizing" candidate
TEST: stable CV% across period → no candidate
TEST: <28 days of data → skipped
TEST: 7-day rolling CV% for daily metrics (non-intraday)
TEST: effect_size is Cohen's d on the CV values

Step 4.6: Pipeline integration — wire analyzers into step 6

Modify pattern-spotter.ts.

TEST: each analyzer called with correct IntradayDataBundle data
TEST: each analyzer wrapped in try/catch
TEST: one analyzer throws → others still run, candidates collected
TEST: partial_analyzer_failure exit reason when ≥1 analyzer fails but others produce BH survivors
TEST: all_analyzers_failed exit reason when all throw
TEST: per-analyzer diagnostics: status, metrics_tested, candidates_produced, elapsed_ms, skip_reasons
TEST: intraday candidates have standard PatternCandidate fields (p_value, effect_size, change_pct)
TEST: intraday candidates flow through BH correction alongside daily candidates
TEST: intraday candidates flow through ranking, dedup, classification correctly
TEST: new narrative templates produce readable title + summary for each candidate type

Phase 5: Cleanup + Validation

Step 5.1: MetricDef simplification

TEST: canSegment, canOutcome, excludeOutcomes removed from MetricDef interface
TEST: getSegmentMetrics() and getOutcomeMetrics() removed (replaced by discovery)
TEST: getExcludedOutcomes() removed (replaced by blacklist)
TEST: metric-registry.ts only has: key, label, unit, sourceTable, providers
TEST: all existing tests still pass (no regressions from field removal)

Step 5.2: End-to-end validation

TEST: run full pipeline for test user with known timezone → verify discoveries are timezone-correct
TEST: run full pipeline with timezone: null → output matches v2 behavior (regression guard)
TEST: novel_survivors list populated in diagnostics
TEST: blacklist prevents all known tautological pairs from surfacing
TEST: at least one intraday candidate type survives BH for a user with CGM data

Execution Order (Critical Path)

                    ┌─────────────────────┐
                    │  Phase 0, Step 0.1  │  local-time.ts
                    │  (BLOCKING)         │
                    └────────┬────────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌────────────┐  ┌────────────┐  ┌────────────┐
     │ Step 0.2   │  │ Step 0.3   │  │ Phase 1    │
     │ tz-utils   │  │ aggregation│  │ Steps 1.1  │
     │            │  │ layer      │  │ 1.2, 1.3   │
     └────────────┘  └─────┬──────┘  └──────┬─────┘
              │             │                │
              ▼             ▼                ▼
     ┌────────────┐  ┌────────────┐  ┌────────────┐
     │ Step 0.5   │  │ Step 0.4   │  │ Phase 2    │
     │ sync funcs │  │ spotter    │  │ blacklist  │
     │ (parallel) │  │ fetch+tz   │  │ seeding    │
     └────────────┘  └─────┬──────┘  └──────┬─────┘
                            │                │
                            └───────┬────────┘
                                    ▼
                            ┌────────────┐
                            │ Phase 3    │
                            │ Discovery  │
                            │ layer      │
                            └──────┬─────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
           ┌────────────┐  ┌────────────┐  ┌────────────┐
           │ Step 4.1   │  │ Steps 4.2  │  │ Step 4.5   │
           │ Intraday   │  │ 4.3, 4.4   │  │ Stability  │
           │ bundle     │  │ (parallel) │  │ trend      │
           └──────┬─────┘  └──────┬─────┘  └──────┬─────┘
                  │               │               │
                  └───────────────┼───────────────┘
                                  ▼
                          ┌────────────┐
                          │ Step 4.6   │
                          │ Pipeline   │
                          │ integration│
                          └──────┬─────┘
                                 ▼
                          ┌────────────┐
                          │ Phase 5    │
                          │ Cleanup +  │
                          │ validation │
                          └────────────┘

Key parallelization opportunities:

  • Phase 1 (stats, catalog, blacklist) runs in parallel with Phase 0 Steps 0.2-0.5
  • Phase 2 (blacklist seeding) runs in parallel with Phase 0 completion
  • Steps 4.2, 4.3, 4.4, 4.5 (individual analyzers) are fully independent — can be implemented in parallel
  • Step 0.5 (sync functions) can be parallelized across providers

Critical path: Step 0.1 → Step 0.3 → Step 0.4 → Phase 3 → Step 4.1 → Step 4.6 → Phase 5


Appendix A: Comparison with v2

| Aspect | v2 (current) | v3 (proposed) |
| --- | --- | --- |
| Metric definition | 62 hardcoded entries | ~15 type patterns + auto-discovery |
| Segmentation eligibility | Hand-assigned canSegment flag | Data-driven (CV > threshold, ≥20 points) |
| Tautology prevention | Manual excludeOutcomes per metric | DB-backed blacklist, empirically seeded |
| Lagged effect pairs | Hardcoded LAGGED_SEGMENT_OUTCOMES | Cross-group testing, all eligible pairs |
| Analysis resolution | Daily only | Daily + intraday |
| Analyzer types | 3 (segment, lagged, trend) | 7 (+ temporal distribution, excursion cluster, sequential change, stability trend) |
| Statistical tests | Mann-Whitney U, Cohen's d | + Wilcoxon signed-rank, chi-squared |
| Provider coupling in analysis | providers field on each metric | Zero — analysis layer is provider-agnostic |
| Multi-factor support | Not possible | Foundation laid (date membership), Phase 6 adds stratified analysis |
| New provider effort | Update registry + aggregation + cleanser + spotter | Update extraction layer only |

Appendix B: Example Discoveries (not possible in v2)

| Discovery | Analyzer | Candidate type |
| --- | --- | --- |
| "Your glucose is 18% higher during Evening (17:00-21:00)" | Temporal Distribution | temporal_distribution |
| "Your high glucose events cluster between 17:00-21:00 (38% of all, expected 17%)" | Excursion Clustering | excursion_cluster |
| "Your glucose consistently rises +22 mg/dL between 3:00 and 6:00 (68% of days)" | Sequential Change | sequential_change |
| "Your overnight glucose variability has decreased 25% over 8 weeks" | Stability Trend | stability_trend |
| "Your heart rate is 12% lower on weekends vs weekdays" | Temporal Distribution | temporal_distribution |
| "Your heart rate spikes cluster in the 6:00-9:00 window" | Excursion Clustering | excursion_cluster |
| "Your resting HR is trending down 0.3 bpm/week (improving)" | Trend (existing) | trend |
| "On high glucose CV days, your deep sleep % is 11% lower" | Segment Comparison (existing, but newly eligible via auto-discovery) | segment_comparison |
| "High steps + early bedtime → 20% better HRV (steps alone: 12%)" | Stratified Analysis (Phase 6) | compound_interaction |
| "Workout + stable glucose → 25% more deep sleep (synergistic)" | Stratified Analysis (Phase 6) | compound_interaction |