| 1 | +# RFC 0001: Provider Capability Model, Health Scoring, and Fallback Policies |
| 2 | + |
| 3 | +Status: Proposed |
| 4 | +Authors: TextureHQ |
| 5 | +Created: 2025-08-15 |
| 6 | +Target: Minor release (backward-compatible) |
| 7 | + |
| 8 | +## Motivation |
| 9 | +We currently use a fixed provider precedence and implicit fallback. This RFC formalizes: |
| 10 | +- Provider capability metadata (what each provider supports) |
| 11 | +- Provider health scoring and circuit breakers |
| 12 | +- Pluggable fallback policies (priority, priority-then-health, weighted) |
| 13 | +- Normalized error taxonomy |
| 14 | +All defaults preserve existing behavior. |
| 15 | + |
| 16 | +## Goals & Non-Goals |
| 17 | +Goals |
| 18 | +- Additive types and options; no breaking API changes |
| 19 | +- Deterministic default behavior identical to today |
| 20 | +- Clear extension path for new providers |
| 21 | +Non-Goals |
| 22 | +- Changing existing return shapes by default |
| 23 | +- Mandatory logging/metrics dependencies |
| 24 | + |
| 25 | +## High-level Design |
| 26 | +We introduce a ProviderRegistry that tracks provider capabilities and health and selects providers per request via a policy engine. WeatherService uses the registry; if no config is provided, it registers the built-in providers in their current order and uses the priority policy, which preserves today's behavior. |
| 27 | + |
| 28 | +### Capability Model |
| 29 | +Each provider publishes static metadata: |
| 30 | +- id: string (e.g., "nws", "openweather") |
| 31 | +- supports: { current: boolean; hourly?: boolean; daily?: boolean; alerts?: boolean } |
| 32 | +- regions?: string[] | GeoJSON region hint (optional) |
| 33 | +- units?: ("standard"|"metric"|"imperial")[] (optional) |
| 34 | +- locales?: string[] (optional) |
| 35 | + |
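The field list above could map onto a TypeScript shape roughly as follows. This is a minimal sketch, not the final API: the `Intent` and `Unit` type names and the `supportsIntent` helper are assumptions introduced for illustration, the GeoJSON variant of `regions` is left out, and the capability values shown for each provider are placeholders rather than statements about what the real services support.

```typescript
// Hypothetical names: Unit, Intent, supportsIntent are not defined by the RFC.
type Unit = "standard" | "metric" | "imperial";
type Intent = "current" | "hourly" | "daily" | "alerts";

interface Capability {
  id: string;
  supports: { current: boolean; hourly?: boolean; daily?: boolean; alerts?: boolean };
  regions?: string[]; // GeoJSON region hints are omitted in this sketch
  units?: Unit[];
  locales?: string[];
}

// A missing optional flag is treated as "not supported" (an assumption).
function supportsIntent(cap: Capability, intent: Intent): boolean {
  return cap.supports[intent] === true;
}

// Placeholder capability values, for illustration only.
const nwsCapability: Capability = {
  id: "nws",
  supports: { current: true, hourly: true, alerts: true },
};
const openweatherCapability: Capability = {
  id: "openweather",
  supports: { current: true, hourly: true, daily: true },
  units: ["standard", "metric", "imperial"],
};

// Filter the known providers down to those that can serve a daily forecast.
const dailyCapable: string[] = [nwsCapability, openweatherCapability]
  .filter((c) => supportsIntent(c, "daily"))
  .map((c) => c.id);
```

Treating an absent flag as unsupported keeps capability filtering conservative: a provider is only tried for an intent it explicitly declares.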
| 36 | +### Health Model |
| 37 | +We keep a rolling in-memory snapshot per provider: |
| 38 | +- successRate: EMA or sliding window over last N calls |
| 39 | +- p95LatencyMs: recent percentile |
| 40 | +- lastFailureAt?: number (epoch) |
| 41 | +- circuit: "closed" | "open" | "half-open" |
| 42 | +Outcomes are recorded on every provider call: { ok: boolean, latencyMs, errorCode? }. |
| 43 | + |
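One way to maintain such a snapshot is sketched below, assuming the EMA option for `successRate` and a bounded window of recent latencies for `p95LatencyMs`. The `HealthTracker` class, its smoothing factor `alpha`, its `windowSize`, and the optimistic starting rate of 1 are all assumptions for illustration; the RFC does not fix any of them.

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface HealthSnapshot {
  successRate: number;
  p95LatencyMs: number;
  lastFailureAt?: number;
  circuit: CircuitState;
}

// Hypothetical helper; alpha and windowSize defaults are placeholders.
class HealthTracker {
  private successRate = 1; // optimistic start (assumption)
  private latencies: number[] = [];
  private lastFailureAt: number | undefined = undefined;
  private readonly alpha: number;
  private readonly windowSize: number;

  constructor(alpha = 0.2, windowSize = 50) {
    this.alpha = alpha;
    this.windowSize = windowSize;
  }

  record(ok: boolean, latencyMs: number, now: number = Date.now()): void {
    // Exponential moving average: newer outcomes weigh more than older ones.
    this.successRate = this.alpha * (ok ? 1 : 0) + (1 - this.alpha) * this.successRate;
    this.latencies.push(latencyMs);
    if (this.latencies.length > this.windowSize) this.latencies.shift();
    if (!ok) this.lastFailureAt = now;
  }

  snapshot(circuit: CircuitState = "closed"): HealthSnapshot {
    // Nearest-rank percentile over the retained latency window.
    const sorted = this.latencies.slice().sort((a, b) => a - b);
    const idx = Math.max(0, Math.ceil(sorted.length * 0.95) - 1);
    const snap: HealthSnapshot = {
      successRate: this.successRate,
      p95LatencyMs: sorted[idx] ?? 0,
      circuit,
    };
    if (this.lastFailureAt !== undefined) snap.lastFailureAt = this.lastFailureAt;
    return snap;
  }
}
```

The circuit state is passed in rather than computed here, so the breaker logic can live in its own module.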
| 44 | +### Error Taxonomy |
| 45 | +Normalize provider errors to: |
| 46 | +- NetworkError, RateLimitError, NotFoundError, ValidationError, ParseError, UpstreamError, UnavailableError |
| 47 | +Attach: { provider, status?, retryAfterMs?, endpoint? }. Do not log/propagate secrets. |
| 48 | +When all providers fail, return CompositeProviderError with per-provider normalized entries. |
| 49 | + |
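The taxonomy might be represented as a single error class carrying a `code`, plus the composite wrapper. This is a sketch under assumptions: the RFC calls for separate error classes, whereas this example folds them into one `NormalizedProviderError` for brevity, and the `fromHttpStatus` mapping is a hypothetical illustration, not a decided rule.

```typescript
type ErrorCode =
  | "NetworkError" | "RateLimitError" | "NotFoundError" | "ValidationError"
  | "ParseError" | "UpstreamError" | "UnavailableError";

interface ErrorContext {
  provider: string;
  status?: number;
  retryAfterMs?: number;
  endpoint?: string; // should already be redacted; never include credentials
}

// Hypothetical single class standing in for the per-kind classes.
class NormalizedProviderError extends Error {
  readonly code: ErrorCode;
  readonly context: ErrorContext;
  constructor(code: ErrorCode, context: ErrorContext) {
    super(`${code} from ${context.provider}`);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on ES5 targets
    this.name = code;
    this.code = code;
    this.context = context;
  }
}

class CompositeProviderError extends Error {
  readonly errors: NormalizedProviderError[];
  constructor(errors: NormalizedProviderError[]) {
    super(`All providers failed: ${errors.map((e) => `${e.context.provider}=${e.code}`).join(", ")}`);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "CompositeProviderError";
    this.errors = errors;
  }
}

// Assumed status mapping, for illustration only.
function fromHttpStatus(provider: string, status: number, retryAfterMs?: number): NormalizedProviderError {
  const code: ErrorCode =
    status === 429 ? "RateLimitError"
    : status === 404 ? "NotFoundError"
    : status === 400 || status === 422 ? "ValidationError"
    : status === 503 ? "UnavailableError"
    : "UpstreamError";
  const context: ErrorContext = { provider, status };
  if (retryAfterMs !== undefined) context.retryAfterMs = retryAfterMs;
  return new NormalizedProviderError(code, context);
}
```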
| 50 | +### Policies |
| 51 | +- priority (default): try in configured order |
| 52 | +- priority-then-health: priority order, but skip providers that are open-circuit or below health thresholds; probe half-open last |
| 53 | +- weighted: choose the initial provider by weight among healthy providers; fall back to the next-best healthy provider |
| 54 | + |
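The three policies can be sketched as one ordering function. Several choices here are assumptions, not decisions made by this RFC: `ProviderState` and `orderCandidates` are hypothetical names, half-open providers are probed regardless of thresholds, a missing weight defaults to 1, and "next-best" under the weighted policy is read as "next-highest weight". The random source is injectable so the behavior can be tested deterministically.

```typescript
type Policy = "priority" | "priority-then-health" | "weighted";
type CircuitState = "closed" | "open" | "half-open";

// Hypothetical per-provider view, listed in configured priority order.
interface ProviderState {
  id: string;
  circuit: CircuitState;
  meetsThresholds: boolean;
}

function orderCandidates(
  policy: Policy,
  providers: ProviderState[],
  weights: Record<string, number> = {},
  rand: () => number = Math.random,
): string[] {
  if (policy === "priority") return providers.map((p) => p.id);

  const healthy = providers.filter((p) => p.circuit === "closed" && p.meetsThresholds);
  if (policy === "priority-then-health") {
    // Healthy providers keep their priority order; half-open probes go last.
    const probes = providers.filter((p) => p.circuit === "half-open");
    return [...healthy, ...probes].map((p) => p.id);
  }

  // weighted: pick the first provider by weighted random choice.
  if (healthy.length === 0) return [];
  const weightOf = (id: string): number => weights[id] ?? 1;
  const total = healthy.reduce((sum, p) => sum + weightOf(p.id), 0);
  let r = rand() * total;
  let first = healthy[healthy.length - 1];
  for (const p of healthy) {
    r -= weightOf(p.id);
    if (r < 0) {
      first = p;
      break;
    }
  }
  // Remaining healthy providers follow in descending weight order.
  const rest = healthy
    .filter((p) => p !== first)
    .sort((a, b) => weightOf(b.id) - weightOf(a.id));
  return [first, ...rest].map((p) => p.id);
}
```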
| 55 | +### Configuration (all optional) |
| 56 | +WeatherService options additions: |
| 57 | +- providerPolicy?: "priority" | "priority-then-health" | "weighted" |
| 58 | +- providerWeights?: Record<string, number> |
| 59 | +- healthThresholds?: { minSuccessRate?: number; maxP95Ms?: number } |
| 60 | +- circuit?: { failureCountToOpen?: number; halfOpenAfterMs?: number; successToClose?: number } |
| 61 | +- logger?: { trace, debug, info, warn, error }, each method accepting structured fields |
| 62 | +- metrics?: hooks for counters/histograms (noop by default) |
| 63 | + |
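As a sketch of how these options might be typed and resolved, assuming the hypothetical names `ProviderOptions`, `resolvePolicy`, and `meetsThresholds` (the logger and metrics hooks are left out here). The numeric values in the example config are placeholders, since default thresholds remain an open question.

```typescript
interface ProviderOptions {
  providerPolicy?: "priority" | "priority-then-health" | "weighted";
  providerWeights?: Record<string, number>;
  healthThresholds?: { minSuccessRate?: number; maxP95Ms?: number };
  circuit?: { failureCountToOpen?: number; halfOpenAfterMs?: number; successToClose?: number };
}

// With no options, the policy resolves to "priority", matching today's behavior.
function resolvePolicy(options: ProviderOptions = {}): "priority" | "priority-then-health" | "weighted" {
  return options.providerPolicy ?? "priority";
}

// An omitted threshold imposes no constraint (an assumption).
function meetsThresholds(
  health: { successRate: number; p95LatencyMs: number },
  thresholds: { minSuccessRate?: number; maxP95Ms?: number } = {},
): boolean {
  if (thresholds.minSuccessRate !== undefined && health.successRate < thresholds.minSuccessRate) return false;
  if (thresholds.maxP95Ms !== undefined && health.p95LatencyMs > thresholds.maxP95Ms) return false;
  return true;
}

// Example opt-in configuration; all numbers are illustrative placeholders.
const exampleOptions: ProviderOptions = {
  providerPolicy: "priority-then-health",
  healthThresholds: { minSuccessRate: 0.95, maxP95Ms: 2000 },
  circuit: { failureCountToOpen: 5, halfOpenAfterMs: 30000, successToClose: 2 },
};
```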
| 64 | +## Detailed Design |
| 65 | + |
| 66 | +### New/updated modules |
| 67 | +- src/providers/providerRegistry.ts |
| 68 | + - register(providerId, adapter, capability) |
| 69 | + - recordOutcome(providerId, outcome) |
| 70 | + - getHealth(providerId) |
| 71 | + - listProviders(intent): filters by capabilities |
| 72 | +- src/providers/policy.ts |
| 73 | + - selectCandidates(intent, registry, config): ProviderId[] with reasons for skips |
| 74 | +- src/errors.ts (additions) |
| 75 | + - new error classes + CompositeProviderError |
| 76 | +- src/providers/* adapters |
| 77 | + - export capability metadata |
| 78 | + - report outcomes via registry hook |
| 79 | +- src/weatherService.ts |
| 80 | + - wire registry + policy; defaults keep current order and behavior when no options provided |
| 81 | + |
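A skeletal registry exposing the four methods listed above might look like this. It is a sketch under assumptions: the adapter is typed as `unknown` because the adapter interface is not specified here, the health summary is reduced to a call count and a plain success ratio, and registration order is used as the listing order.

```typescript
type Intent = "current" | "hourly" | "daily" | "alerts";
type Outcome =
  | { ok: true; latencyMs: number }
  | { ok: false; latencyMs: number; code: string; status?: number; retryAfterMs?: number };

interface Capability {
  id: string;
  supports: { current: boolean; hourly?: boolean; daily?: boolean; alerts?: boolean };
}

interface RegistryEntry {
  adapter: unknown; // adapter shape is outside the scope of this sketch
  capability: Capability;
  outcomes: Outcome[];
}

class ProviderRegistry {
  // Map preserves insertion order, which doubles as priority order here.
  private entries = new Map<string, RegistryEntry>();

  register(providerId: string, adapter: unknown, capability: Capability): void {
    this.entries.set(providerId, { adapter, capability, outcomes: [] });
  }

  recordOutcome(providerId: string, outcome: Outcome): void {
    const entry = this.entries.get(providerId);
    if (entry) entry.outcomes.push(outcome);
  }

  getHealth(providerId: string): { calls: number; successRate: number } | undefined {
    const entry = this.entries.get(providerId);
    if (!entry) return undefined;
    const calls = entry.outcomes.length;
    const successes = entry.outcomes.filter((o) => o.ok).length;
    // With no recorded calls, report 1 (optimistic; an assumption).
    return { calls, successRate: calls === 0 ? 1 : successes / calls };
  }

  listProviders(intent: Intent): string[] {
    const ids: string[] = [];
    this.entries.forEach((entry, id) => {
      if (entry.capability.supports[intent] === true) ids.push(id);
    });
    return ids;
  }
}
```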
| 82 | +### Data types (sketch) |
| 83 | +- ProviderId = "nws" | "openweather" | string |
| 84 | +- Capability |
| 85 | +- HealthSnapshot |
| 86 | +- Outcome: { ok: true, latencyMs } | { ok: false, latencyMs, code, status?, retryAfterMs? } |
| 87 | +- PolicyConfig (see above) |
| 88 | + |
| 89 | +### Circuit Breaker |
| 90 | +- Open after failureCountToOpen consecutive failures |
| 91 | +- Half-open after halfOpenAfterMs; allow a limited number of probes |
| 92 | +- Close after successToClose consecutive successes |
| 93 | + |
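The transitions above can be sketched as a small state machine with an injected clock, which makes it straightforward to drive from fake timers in tests. Assumptions in this sketch: the default values are placeholders (defaults are still an open question), a failure while half-open reopens the circuit immediately, and limiting the number of concurrent probes is left to the caller.

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface CircuitConfig {
  failureCountToOpen?: number;
  halfOpenAfterMs?: number;
  successToClose?: number;
}

class CircuitBreaker {
  private state: CircuitState = "closed";
  private consecutiveFailures = 0;
  private consecutiveSuccesses = 0;
  private openedAt = 0;
  private readonly failureCountToOpen: number;
  private readonly halfOpenAfterMs: number;
  private readonly successToClose: number;

  constructor(cfg: CircuitConfig = {}) {
    // Placeholder defaults, not values proposed by the RFC.
    this.failureCountToOpen = cfg.failureCountToOpen ?? 5;
    this.halfOpenAfterMs = cfg.halfOpenAfterMs ?? 30000;
    this.successToClose = cfg.successToClose ?? 2;
  }

  // Returns the current state, moving open -> half-open once the wait has elapsed.
  getState(now: number): CircuitState {
    if (this.state === "open" && now - this.openedAt >= this.halfOpenAfterMs) {
      this.state = "half-open";
      this.consecutiveSuccesses = 0;
    }
    return this.state;
  }

  record(ok: boolean, now: number): void {
    const state = this.getState(now);
    if (ok) {
      this.consecutiveFailures = 0;
      if (state === "half-open") {
        this.consecutiveSuccesses += 1;
        if (this.consecutiveSuccesses >= this.successToClose) this.state = "closed";
      }
      return;
    }
    this.consecutiveFailures += 1;
    this.consecutiveSuccesses = 0;
    // Assumption: any failure during a half-open probe reopens the circuit.
    if (state === "half-open" || this.consecutiveFailures >= this.failureCountToOpen) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```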
| 94 | +## Backward Compatibility |
| 95 | +- Defaults: priority policy, all providers registered in existing precedence, no health filtering, and no circuit breaker unless explicitly enabled (in which case conservative default thresholds apply) |
| 96 | +- Existing method signatures unchanged |
| 97 | + |
| 98 | +## Observability |
| 99 | +- Optional logger interface with structured fields |
| 100 | +- Optional metrics hooks (provider_success/failure, latency histograms, circuit_state) |
| 101 | +- CorrelationId is accepted/propagated when provided |
| 102 | + |
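The optional hooks could be shaped as below, with no-op defaults so that neither logging nor metrics becomes a required dependency. This is a sketch: the `Logger` and `Metrics` method signatures, the `reportOutcome` helper, and the `provider_latency_ms` histogram name are assumptions; only `provider_success` and `provider_failure` come from the list above.

```typescript
type Fields = Record<string, unknown>;

interface Logger {
  trace(msg: string, fields?: Fields): void;
  debug(msg: string, fields?: Fields): void;
  info(msg: string, fields?: Fields): void;
  warn(msg: string, fields?: Fields): void;
  error(msg: string, fields?: Fields): void;
}

interface Metrics {
  count(name: string, tags?: Record<string, string>): void;
  histogram(name: string, value: number, tags?: Record<string, string>): void;
}

const noop = (): void => {};
const noopLogger: Logger = { trace: noop, debug: noop, info: noop, warn: noop, error: noop };
const noopMetrics: Metrics = { count: noop, histogram: noop };

// Hypothetical helper called once per provider call.
function reportOutcome(
  provider: string,
  ok: boolean,
  latencyMs: number,
  hooks: { logger?: Logger; metrics?: Metrics; correlationId?: string } = {},
): void {
  const logger = hooks.logger ?? noopLogger;
  const metrics = hooks.metrics ?? noopMetrics;
  metrics.count(ok ? "provider_success" : "provider_failure", { provider });
  metrics.histogram("provider_latency_ms", latencyMs, { provider });
  const fields: Fields = { provider, latencyMs };
  // The correlation id is propagated only when the caller supplied one.
  if (hooks.correlationId !== undefined) fields["correlationId"] = hooks.correlationId;
  if (ok) logger.debug("provider call succeeded", fields);
  else logger.warn("provider call failed", fields);
}
```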
| 103 | +## Testing Strategy |
| 104 | +- Unit: capability validation, policy selection logic, circuit transitions (fake timers), error normalization |
| 105 | +- Integration: simulated provider failures and latency distributions; ensure default path equals current behavior |
| 106 | +- Snapshot: CompositeProviderError structure |
| 107 | + |
| 108 | +## Rollout Plan |
| 109 | +1. Land types/interfaces, provider capability exports, no behavior change |
| 110 | +2. Implement registry + outcome recording; keep disabled by default |
| 111 | +3. Implement policy engine; enable via options, default unchanged |
| 112 | +4. Add circuit + thresholds with conservative defaults (opt-in) |
| 113 | +5. Docs + examples; minor release |
| 114 | + |
| 115 | +## Alternatives Considered |
| 116 | +- Single global retry layer only (insufficient insight/control) |
| 117 | +- Hard-coding health logic inside WeatherService (less modular/extendable) |
| 118 | + |
| 119 | +## Security & Privacy |
| 120 | +- Never include tokens or credentials in errors/logs/metrics |
| 121 | +- Ensure PII isn’t logged; redact query params as needed |
| 122 | + |
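Query-parameter redaction could be done with a small string helper before an endpoint is attached to an error or written to a log. This is a simplified sketch: the `redactQuery` name and the default list of sensitive parameter names are assumptions, and callers that treat coordinates as PII could pass names such as `lat` and `lon` as well.

```typescript
// Assumed list of credential-like parameter names; adjust per provider.
const SENSITIVE_PARAMS: string[] = ["appid", "apikey", "api_key", "token", "key"];

// Replaces the values of sensitive query parameters with "REDACTED".
// Simplified: assumes any "?" in the string starts the query component.
function redactQuery(url: string, sensitive: string[] = SENSITIVE_PARAMS): string {
  const q = url.indexOf("?");
  if (q === -1) return url;
  const hashAt = url.indexOf("#", q);
  const query = hashAt === -1 ? url.slice(q + 1) : url.slice(q + 1, hashAt);
  const fragment = hashAt === -1 ? "" : url.slice(hashAt);
  const redacted = query
    .split("&")
    .map((pair) => {
      const eq = pair.indexOf("=");
      if (eq === -1) return pair; // no value to redact
      const key = pair.slice(0, eq);
      return sensitive.indexOf(key.toLowerCase()) !== -1 ? `${key}=REDACTED` : pair;
    })
    .join("&");
  return `${url.slice(0, q)}?${redacted}${fragment}`;
}
```

For example, `redactQuery("https://api.example.com/weather?lat=1&appid=secret")` keeps `lat=1` and replaces the `appid` value.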
| 123 | +## Open Questions |
| 124 | +- Default thresholds for health and circuit when enabled? |
| 125 | +- How granular should regions be (country list vs polygons)? |
| 126 | +- Should we persist health across process restarts? |