Production Hardening Checklist

## Production Hardening Checklist

Findings from a code audit of both APIs ahead of the 20k-user launch. **None of these block launch**: billing correctness was fixed in the v6 credits PR. The items below are operational improvements to harden the platform.

---

### Critical (P0) — Security

- [ ] **[token-metering]** Remove X-Dev-Admin header bypass or add cryptographic signature verification (`core/auth.py:147`)
- [ ] **[token-metering]** Validate `SSO_URL` at startup so a bad value fails fast instead of surfacing as a 500 on the first auth attempt (`config.py:28`)
- [ ] **[token-metering]** Audit CORS configuration and ensure credentials are not allowed alongside overly permissive origins (`main.py:80-86`)
- [ ] **[study-mode]** Fix admin secret timing attack — use `secrets.compare_digest()` (`main.py:290`)
- [ ] **[study-mode]** Add JWKS cache staleness limit (max 24h) before failing hard (`auth.py:64-68`)
- [ ] **[study-mode]** Ensure `.gitignore` covers all `.env` files to prevent credential leaks
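
For the admin secret timing item above, a minimal sketch of a constant-time check with `secrets.compare_digest()`. The function name `is_admin_secret_valid` and the way the secret reaches the handler are assumptions for illustration, not the actual code at `main.py:290`.

```python
import secrets
from typing import Optional


def is_admin_secret_valid(provided: Optional[str], expected: str) -> bool:
    """Return True only when the provided secret matches the expected one.

    Hypothetical helper: a plain `==` can return as soon as the first
    differing character is found, which leaks timing information.
    `secrets.compare_digest()` is designed to reduce that leak.
    """
    # Reject missing or empty values up front so an unset secret never matches.
    if not provided or not expected:
        return False
    # Encode to bytes so non-ASCII input does not raise TypeError.
    return secrets.compare_digest(
        provided.encode("utf-8"), expected.encode("utf-8")
    )
```

Rejecting empty values before the comparison also guards against a misconfigured deployment where the expected secret is blank.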

---

### High (P1) — Reliability & Performance

**Token Metering API:**
- [ ] Add `pool_recycle=3600` to DB engine to prevent stale connections (`core/database.py:56`)
- [ ] Configure DB connection timeout to prevent indefinite hangs (`core/database.py:53-61`)
- [ ] Add composite index `(user_id, request_id)` on token_transactions for idempotency lookups
- [ ] Add composite index `(model, is_active, effective_date DESC)` on pricing table
- [ ] Fix slowapi's synchronous Redis client blocking the event loop (`core/rate_limit.py:30-52`)
- [ ] Health endpoint should verify Redis Lua scripts are loaded (`routes/health.py:26-47`)
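
The two composite indexes listed above can be sketched as follows. This uses an in-memory SQLite database purely so the example is self-contained; the production database, the index names, and every column other than those named in the checklist are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical minimal schemas; only the indexed columns come from the checklist.
    CREATE TABLE token_transactions (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        request_id TEXT,
        amount INTEGER NOT NULL
    );
    CREATE INDEX ix_token_tx_user_request
        ON token_transactions (user_id, request_id);

    CREATE TABLE pricing (
        id INTEGER PRIMARY KEY,
        model TEXT NOT NULL,
        is_active INTEGER NOT NULL,
        effective_date TEXT NOT NULL,
        price REAL NOT NULL
    );
    CREATE INDEX ix_pricing_model_active_date
        ON pricing (model, is_active, effective_date DESC);
    """
)


def query_plan(sql: str, params: tuple) -> str:
    """Return SQLite's query plan text; the last column of each row is the detail."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


# Idempotency lookup: both equality predicates are covered by the index prefix.
idempotency_plan = query_plan(
    "SELECT * FROM token_transactions WHERE user_id = ? AND request_id = ?",
    ("user-1", "req-1"),
)

# Latest active price for a model: equality on the first two columns,
# ordering on the third.
pricing_plan = query_plan(
    "SELECT * FROM pricing WHERE model = ? AND is_active = 1 "
    "ORDER BY effective_date DESC LIMIT 1",
    ("some-model",),
)
```

Checking the plan the same way on the real database (for example with its own `EXPLAIN`) is a reasonable way to confirm the indexes are actually used before and after the migration.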

**Study Mode API:**
- [ ] Add retry logic (2-3 attempts with exponential backoff) to the metering HTTP client (`metering/client.py:91-153`)
- [ ] Fix metering client connection pool lifecycle — proper singleton with cleanup (`metering/client.py:33-40`)
- [ ] Add timeout to LLM streaming — `asyncio.wait_for()` wrapper around `Runner.run_streamed()` (`chatkit_server.py:325`)
- [ ] Add backpressure to background writer queue — reject or circuit-break at >900 items (`chatkit_store/cached_postgres_store.py:61-67`)
- [ ] Increase DB pool size or add semaphore for concurrent DB operations (`chatkit_store/postgres_store.py:90-96`)
- [ ] Add circuit breaker to GitHub content fetcher — fail fast after 3 consecutive failures (`services/content_loader.py:68-74`)
- [ ] Optimize Redis SCAN operations — limit total keys scanned (`chatkit_store/cached_postgres_store.py:208-216`)
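
For the metering client retry item above, a minimal sketch of bounded retries with exponential backoff and jitter. `TransientError` and `call_with_retry` are hypothetical names; the real client would need to decide which failures (timeouts, 5xx responses) count as retryable.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. timeouts, 5xx)."""


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.2,
    max_delay: float = 2.0,
) -> T:
    """Run `operation`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except TransientError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            # Delay cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            await asyncio.sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Retrying a deduct call is only safe if the request is idempotent, so this presumably relies on the same `request_id` being sent on every attempt.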

---

### Medium (P2) — Observability & Operations

**Token Metering API:**
- [ ] Emit metrics/alerts when Redis fails and metering enters fail-open mode (`services/metering.py:218`)
- [ ] Add request duration logging for SLA monitoring (`routes/metering.py`)
- [ ] Track idempotent deduct replay frequency with metrics (`services/metering.py:296`)
- [ ] Validate configuration at startup — fail fast on invalid REDIS_URL (`config.py`)
- [ ] Add structured logging with correlation fields (user_id, request_id)
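
One way to approach the structured logging item above, using only the standard library. The logger name, the JSON field layout, and the use of `contextvars` to carry `user_id` and `request_id` are assumptions; the service may already have a logging setup this would need to fit into.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request context; middleware would set these on each request.
user_id_var = contextvars.ContextVar("user_id", default=None)
request_id_var = contextvars.ContextVar("request_id", default=None)


class CorrelationFilter(logging.Filter):
    """Attach the current correlation fields to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.user_id = user_id_var.get()
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "user_id": getattr(record, "user_id", None),
                "request_id": getattr(record, "request_id", None),
            }
        )


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("metering.example")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: set the context once, then log normally.
buffer = io.StringIO()
log = build_logger(buffer)
user_id_var.set("u-42")
request_id_var.set("req-7")
log.info("deducted %d credits", 5)
entry = json.loads(buffer.getvalue())
```

Because the fields come from context variables, individual log calls do not need to pass `user_id` or `request_id` explicitly.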

**Study Mode API:**
- [ ] Add metering API health check to `/health` endpoint (`main.py:232-266`)
- [ ] Propagate X-Request-ID to all log statements and downstream services
- [ ] Add alerting when rate limiter enters fail-open mode (`core/rate_limit.py:109-119`)
- [ ] Add background writer queue metrics (depth, write duration, failure rate)
- [ ] Track active WebSocket connections with periodic cleanup registry
- [ ] Monitor cache hit/miss rates
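
For the cache hit/miss item above, a minimal sketch of counting hits and misses at the cache boundary. `CacheStats` and `CountingCache` are hypothetical; the dict stands in for whatever Redis-backed cache the service uses, and the counters would normally be exported to the metrics system rather than read directly.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from cache; 0.0 before any lookups."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class CountingCache:
    """In-memory stand-in for a real cache that records every lookup."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self.stats = CacheStats()

    def get(self, key: str) -> Optional[Any]:
        hit = key in self._data
        self.stats.record(hit)
        return self._data.get(key)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


# Usage: one miss followed by two hits.
cache = CountingCache()
cache.get("thread:1")
cache.set("thread:1", {"title": "example"})
cache.get("thread:1")
cache.get("thread:1")
```

The counters are not thread-safe as written; that is acceptable on a single event loop but would need a lock if lookups can happen from multiple threads.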

---

### Low (P3) — Code Quality

- [ ] **[token-metering]** Fix `request_id` unique constraint — nullable defeats uniqueness intent (`models/transaction.py:89`)
- [ ] **[token-metering]** Validate UUID version in `UUID_PATTERN` (`routes/schemas.py:26-27`)
- [ ] **[token-metering]** Make DEFAULT_PRICING configurable without deployment (`services/metering.py:46-51`)
- [ ] **[token-metering]** Enable SSL cert verification for DB connections (`core/database.py:34-37`)
- [ ] **[study-mode]** Remove TODO comment in `fte/tools.py:82` or implement feature
- [ ] **[study-mode]** Move hardcoded config values to `config.py` (`chatkit_server.py:37-39`)
- [ ] **[study-mode]** Fix silent exception swallowing in content cache parsing (`services/content_loader.py:143-144`)
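
For the `UUID_PATTERN` item above, a sketch of a pattern that checks the version and variant nibbles. It assumes version 4 UUIDs are what the API expects, which the checklist does not state; `UUID_V4_PATTERN` and `is_uuid_v4` are illustrative names.

```python
import re
import uuid

# Version nibble must be 4; variant nibble must be 8, 9, a, or b (RFC 4122 variant).
UUID_V4_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_uuid_v4(value: str) -> bool:
    """Return True if `value` is a canonical, hyphenated version 4 UUID string."""
    # fullmatch rejects trailing characters, including a trailing newline.
    return UUID_V4_PATTERN.fullmatch(value) is not None


# Usage: a freshly generated v4 UUID passes, a version 1 UUID does not.
sample_v4 = str(uuid.uuid4())
sample_v1 = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
```

Using `fullmatch` instead of `^...$` anchors avoids the case where `$` matches just before a trailing newline.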

---

### Context

- **Source**: Code audit during v6 credits migration (PR on `001-freemium-tracker` branch)
- **Scope**: 34 files, ~8,600 lines across both APIs
- **Not blocking**: None of these affect billing correctness (fixed in v6 credits PR)
- **Prioritization**: P0 items should be addressed before scaling beyond beta users

🤖 Generated with [Claude Code](https://claude.com/claude-code)


Production Hardening Checklist — Token Metering + Study Mode APIs #690
