Production Hardening Checklist
Findings from a code audit of both APIs (token-metering and study-mode) ahead of the 20k-user launch. None of these block launch: billing correctness was fixed in the v6 credits PR. These are operational improvements to harden the platform.
Critical (P0) — Security
- [token-metering] Remove X-Dev-Admin header bypass or add cryptographic signature verification (core/auth.py:147)
- [token-metering] Validate SSO_URL at startup: fail fast instead of 500 on first auth attempt (config.py:28)
- [token-metering] Audit CORS configuration: ensure credentials not allowed with overly permissive origins (main.py:80-86)
- [study-mode] Fix admin secret timing attack: use secrets.compare_digest() (main.py:290)
- [study-mode] Add JWKS cache staleness limit (max 24h) before failing hard (auth.py:64-68)
- [study-mode] Ensure .gitignore covers all .env files to prevent credential leaks
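
The timing-attack item above can be illustrated with a minimal sketch. Only `secrets.compare_digest()` comes from the checklist; `ADMIN_SECRET` and `is_admin_request` are hypothetical names and do not reflect the actual code in main.py.

```python
import secrets
from typing import Optional

# Hypothetical value for illustration; a real service would load it from config.
ADMIN_SECRET = "example-admin-secret"


def is_admin_request(provided: Optional[str]) -> bool:
    """Check a caller-supplied admin secret in constant time.

    A plain `==` may return as soon as the first differing character is
    found, so response timing can leak how much of the secret matched.
    """
    if provided is None:
        return False
    # Compare bytes: compare_digest() raises TypeError for non-ASCII str input.
    return secrets.compare_digest(
        provided.encode("utf-8"), ADMIN_SECRET.encode("utf-8")
    )
```

Rejecting a missing header before the comparison keeps the function total, and encoding both sides avoids an unhandled exception on unusual input.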
High (P1) — Reliability & Performance
Token Metering API:
- Add pool_recycle=3600 to DB engine to prevent stale connections (core/database.py:56)
- Configure DB connection timeout to prevent indefinite hangs (core/database.py:53-61)
- Add composite index (user_id, request_id) on token_transactions for idempotency lookups
- Add composite index (model, is_active, effective_date DESC) on pricing table
- Fix Slowapi sync Redis client blocking event loop (core/rate_limit.py:30-52)
- Health endpoint should verify Redis Lua scripts are loaded (routes/health.py:26-47)
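
As a rough illustration of the composite index item above, the sketch below creates a `(user_id, request_id)` index and checks that an idempotency lookup uses it. SQLite is used only so the example runs standalone; the `tokens` column and the index name `ix_token_tx_user_request` are assumptions, and the production database would be checked with its own `EXPLAIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the real table; the tokens column is hypothetical.
conn.execute(
    """
    CREATE TABLE token_transactions (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        request_id TEXT,
        tokens INTEGER NOT NULL
    )
    """
)

# Composite index covering the two columns used by idempotency lookups.
conn.execute(
    "CREATE INDEX ix_token_tx_user_request "
    "ON token_transactions (user_id, request_id)"
)


def plan_for_idempotency_lookup(connection: sqlite3.Connection) -> str:
    """Return the query plan text for a lookup by (user_id, request_id)."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id FROM token_transactions "
        "WHERE user_id = ? AND request_id = ?",
        ("user-1", "req-1"),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan = plan_for_idempotency_lookup(conn)
```

If the index name does not appear in `plan`, the lookup is falling back to a table scan, which is the case this checklist item aims to avoid.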
Study Mode API:
- Add retry logic (2-3 attempts w/ exponential backoff) to metering HTTP client (metering/client.py:91-153)
- Fix metering client connection pool lifecycle: proper singleton with cleanup (metering/client.py:33-40)
- Add timeout to LLM streaming: asyncio.wait_for() wrapper around Runner.run_streamed() (chatkit_server.py:325)
- Add backpressure to background writer queue: reject or circuit-break at >900 items (chatkit_store/cached_postgres_store.py:61-67)
- Increase DB pool size or add semaphore for concurrent DB operations (chatkit_store/postgres_store.py:90-96)
- Add circuit breaker to GitHub content fetcher: fail fast after 3 consecutive failures (services/content_loader.py:68-74)
- Optimize Redis SCAN operations: limit total keys scanned (chatkit_store/cached_postgres_store.py:208-216)
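
The retry item above (2-3 attempts with exponential backoff) might look roughly like the helper below. It is a generic sketch, not the metering client's code: `retry_with_backoff`, its parameters, and the default delays are all assumptions, and the retried exception types would need to match whatever the real HTTP client raises.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 0.2,
    max_delay: float = 2.0,
    retry_on: tuple = (ConnectionError, asyncio.TimeoutError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `call` up to `attempts` times, doubling the delay cap each retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the cap so that many
            # clients do not retry in lockstep after a shared outage.
            await sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Retrying a deduct call is only safe if the call is idempotent, which the `(user_id, request_id)` idempotency lookups listed earlier suggest. A per-attempt timeout can be combined with this by passing `lambda: asyncio.wait_for(do_request(), timeout=5)` as `call`, where `do_request` is a placeholder for the real request coroutine.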
Medium (P2) — Observability & Operations
Token Metering API:
- Emit metrics/alerts when Redis fails and metering enters fail-open mode (services/metering.py:218)
- Add request duration logging for SLA monitoring (routes/metering.py)
- Track idempotent deduct replay frequency with metrics (services/metering.py:296)
- Validate configuration at startup: fail fast on invalid REDIS_URL (config.py)
- Add structured logging with correlation fields (user_id, request_id)
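
One possible shape for the structured logging item above, using only the standard library. The field names `user_id` and `request_id` come from the checklist; the logger name, `CorrelationFilter`, `JsonFormatter`, and the JSON layout are assumptions made for this sketch.

```python
import contextvars
import io
import json
import logging

# Context variables carry per-request values across awaits without
# threading them through every function signature.
user_id_var = contextvars.ContextVar("user_id", default=None)
request_id_var = contextvars.ContextVar("request_id", default=None)


class CorrelationFilter(logging.Filter):
    """Copy the current correlation fields onto every log record."""

    def filter(self, record):
        record.user_id = user_id_var.get()
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "user_id": getattr(record, "user_id", None),
                "request_id": getattr(record, "request_id", None),
            }
        )


def build_logger(stream):
    logger = logging.getLogger("metering.example")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)

# In a web app, middleware would set these once per request.
user_id_var.set("user-123")
request_id_var.set("req-abc")
log.info("deduct completed")
```

Because the filter reads from context variables, individual log calls stay unchanged while every line still carries the correlation fields.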
Study Mode API:
- Add metering API health check to /health endpoint (main.py:232-266)
- Propagate X-Request-ID to all log statements and downstream services
- Add alerting when rate limiter enters fail-open mode (core/rate_limit.py:109-119)
- Add background writer queue metrics (depth, write duration, failure rate)
- Track active WebSocket connections with periodic cleanup registry
- Monitor cache hit/miss rates
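
For the cache hit/miss item above, a minimal in-process counter could look like the sketch below. `CacheStats` and `InstrumentedCache` are hypothetical names, and the dict-backed cache is only a stand-in for the real cached store; a production version would more likely export these counters to a metrics system.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running hit/miss counters for one cache."""

    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        # Report 0.0 rather than dividing by zero before any lookups happen.
        return self.hits / total if total else 0.0


class InstrumentedCache:
    """Dict-backed stand-in that records a hit or miss on every lookup."""

    def __init__(self) -> None:
        self._data = {}
        self.stats = CacheStats()

    def get(self, key):
        hit = key in self._data
        self.stats.record(hit)
        return self._data.get(key)

    def set(self, key, value) -> None:
        self._data[key] = value


cache = InstrumentedCache()
cache.get("thread-1")            # miss: nothing stored yet
cache.set("thread-1", {"n": 1})
for _ in range(3):
    cache.get("thread-1")        # three hits
```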
Low (P3) — Code Quality
- [token-metering] Fix request_id unique constraint: nullable defeats uniqueness intent (models/transaction.py:89)
- [token-metering] Validate UUID version in UUID_PATTERN (routes/schemas.py:26-27)
- [token-metering] Make DEFAULT_PRICING configurable without deployment (services/metering.py:46-51)
- [token-metering] Enable SSL cert verification for DB connections (core/database.py:34-37)
- [study-mode] Remove TODO comment in fte/tools.py:82 or implement feature
- [study-mode] Move hardcoded config values to config.py (chatkit_server.py:37-39)
- [study-mode] Fix silent exception swallowing in content cache parsing (services/content_loader.py:143-144)
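
The nullable `request_id` item above rests on how SQL treats NULL in unique constraints: NULLs are considered distinct, so any number of rows without a `request_id` pass the check. The sketch below demonstrates this with SQLite so it runs standalone; PostgreSQL behaves the same way by default. The table names are hypothetical, and `NOT NULL` is only one possible fix, valid if every transaction is expected to carry a `request_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nullable_tx (id INTEGER PRIMARY KEY, request_id TEXT UNIQUE)"
)
conn.execute(
    "CREATE TABLE strict_tx "
    "(id INTEGER PRIMARY KEY, request_id TEXT NOT NULL UNIQUE)"
)

# With a nullable column, two rows lacking a request_id are both accepted.
conn.execute("INSERT INTO nullable_tx (request_id) VALUES (NULL)")
conn.execute("INSERT INTO nullable_tx (request_id) VALUES (NULL)")
nullable_rows = conn.execute("SELECT COUNT(*) FROM nullable_tx").fetchone()[0]


def try_insert(table: str, request_id) -> bool:
    """Return True if the insert succeeds, False on a constraint violation."""
    try:
        conn.execute(
            f"INSERT INTO {table} (request_id) VALUES (?)", (request_id,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


missing_rejected = not try_insert("strict_tx", None)      # NOT NULL violation
first_accepted = try_insert("strict_tx", "req-1")
duplicate_rejected = not try_insert("strict_tx", "req-1")  # UNIQUE violation
```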
Context
- Source: Code audit during v6 credits migration (PR on 001-freemium-tracker branch)
- Scope: 34 files, ~8,600 lines across both APIs
- Not blocking: None of these affect billing correctness (fixed in v6 credits PR)
- Prioritization: P0 items should be addressed before scaling beyond beta users
🤖 Generated with Claude Code