Production Hardening Checklist — Token Metering + Study Mode APIs #690

@mjunaidca

Production Hardening Checklist

Findings from a code audit of both APIs ahead of the 20k-user launch. None of these items blocks launch; billing correctness was fixed in the v6 credits PR. They are operational improvements to harden the platform.


Critical (P0) — Security

  • [token-metering] Remove X-Dev-Admin header bypass or add cryptographic signature verification (core/auth.py:147)
  • [token-metering] Validate SSO_URL at startup — fail fast instead of 500 on first auth attempt (config.py:28)
  • [token-metering] Audit CORS configuration — ensure credentials not allowed with overly permissive origins (main.py:80-86)
  • [study-mode] Fix admin secret timing attack — use secrets.compare_digest() (main.py:290)
  • [study-mode] Add JWKS cache staleness limit (max 24h) before failing hard (auth.py:64-68)
  • [study-mode] Ensure .gitignore covers all .env files to prevent credential leaks
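
For the admin secret item, a constant-time comparison could look like the minimal sketch below. The function name `is_admin_secret_valid` and its parameters are hypothetical; the actual check in main.py may load and pass the secret differently.

```python
import secrets
from typing import Optional


def is_admin_secret_valid(provided: Optional[str], expected: str) -> bool:
    """Compare two secrets in constant time (hypothetical helper, not from main.py)."""
    if not provided or not expected:
        # Reject missing input, and never accept when no secret is configured.
        return False
    # Encode to bytes: compare_digest raises TypeError on non-ASCII str input.
    return secrets.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A plain `==` on strings can return as soon as the first differing character is found, which is what makes response time depend on how much of the guess is correct. `secrets.compare_digest` avoids that early exit.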

High (P1) — Reliability & Performance

Token Metering API:

  • Add pool_recycle=3600 to DB engine to prevent stale connections (core/database.py:56)
  • Configure DB connection timeout to prevent indefinite hangs (core/database.py:53-61)
  • Add composite index (user_id, request_id) on token_transactions for idempotency lookups
  • Add composite index (model, is_active, effective_date DESC) on pricing table
  • Fix Slowapi sync Redis client blocking event loop (core/rate_limit.py:30-52)
  • Health endpoint should verify Redis Lua scripts are loaded (routes/health.py:26-47)
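
To illustrate the composite index on (user_id, request_id), the sketch below creates one and asks the query planner how it would run the idempotency lookup. It uses an in-memory SQLite database purely so it is self-contained; the production database, the index name `idx_tx_user_request`, and the `id` and `tokens` columns are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_transactions ("
    " id INTEGER PRIMARY KEY,"
    " user_id TEXT NOT NULL,"
    " request_id TEXT,"
    " tokens INTEGER NOT NULL)"
)
# Composite index whose column order matches the lookup's WHERE clause.
conn.execute(
    "CREATE INDEX idx_tx_user_request ON token_transactions (user_id, request_id)"
)


def lookup_plan(connection: sqlite3.Connection) -> str:
    """Return the planner's description of the idempotency lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id, tokens FROM token_transactions "
        "WHERE user_id = ? AND request_id = ?",
        ("user-1", "req-1"),
    ).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


plan = lookup_plan(conn)
print(plan)
```

The same check on the real database (for example with its own `EXPLAIN` output) is a reasonable way to confirm the new index is actually chosen before and after the migration.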

Study Mode API:

  • Add retry logic (2-3 attempts with exponential backoff) to metering HTTP client (metering/client.py:91-153)
  • Fix metering client connection pool lifecycle — proper singleton with cleanup (metering/client.py:33-40)
  • Add timeout to LLM streaming — asyncio.wait_for() wrapper around Runner.run_streamed() (chatkit_server.py:325)
  • Add backpressure to background writer queue — reject or circuit-break at >900 items (chatkit_store/cached_postgres_store.py:61-67)
  • Increase DB pool size or add semaphore for concurrent DB operations (chatkit_store/postgres_store.py:90-96)
  • Add circuit breaker to GitHub content fetcher — fail fast after 3 consecutive failures (services/content_loader.py:68-74)
  • Optimize Redis SCAN operations — limit total keys scanned (chatkit_store/cached_postgres_store.py:208-216)
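
The retry item for the metering client could be sketched roughly as follows. `TransientError`, `call_with_retry`, and the default delays are hypothetical; the real client would need to map its own timeout and 5xx failures onto whatever exception it decides to retry.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, resets)."""


async def call_with_retry(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.2,
    max_delay: float = 2.0,
) -> T:
    """Run op up to `attempts` times with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return await op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries coming from many clients at once.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Retrying a deduct call is only safe if every attempt carries the same request_id, so that the metering side can treat a repeat as an idempotent replay rather than a second charge.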

Medium (P2) — Observability & Operations

Token Metering API:

  • Emit metrics/alerts when Redis fails and metering enters fail-open mode (services/metering.py:218)
  • Add request duration logging for SLA monitoring (routes/metering.py)
  • Track idempotent deduct replay frequency with metrics (services/metering.py:296)
  • Validate configuration at startup — fail fast on invalid REDIS_URL (config.py)
  • Add structured logging with correlation fields (user_id, request_id)
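
Fail-fast validation of REDIS_URL could look something like the sketch below, using only the standard library. The function name `validate_redis_url` and the exact rules are assumptions; the set of schemes reflects the ones redis-py accepts (`redis`, `rediss`, `unix`).

```python
from urllib.parse import urlparse

ALLOWED_REDIS_SCHEMES = {"redis", "rediss", "unix"}


def validate_redis_url(url: str) -> str:
    """Return url unchanged if it looks usable, else raise ValueError (hypothetical helper)."""
    if not url:
        raise ValueError("REDIS_URL is not set")
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_REDIS_SCHEMES:
        raise ValueError(f"REDIS_URL has unsupported scheme {parsed.scheme!r}")
    if parsed.scheme == "unix":
        if not parsed.path:
            raise ValueError("REDIS_URL unix socket path is missing")
        return url
    if not parsed.hostname:
        raise ValueError("REDIS_URL is missing a host")
    try:
        # Accessing .port raises ValueError for a non-numeric or out-of-range port.
        parsed.port
    except ValueError as exc:
        raise ValueError(f"REDIS_URL has an invalid port: {exc}") from exc
    return url
```

Calling this while the settings object is built means a malformed URL stops the process at startup instead of surfacing on the first metered request. It checks only the shape of the URL, not whether Redis is reachable.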

Study Mode API:

  • Add metering API health check to /health endpoint (main.py:232-266)
  • Propagate X-Request-ID to all log statements and downstream services
  • Add alerting when rate limiter enters fail-open mode (core/rate_limit.py:109-119)
  • Add background writer queue metrics (depth, write duration, failure rate)
  • Track active WebSocket connections with periodic cleanup registry
  • Monitor cache hit/miss rates
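
One way to get X-Request-ID into every log line without threading it through each call is a context variable plus a logging filter. The sketch below is framework-agnostic; `request_id_var`, `RequestIdFilter`, `build_logger`, `handle_request`, and the logger name are all hypothetical, and in the real service the `set`/`reset` pair would presumably live in middleware.

```python
import contextvars
import logging

# Holds the current request's ID; "-" when no request is in flight.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("study_mode.sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, incoming_request_id: str) -> dict:
    """Bind the ID for the duration of one request and return headers to forward."""
    token = request_id_var.set(incoming_request_id)
    try:
        logger.info("fetching lesson content")
        # Forward the same ID on downstream calls (for example, to the metering API).
        return {"X-Request-ID": request_id_var.get()}
    finally:
        request_id_var.reset(token)
```

Context variables are copied into asyncio tasks when they are created, so the ID set at the start of a request is visible to log calls made in tasks spawned during it.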

Low (P3) — Code Quality

  • [token-metering] Fix request_id unique constraint — nullable defeats uniqueness intent (models/transaction.py:89)
  • [token-metering] Validate UUID version in UUID_PATTERN (routes/schemas.py:26-27)
  • [token-metering] Make DEFAULT_PRICING configurable without deployment (services/metering.py:46-51)
  • [token-metering] Enable SSL cert verification for DB connections (core/database.py:34-37)
  • [study-mode] Remove TODO comment in fte/tools.py:82 or implement feature
  • [study-mode] Move hardcoded config values to config.py (chatkit_server.py:37-39)
  • [study-mode] Fix silent exception swallowing in content cache parsing (services/content_loader.py:143-144)
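
For the UUID_PATTERN item, a version-aware pattern might look like the sketch below. It assumes version 4 is the intended format, which the audit does not state; `UUID_V4_PATTERN` and `is_uuid_v4` are illustrative names rather than the ones in routes/schemas.py.

```python
import re

# Canonical hyphenated form with version nibble 4 and RFC 4122 variant (8, 9, a, b).
UUID_V4_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_uuid_v4(value: str) -> bool:
    """True only when the whole string is a canonical version-4 UUID."""
    # fullmatch avoids the trailing-newline leniency of a `$` anchor.
    return UUID_V4_PATTERN.fullmatch(value) is not None
```

A pattern that accepts any hex digit in the version position would also pass version 1 UUIDs and the all-zero nil UUID, which is the gap this check closes.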

Context

  • Source: Code audit during v6 credits migration (PR on 001-freemium-tracker branch)
  • Scope: 34 files, ~8,600 lines across both APIs
  • Not blocking: None of these affect billing correctness (fixed in v6 credits PR)
  • Prioritization: P0 items should be addressed before scaling beyond beta users

🤖 Generated with Claude Code
