fix: guard llm_mini cost — goal progress rate-limit + deduplicate chat history #5530

@beastoin

Description

Impact: OpenAI spend spiked ~5x on Mar 9 (gpt-4.1-mini input tokens rose ~11x). The spike window was 03:00–09:00 UTC, peaking at ~35x normal tokens/req. Already resolved as of Mar 10; filing this issue for the 4 structural guards that prevent recurrence.

Introduced by PRs #5493, #5500, #5503 (@kodjima33, deployed 01:08 UTC Mar 9).

Current Behavior

  • extract_and_update_goal_progress (routers/chat.py:117) fires llm_mini on every chat message unconditionally — regardless of whether the user has active goals.
  • extract_question_from_conversation (utils/llm/chat.py:1022-1027) sends the full 10-message history twice in the same prompt (<user_last_messages> + <previous_messages> overlap), doubling input tokens.
  • Prompt cache hit rate collapsed from ~39% to ~17% due to new unique prompts from onboarding changes.
  • process_conversation() lacks an idempotency gate — reconnection storms reprocess already-completed conversations with 5+ llm_mini calls each.

Expected Behavior

llm_mini calls are guarded by rate limits, deduplicated payloads, prompt caching, and idempotency checks so that per-user cost stays within baseline.

Affected Areas

File                           Line        Description
routers/chat.py                117         extract_and_update_goal_progress called unconditionally
utils/llm/chat.py              1022-1027   Doubled message payload in extract_question_from_conversation
utils/llm/clients.py           16          llm_mini client config (prompt cache settings)
utils/processing_memories.py   -           process_conversation() missing idempotency gate

Acceptance Criteria

  1. Rate-limit goal progress extraction: extract_and_update_goal_progress fires only when user has active goals AND max 1x per 60s per user (Redis key TTL).
  2. Deduplicate chat history: extract_question_from_conversation sends each message exactly once — remove <user_last_messages> / <previous_messages> overlap.
  3. Restore prompt cache hit rate: Enable prompt_cache_retention=24h on llm_mini (clients.py:16) to recover cache hits on repeated prompt patterns.
  4. Idempotency gate on process_conversation(): Skip reprocessing if conversation already completed (check Firestore/Redis for processed flag before firing the 5+ llm_mini fan-out).
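A minimal sketch of the guard for criterion 1, assuming redis-py-style `SET NX EX` semantics. The helper name `should_extract_goal_progress`, the key format `goal_progress_rate_limit:{uid}`, and the `FakeRedis` stand-in (used here so the sketch is self-contained) are all hypothetical, not the actual implementation:

```python
# Hypothetical rate-limit guard for extract_and_update_goal_progress.
# Assumption: key name and helper name are illustrative, not from the codebase.
GOAL_PROGRESS_TTL_SECONDS = 60


class FakeRedis:
    """Minimal in-memory stand-in for redis.Redis, for illustration only."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False, ex=None):
        # Mirrors redis-py semantics: with nx=True, set only if key is absent;
        # redis-py returns None when the NX condition fails.
        if nx and key in self._store:
            return None
        self._store[key] = value
        return True


def should_extract_goal_progress(r, uid, has_active_goals):
    # Guard 1: skip entirely when the user has no active goals.
    if not has_active_goals:
        return False
    # Guard 2: SET key NX EX 60 succeeds at most once per 60s window per user,
    # so the llm_mini call fires at most 1x/60s/user.
    return bool(
        r.set(f"goal_progress_rate_limit:{uid}", "1", nx=True, ex=GOAL_PROGRESS_TTL_SECONDS)
    )
```

With a real Redis client the TTL expires the key automatically, so no cleanup job is needed; the fake above omits expiry since it only demonstrates the NX gate.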

Files to Modify

  • routers/chat.py
  • utils/llm/chat.py
  • utils/llm/clients.py
  • utils/processing_memories.py
  • database/redis_db.py (if new rate-limit key needed)
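For criterion 4, the idempotency gate could follow a first-caller-wins pattern. This is a sketch only: `try_claim_conversation` and the `conversation_processed:{id}` key are hypothetical names, and a plain dict stands in for the Firestore/Redis flag store so the example is self-contained:

```python
def try_claim_conversation(store, conversation_id):
    """One-shot processed flag: first caller wins, reconnection storms no-op.

    Hypothetical helper; in production `store` would be Redis
    (e.g. SET key NX) or a Firestore document, not a dict.
    """
    key = f"conversation_processed:{conversation_id}"
    if key in store:
        return False  # already claimed by an earlier call
    store[key] = "1"
    return True


def process_conversation(store, conversation_id):
    # Idempotency gate: check the processed flag before firing
    # the 5+ llm_mini fan-out, so reprocessing is skipped.
    if not try_claim_conversation(store, conversation_id):
        return "skipped"
    # ... existing llm_mini fan-out runs exactly once here ...
    return "processed"
```

Note the check-and-set must be atomic in the real implementation (Redis `SET NX` or a Firestore transaction); the dict version above is only safe single-threaded.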

Impact

Cost-only fix — no user-facing behavior change. Guards prevent recurrence of the ~5x spend spike.


by AI for @beastoin

Metadata

Labels: bug (Something isn't working), p1 (Priority: Critical, score 22-29)
