Labels: bug (Something isn't working), p1 (Priority: Critical, score 22-29)
Description
Impact: OpenAI spend spiked ~5x on Mar 9 (gpt-4.1-mini input tokens rose ~11x). Spike window 03:00–09:00 UTC, peaked at ~35x normal tokens/req. Already resolved as of Mar 10 — filing for the 4 structural guards to prevent recurrence.
Introduced by PRs #5493, #5500, #5503 (@kodjima33, deployed 01:08 UTC Mar 9).
Current Behavior
- `extract_and_update_goal_progress` (routers/chat.py:117) fires `llm_mini` on every chat message unconditionally, regardless of whether the user has active goals.
- `extract_question_from_conversation` (utils/llm/chat.py:1022-1027) sends the full 10-message history twice in the same prompt (`<user_last_messages>` + `<previous_messages>` overlap), doubling input tokens.
- Prompt cache hit rate collapsed from ~39% to ~17% due to new unique prompts from onboarding changes.
- `process_conversation()` lacks an idempotency gate, so reconnection storms reprocess already-completed conversations with 5+ `llm_mini` calls each.
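The doubled-payload bug above can be avoided by partitioning the history so each message lands in exactly one prompt section. A minimal sketch, assuming hypothetical names (`build_question_prompt`, `role`/`text` message fields) rather than the actual `extract_question_from_conversation` internals:

```python
def build_question_prompt(messages: list[dict], last_n: int = 5) -> str:
    """Sketch of the dedup fix: each message appears in exactly one section,
    removing the <user_last_messages>/<previous_messages> overlap."""
    previous, last = messages[:-last_n], messages[-last_n:]

    def fmt(ms: list[dict]) -> str:
        return "\n".join(f"{m['role']}: {m['text']}" for m in ms)

    return (
        "<previous_messages>\n" + fmt(previous) + "\n</previous_messages>\n"
        "<user_last_messages>\n" + fmt(last) + "\n</user_last_messages>"
    )
```

With a 10-message history this halves input tokens versus sending the full history in both sections.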
Expected Behavior
llm_mini calls are guarded by rate limits, deduplicated payloads, prompt caching, and idempotency checks so that per-user cost stays within baseline.
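The idempotency check could take roughly this shape; a hedged sketch only, using an in-memory set as a stand-in for the Firestore/Redis processed-flag store the issue calls for (names and signature are hypothetical):

```python
# Stand-in for the persistent processed-flag store (Firestore/Redis in prod).
_processed: set[str] = set()

def process_conversation(conversation_id: str) -> bool:
    """Return True if processing ran, False if skipped as already completed."""
    if conversation_id in _processed:
        # Reconnection storm: conversation already done, skip the 5+ llm_mini fan-out.
        return False
    _processed.add(conversation_id)
    # ... expensive llm_mini fan-out would happen here ...
    return True
```

In production the flag should be set atomically (e.g. a Firestore transaction or Redis `SET NX`) so two concurrent reconnects cannot both pass the gate.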
Affected Areas
| File | Line | Description |
|---|---|---|
| routers/chat.py | 117 | `extract_and_update_goal_progress` called unconditionally |
| utils/llm/chat.py | 1022-1027 | Doubled message payload in `extract_question_from_conversation` |
| utils/llm/clients.py | 16 | `llm_mini` client config (prompt cache settings) |
| utils/processing_memories.py | — | `process_conversation()` missing idempotency gate |
Acceptance Criteria
- Rate-limit goal progress extraction: `extract_and_update_goal_progress` fires only when the user has active goals AND at most once per 60s per user (Redis key TTL).
- Deduplicate chat history: `extract_question_from_conversation` sends each message exactly once; remove the `<user_last_messages>`/`<previous_messages>` overlap.
- Restore prompt cache hit rate: Enable `prompt_cache_retention=24h` on `llm_mini` (clients.py:16) to recover cache hits on repeated prompt patterns.
- Idempotency gate on `process_conversation()`: Skip reprocessing if the conversation is already completed (check Firestore/Redis for a processed flag before firing the 5+ `llm_mini` fan-out).
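The first criterion combines an active-goals guard with a per-user 60s window. A sketch, assuming the production code uses Redis `SET NX EX` as the atomic once-per-window lock; the `FakeRedis` class below is a hypothetical in-memory stand-in so the example is self-contained:

```python
import time

class FakeRedis:
    """In-memory stand-in; production code would call
    redis.set(key, 1, nx=True, ex=60) for the same effect."""

    def __init__(self):
        self._store: dict[str, tuple] = {}

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        # Drop the key if its TTL has expired.
        entry = self._store.get(key)
        if entry is not None and entry[1] is not None and entry[1] <= now:
            del self._store[key]
        if nx and key in self._store:
            return None  # key still alive -> caller is rate-limited
        self._store[key] = (value, now + ex if ex else None)
        return True

redis = FakeRedis()

def should_extract_goal_progress(uid: str, has_active_goals: bool) -> bool:
    """Gate for extract_and_update_goal_progress: fire only when the user has
    active goals AND at most once per 60s per user (Redis key TTL)."""
    if not has_active_goals:
        return False
    # SET NX EX acts as an atomic "acquire once per 60s window" lock.
    return redis.set(f"goal_progress_rl:{uid}", 1, nx=True, ex=60) is not None
```

Keying the lock on `uid` keeps the limit per-user, so one chatty user cannot starve others, and the TTL makes the window self-cleaning with no explicit reset.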
Files to Modify
- routers/chat.py
- utils/llm/chat.py
- utils/llm/clients.py
- utils/processing_memories.py
- database/redis_db.py (if a new rate-limit key is needed)
Impact
Cost-only fix — no user-facing behavior change. Guards prevent recurrence of the ~5x spend spike.
by AI for @beastoin