High token usage due to Prompt Cache in agents (large cached fixed context) – bug or expected behavior? #11615
Replies: 1 comment 1 reply
I can confirm this issue with some benchmark data. Using an agent with […]:

- Evidence (consecutive turns, same conversation): only 16 seconds apart, well within Anthropic's 5-minute cache TTL, yet the second turn […]
- Within-turn caching DOES work — when tool calls create multiple sub-turns […]
- Cost impact: I benchmarked the same conversation with caching ON vs OFF. For simple turns, caching is 25% MORE expensive than no caching — you […]. Overall, cache ON still wins (~$1.85 vs ~$2.19 for a full conversation).
- The root cause is likely the agent framework reconstructing the message […]

Could you provide some insights on this issue, please? Or what am I missing here? Many thanks!

Pieter
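The "25% more expensive" figure lines up with Anthropic's published pricing multipliers (cache writes billed at 1.25× the base input rate, cache reads at 0.1×). A back-of-envelope sketch, with an illustrative price and token count, shows why a write-only turn loses but a conversation with real cache hits still wins:

```python
# Back-of-envelope cost model for Anthropic prompt caching.
# Assumes the published multipliers: cache writes cost 1.25x the base
# input price, cache reads 0.1x. Price and token counts are illustrative.

BASE = 3.00 / 1_000_000  # $/input token (illustrative Sonnet-class price)
WRITE_MULT = 1.25        # multiplier for cache_creation_input_tokens
READ_MULT = 0.10         # multiplier for cache_read_input_tokens

def turn_cost(read_tokens, write_tokens, plain_tokens):
    """Input-side cost of one turn, ignoring output tokens."""
    return BASE * (plain_tokens
                   + WRITE_MULT * write_tokens
                   + READ_MULT * read_tokens)

context = 25_000  # fixed agent context (system prompt, tools, RAG, ...)

# Caching OFF: the full context is billed at the base rate every turn.
off = turn_cost(0, 0, context)
# Caching ON, cache miss (the behavior reported above): full write.
on_miss = turn_cost(0, context, 0)
# Caching ON, cache hit: full read.
on_hit = turn_cost(context, 0, 0)

print(f"off:     ${off:.4f}")
print(f"on miss: ${on_miss:.4f}  ({on_miss / off:.0%} of no-cache)")
print(f"on hit:  ${on_hit:.4f}  ({on_hit / off:.0%} of no-cache)")
```

A miss is exactly 125% of the uncached cost, matching the benchmark, while a hit is 10% — which is why caching only pays off if the second turn actually reads the entry written by the first instead of rewriting it.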
We are observing unexpectedly high token usage when using agents in LibreChat, even for very simple queries (e.g. “What time is it?”).
Specifically, we see metrics like:
promptTokens.input: very low (e.g. 6 tokens)
promptTokens.write: very high (20k–30k tokens)
promptTokens.read: 0
completionTokens: normal
This suggests that a very large Prompt Cache is being written, corresponding to some fixed cached context (system prompt, tools, RAG, model wrappers, etc.), rather than the user’s actual input.
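For reference, this pattern corresponds to the `usage` object the Anthropic Messages API returns; the mapping of those fields onto the `promptTokens` names above is my assumption about LibreChat's accounting, not confirmed from its code:

```python
# Hypothetical mapping from an Anthropic API `usage` object onto the
# promptTokens fields reported above. The LibreChat-side field names
# are assumed for illustration; the Anthropic field names are real.

usage = {
    "input_tokens": 6,                     # only the new user text
    "cache_creation_input_tokens": 25_000, # fixed agent context, written to cache
    "cache_read_input_tokens": 0,          # nothing read back from cache
}

prompt_tokens = {
    "input": usage["input_tokens"],
    "write": usage["cache_creation_input_tokens"],
    "read": usage["cache_read_input_tokens"],
}

# A healthy second turn within the cache TTL should instead report
# write == 0 and read == 25_000 for the same fixed context.
total_billed = sum(prompt_tokens.values())
print(prompt_tokens, total_billed)
```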
Some relevant observations:
The issue only happens with agents, not with simple chat sessions.
Modifying the agent’s instructions does not reduce promptTokens.write.
Changing the model (e.g. from Claude Sonnet to another model) significantly reduces the cache size, which suggests the prompt is assembled differently depending on the model.
With OpenAI models (e.g. GPT-4o-mini), no prompt caching (write/read) is observed, although the prompt itself is still large.
We could not find a clear way to “clear” or invalidate the prompt cache, other than indirect changes or waiting for the TTL to expire.
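On the invalidation point: Anthropic's cache has no explicit "clear" operation — entries simply expire after the TTL, and the cache key is the exact token prefix up to each `cache_control` breakpoint. So any byte-level change to the system prompt or tool definitions produces a new entry rather than a hit. A sketch of the request shape (no network call; the model id and prompt text are illustrative):

```python
# Builds a Messages API payload with prompt caching enabled, to show
# that the cache is keyed on the exact prefix up to the cache_control
# breakpoint. Model id and prompt text are illustrative.

def build_request(system_prompt, tools, user_text):
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks everything up to this point as a cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }

a = build_request("You are a helpful agent.", [], "What time is it?")
b = build_request("You are a helpful agent!", [], "What time is it?")

# One changed character in the system prompt means turn `b` writes a
# brand-new cache entry instead of reading the one written by turn `a`.
print(a["system"][0]["text"] == b["system"][0]["text"])  # False
```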
Our main questions are:
Is this behavior expected, or is it a known bug?
Which components exactly are included in this cached fixed context for agents?
Is there a recommended way to:
clear or invalidate the prompt cache,
prevent certain blocks (RAG, tools, global system prompts) from being cached,
or disable prompt caching per agent or globally?
Does this behavior apply only to agents, or should it also be expected in other modes?
Have there been other reports of unexpectedly high promptTokens.write usage in similar setups?
The goal is to understand how to control or reduce the cached fixed context, especially for agents that are also used for simple conversational queries, in order to avoid unnecessary token costs.