Minimising token usage #2811

Mst96 · 2025-09-01T14:32:26Z

Mst96
Sep 1, 2025

Is there any way to minimise repetitive token processing?

I am developing a multi-tool agent. My setup uses abstraction layers like OpenRouter and LiteLLM to maintain model flexibility, so I can switch between providers as needed (though Gemini is my primary model). The agent is really effective right now but my issue is token usage. My agent depends on a large, static system prompt and a complex set of tool definitions.

My aim is to somehow reduce the token usage while maintaining the quality of execution of my agent.

Current Flow and Limitation
On every turn, and for each tool call, ADK sends the entire static prompt (system instructions + tool definitions) along with the dynamic conversation history. This repeated transmission of unchanging data leads to significant token consumption and high operational costs.

Just feels really redundant that on every call we're processing the same tokens/data. Doesn't feel sustainable at all to burn this many tokens or having to keep hacking workarounds

I'm working on ways to summarise some of the data so the most relevant data is processed on each run but the majority of token usage comes from the static prompt. So the biggest impact would come from improving this somehow.

While native SDKs like the Google GenAI SDK offer implicit caching as a solution, adopting it would remove the core benefit of using an abstraction layer for model flexibility.

Goal
To reduce the amount of tokens (or cost) per llm call.

I want to implement an efficient caching mechanism for the static portion of the prompt to drastically reduce token usage and cost without sacrificing model flexibility.

Ideally, I'd summarise the tool call results and cache the static prompt.

Questions
What is the recommended strategy or design pattern within ADK for managing and caching static prompt data when using third-party or abstracted LLM clients?

How can I architect my agent to separate the static and dynamic contexts, and prevent the re-transmission of the static portion on each turn?

Are there any documented examples or reference implementations of ADK agents that solve this specific token efficiency problem while maintaining a model-agnostic architecture?

Thank you for any guidance or best practices you can share.

Answered by adk-bot

Sep 1, 2025

Response from ADK Answering Agent (experimental, answer may be inaccurate)

TLDR: To minimize token usage in your ADK-based agent, you can use the query schema mode in AgentBuilderAssistant for on-demand schema fetching, and implement a caching layer using ADK Plugins with the before_model_callback to avoid repeated processing of static data. This approach maintains your model-agnostic architecture while improving token efficiency.

Hello Mst96,

Thank you for your detailed question. Minimizing token usage, especially with large static prompts, is a common challenge. Here are some strategies and design patterns within ADK to help you achieve your goal while maintaining a model-agnostic arch…

View full answer

adk-bot · 2025-09-01T14:33:47Z

adk-bot
Sep 1, 2025
Maintainer

Response from ADK Answering Agent (experimental, answer may be inaccurate)

TLDR: To minimize token usage in your ADK-based agent, you can use the query schema mode in AgentBuilderAssistant for on-demand schema fetching, and implement a caching layer using ADK Plugins with the before_model_callback to avoid repeated processing of static data. This approach maintains your model-agnostic architecture while improving token efficiency.

Hello Mst96,

Thank you for your detailed question. Minimizing token usage, especially with large static prompts, is a common challenge. Here are some strategies and design patterns within ADK to help you achieve your goal while maintaining a model-agnostic architecture.

Use the `query` Schema Mode for Token Efficiency

The AgentBuilderAssistant in ADK offers two modes for handling the AgentConfig schema:

embedded: This mode loads the entire schema into the instruction context at the beginning of the conversation. While this can be faster for execution, it consumes a significant number of tokens, which is not ideal for your use case. [1]
query: This mode uses a schema query tool to fetch schema details on-demand as needed. This approach results in lower initial token usage and is well-suited for token-conscious applications like yours. [1]

You can configure this by setting the schema_mode parameter to "query" when creating your AgentBuilderAssistant instance.

Implement Caching with ADK Plugins

ADK's plugin system is designed for implementing horizontal features like caching across your application. You can use callbacks to intercept requests and return cached responses, which will prevent the re-transmission of the static portion of your prompt on each turn.

The before_model_callback is particularly useful here. You can create a plugin that, before making a call to the model, checks if the static part of the prompt has been processed before. If so, you can retrieve the result from your cache and avoid the expensive LLM call. [2]

This is the recommended design pattern within ADK for managing and caching static prompt data.

Maintain a Model-Agnostic Architecture

ADK is designed to be model-agnostic, supporting various models through wrapper classes like LiteLlm. [3] This means you can implement the caching logic described above without tying your agent to a specific model or SDK, thus preserving the flexibility of your current setup.

Monitor Token Usage

To better understand and optimize your token consumption, you can integrate observability tools like AgentOps with ADK. These tools can provide detailed insights into your token usage, helping you identify areas for improvement. [4] Additionally, the LlmResponse object in ADK now includes token_usage information, which you can use for your own monitoring and analysis. [5]

Summary of Recommendations

Strategy/Design Pattern: Use ADK Plugins with before_model_callback to implement a caching layer.
Agent Architecture: The plugin-based caching mechanism will help you separate static and dynamic contexts.
Examples: While there isn't a specific, complete example for this exact caching scenario, the ADK documentation on plugins provides a solid foundation for building your solution. [2]

By leveraging these features, you can significantly reduce token usage and operational costs without sacrificing the model flexibility that is core to your agent's design.

[1] https://github.com/google/adk-python/blob/main/contributing/samples/adk_agent_builder_assistant/README.md
[2] https://google.github.io/adk-docs/plugins/
[3] https://google.github.io/adk-docs/agents/models/
[4] https://google.github.io/adk-docs/observability/agentops/
[5] https://github.com/google/adk-python/blob/main/CHANGELOG.md

2 replies

hangfei Sep 11, 2025
Maintainer

We are adding support for context management, filtering and caching. See roadmap #2133

Mst96 Sep 12, 2025
Author

thanks for the update @hangfei, this is much needed!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Minimising token usage #2811

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Minimising token usage #2811

Uh oh!

Uh oh!

Mst96 Sep 1, 2025

Replies: 1 comment · 2 replies

Uh oh!

adk-bot Sep 1, 2025 Maintainer

Use the query Schema Mode for Token Efficiency

Implement Caching with ADK Plugins

Maintain a Model-Agnostic Architecture

Monitor Token Usage

Summary of Recommendations

Uh oh!

hangfei Sep 11, 2025 Maintainer

Uh oh!

Mst96 Sep 12, 2025 Author

Mst96
Sep 1, 2025

Replies: 1 comment 2 replies

adk-bot
Sep 1, 2025
Maintainer

Use the `query` Schema Mode for Token Efficiency

hangfei Sep 11, 2025
Maintainer

Mst96 Sep 12, 2025
Author