Description
Today any-llm has no way to express prompt caching intent through the Completion API. This discussion proposes a design for CompletionParams and the message format, and examines what's needed to implement it. ResponsesParams already has prompt_cache_key and prompt_cache_retention, so the Responses API side is partially covered.
Prompt caching across providers
To inform the design, here's how caching works across the providers any-llm supports. There are two distinct kinds of control:
- Block-level: the user marks specific content blocks as cache breakpoints, telling the provider where to split the cached prefix.
- Request-level: the user sets parameters on the request itself, controlling how the cache behaves (retention duration, routing hints).
| Provider | Mechanism | Block-level control | Request-level control |
|---|---|---|---|
| Anthropic | Prefix-based. `cache_control` on content blocks marks breakpoints. Max 4. TTL: 5min or 1h. | `cache_control` on blocks | None |
| Bedrock (Converse API) | Prefix-based, different wire format: a `cachePoint` block inserted between content blocks. Max 4. TTL: 5min or 1h. | `cachePoint` blocks | None |
| Vertex AI w/ Anthropic models | Same as Anthropic. Supports 1h TTL on newer models. | `cache_control` on blocks | None |
| Azure w/ Anthropic models | Same as Anthropic. Not currently supported by any-llm (see #804). | `cache_control` on blocks | None |
| OpenAI | Automatic prefix caching for prompts >1024 tokens. | None (automatic) | `prompt_cache_retention` (`"in-memory"` / `"24h"`), `prompt_cache_key` (routing hint) |
| DeepSeek | Automatic prefix caching (disk-based). | None (automatic) | None |
| xAI (Grok) | Automatic prefix caching (OpenAI-compatible API). 50-75% discount on cached tokens. | None (automatic) | None (works via OpenAI compatibility) |
| Gemini | Separate Context Caching API: pre-created `CachedContent` resources referenced by name. | N/A (different paradigm) | N/A (different paradigm) |
| Others | No caching mechanism exposed or documented. | N/A | N/A |
Sources:
- Anthropic: Prompt caching docs
- Bedrock: ContentBlock API reference, CachePointBlock API reference, Prompt caching guide
- Vertex AI: Claude prompt caching on Vertex AI
- Azure: Claude in Microsoft Foundry docs
- OpenAI: Prompt caching guide
- DeepSeek: Context caching docs
- xAI: Consumption and rate limits
- Gemini: Context caching docs
Gemini's context caching is a fundamentally different feature (resource lifecycle management) and probably out of scope here.
Proposed user-facing API
The proposal has two complementary parts, matching the two kinds of control found in the ecosystem.
1. Block-level cache breakpoints
Since any-llm uses OpenAI's Chat Completion format as its canonical representation, caching intent can be expressed as a cache_control key on content block dicts. This composes naturally: content blocks are already dicts, and extra keys are ignored by providers that don't use them.
```python
messages = [
    {"role": "system", "content": [
        {"type": "text", "text": "You are a helpful assistant.", "cache_control": "auto"}
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "< long document >", "cache_control": "extended"},
        {"type": "text", "text": "What does this document say?"}
    ]},
]
```

`cache_control` takes a semantic enum value rather than a provider-specific setting, similar to how any-llm already handles `reasoning_effort`. The value expresses caching intent, and each provider maps it to its native semantics on a best-effort basis:
| `cache_control` | Anthropic / Vertex AI | Bedrock (Converse API) | OpenAI / Others |
|---|---|---|---|
| `"auto"` | `{"type": "ephemeral"}` (5min TTL) | `{"cachePoint": {"type": "default"}}` inserted after block | Stripped |
| `"extended"` | `{"type": "ephemeral", "ttl": "1h"}` | `{"cachePoint": {"type": "default", "ttl": "1h"}}` after block | Stripped |
Sources for native formats:
- Anthropic `cache_control`: Prompt caching docs. Uses `{"type": "ephemeral"}` on content blocks, with an optional `"ttl": "1h"` for extended caching.
- Bedrock `cachePoint`: CachePointBlock API reference. Uses `{"cachePoint": {"type": "default"}}` as a separate content block, with an optional `"ttl": "1h"`.
Block-level hints are stripped for providers that cache automatically (OpenAI, DeepSeek, xAI) since they don't need or accept breakpoint markers.
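
To make the mapping concrete, here is a minimal sketch of what the block-level translation could look like. The function and provider names are hypothetical (this is not any-llm's actual converter layer); the native formats follow the table above.

```python
from typing import Any


def map_block_cache_control(blocks: list[dict[str, Any]], provider: str) -> list[dict[str, Any]]:
    """Hypothetical helper: translate semantic cache_control values into
    provider-native markers, following the mapping table above."""
    out: list[dict[str, Any]] = []
    for block in blocks:
        block = dict(block)
        hint = block.pop("cache_control", None)  # always remove the canonical key
        if hint is None or provider in ("openai", "deepseek", "xai"):
            # Automatic-caching providers: strip the hint entirely.
            out.append(block)
        elif provider in ("anthropic", "vertexai"):
            # Anthropic wire format: cache_control stays on the block itself.
            native: dict[str, Any] = {"type": "ephemeral"}
            if hint == "extended":
                native["ttl"] = "1h"
            out.append({**block, "cache_control": native})
        elif provider == "bedrock":
            # Converse API wire format: a separate cachePoint block inserted
            # after the content block that ends the cached prefix.
            out.append(block)
            cache_point: dict[str, Any] = {"type": "default"}
            if hint == "extended":
                cache_point["ttl"] = "1h"
            out.append({"cachePoint": cache_point})
        else:
            # Unknown provider: be conservative and just strip the hint.
            out.append(block)
    return out
```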
2. Request-level cache parameters
OpenAI exposes request-level caching controls that are orthogonal to block-level breakpoints: they don't tell the provider where to cache, but how to manage the cache:
- `prompt_cache_retention`: how long to keep cached prefixes active. OpenAI accepts `"in-memory"` (5-10 min default) and `"24h"` (extended).
- `prompt_cache_key`: a routing hint (max 64 chars) to improve cache hit rates when many requests share common prefixes.
These would live on CompletionParams, not on content blocks:
```python
completion = await client.acompletion(
    model="gpt-4o",
    messages=[...],
    prompt_cache_retention="24h",
    prompt_cache_key="user-123",
)
```

The OpenResponses spec defines both `prompt_cache_key` and `prompt_cache_retention` on `CreateResponseBody`. any-llm's `ResponsesParams` already mirrors these fields. Using the same names on `CompletionParams` keeps the two APIs consistent and aligned with the spec. OpenAI's Chat Completions endpoint also accepts both parameters, so the mapping is direct: no translation is needed for the primary consumer.
OpenResponses spec: CreateResponseBody schema, PromptCacheRetentionEnum
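
On the params side, here is a rough sketch of the shape this could take, assuming a pydantic-style model like the ones any-llm already uses. The class and function below are illustrative only, not the real `CompletionParams` or provider code.

```python
from typing import Any, Literal

from pydantic import BaseModel


class CompletionParamsSketch(BaseModel):
    # Illustrative subset only; field names mirror ResponsesParams / the OpenResponses spec.
    model: str
    prompt_cache_retention: Literal["in-memory", "24h"] | None = None
    prompt_cache_key: str | None = None  # routing hint, max 64 chars


def to_openai_kwargs(params: CompletionParamsSketch) -> dict[str, Any]:
    # OpenAI's Chat Completions endpoint accepts both fields as-is, so the
    # mapping is a straight pass-through; other providers would drop them.
    kwargs: dict[str, Any] = {"model": params.model}
    if params.prompt_cache_retention is not None:
        kwargs["prompt_cache_retention"] = params.prompt_cache_retention
    if params.prompt_cache_key is not None:
        kwargs["prompt_cache_key"] = params.prompt_cache_key
    return kwargs
```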
Obviously these are just suggestions; I researched the topic a bit, but a fresh perspective on the design is super important here. Feel free to share your thoughts on whether you want to support this and how.
Related issues
While researching cache control support across providers, I found that Claude on Azure (Microsoft Foundry) is not supported by any-llm at all. This is tracked as a separate feature request: #804. Once that provider exists, it would be great to have it automatically benefit from any block-level cache control work done here.