
[Discussion]: Exposing prompt caching controls #805

@bilelomrani1

Today any-llm has no way to express prompt caching intent through the Completion API. This discussion proposes a design for CompletionParams and the message format, and examines what's needed to implement it. ResponsesParams already has prompt_cache_key and prompt_cache_retention, so the Responses API side is partially covered.

Prompt caching across providers

To inform the design, here's how caching works across the providers any-llm supports. There are two distinct kinds of control:

  • Block-level: the user marks specific content blocks as cache breakpoints, telling the provider where to split the cached prefix.
  • Request-level: the user sets parameters on the request itself, controlling how the cache behaves (retention duration, routing hints).
| Provider | Mechanism | Block-level control | Request-level control |
|---|---|---|---|
| Anthropic | Prefix-based. cache_control on content blocks marks breakpoints. Max 4 breakpoints. TTL: 5 min or 1 h. | cache_control on blocks | None |
| Bedrock (Converse API) | Prefix-based, different wire format: a cachePoint block is inserted between content blocks. Max 4. TTL: 5 min or 1 h. | cachePoint blocks | None |
| Vertex AI w/ Anthropic models | Same as Anthropic. Supports 1 h TTL on newer models. | cache_control on blocks | None |
| Azure w/ Anthropic models | Same as Anthropic. Not currently supported by any-llm (see #804). | cache_control on blocks | None |
| OpenAI | Automatic prefix caching for prompts >1024 tokens. | None (automatic) | prompt_cache_retention ("in-memory" / "24h"), prompt_cache_key (routing hint) |
| DeepSeek | Automatic prefix caching (disk-based). | None (automatic) | None |
| xAI (Grok) | Automatic prefix caching (OpenAI-compatible API). 50-75% discount on cached tokens. | None (automatic) | None (works via OpenAI compatibility) |
| Gemini | Separate Context Caching API: pre-created CachedContent resources referenced by name. | N/A (different paradigm) | N/A (different paradigm) |
| Others | No caching mechanism exposed or documented. | N/A | N/A |
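
For concreteness, here is roughly what the two native block-level formats look like on the wire. The variable names are illustrative; the payload shapes follow the provider docs cited under "Sources for native formats" below.

```python
# Anthropic Messages API: cache_control sits directly on the content block.
anthropic_system = [
    {
        "type": "text",
        "text": "You are a helpful assistant.",
        "cache_control": {"type": "ephemeral"},  # add "ttl": "1h" for the 1-hour cache
    }
]

# Bedrock Converse API: the breakpoint is a separate cachePoint block,
# inserted after the content it applies to.
bedrock_system = [
    {"text": "You are a helpful assistant."},
    {"cachePoint": {"type": "default"}},
]
```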


Gemini's context caching is a fundamentally different feature (resource lifecycle management) and probably out of scope here.

Proposed user-facing API

The proposal has two complementary parts, matching the two kinds of control found in the ecosystem.

1. Block-level cache breakpoints

Since any-llm uses OpenAI's Chat Completion format as its canonical representation, caching intent can be expressed as a cache_control key on content block dicts. This composes naturally: content blocks are already dicts, and extra keys are ignored by providers that don't use them.

```python
messages = [
    {"role": "system", "content": [
        {"type": "text", "text": "You are a helpful assistant.", "cache_control": "auto"}
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "< long document >", "cache_control": "extended"},
        {"type": "text", "text": "What does this document say?"}
    ]},
]
```

cache_control takes a semantic enum value rather than a provider-specific setting, similar to how any-llm already handles reasoning_effort. The value expresses caching intent and each provider maps it to its native semantics on a best-effort basis:

| cache_control | Anthropic / Vertex AI | Bedrock (Converse API) | OpenAI / Others |
|---|---|---|---|
| "auto" | {"type": "ephemeral"} (5 min TTL) | {"cachePoint": {"type": "default"}} inserted after the block | Stripped |
| "extended" | {"type": "ephemeral", "ttl": "1h"} | {"cachePoint": {"type": "default", "ttl": "1h"}} inserted after the block | Stripped |

Sources for native formats:

  • Anthropic cache_control: Prompt caching docs. Uses {"type": "ephemeral"} on content blocks, with an optional "ttl": "1h" for extended caching.
  • Bedrock cachePoint: CachePointBlock API reference. Uses {"cachePoint": {"type": "default"}} as a separate content block, with an optional "ttl": "1h".

Block-level hints are stripped for providers that cache automatically (OpenAI, DeepSeek, xAI) since they don't need or accept breakpoint markers.
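
To make the mapping concrete, here is a minimal sketch of the per-block translation. convert_cache_control and the provider identifiers are hypothetical names rather than existing any-llm internals, and the rest of the OpenAI-to-native block conversion is assumed to happen elsewhere.

```python
from typing import Any


def convert_cache_control(block: dict[str, Any], provider: str) -> list[dict[str, Any]]:
    """Translate the semantic cache_control hint on a single content block.

    Returns the block(s) to emit in the provider-native message; for Bedrock the
    breakpoint becomes an extra cachePoint block appended after the original one.
    """
    hint = block.get("cache_control")
    if hint is None:
        return [block]

    # Copy the block without the any-llm-specific hint.
    native = {k: v for k, v in block.items() if k != "cache_control"}

    if provider in ("anthropic", "vertexai"):
        cache_control: dict[str, Any] = {"type": "ephemeral"}
        if hint == "extended":
            cache_control["ttl"] = "1h"
        return [{**native, "cache_control": cache_control}]

    if provider == "bedrock":
        cache_point: dict[str, Any] = {"type": "default"}
        if hint == "extended":
            cache_point["ttl"] = "1h"
        return [native, {"cachePoint": cache_point}]

    # Providers that cache automatically (OpenAI, DeepSeek, xAI, ...) get the hint stripped.
    return [native]
```

For example, a block marked "auto" targeting Bedrock comes back as the original block followed by {"cachePoint": {"type": "default"}}, matching the table above.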

2. Request-level cache parameters

OpenAI exposes request-level caching controls that are orthogonal to block-level breakpoints. They don't tell the provider where to cache, but how to manage the cache:

  • prompt_cache_retention: How long to keep cached prefixes active. OpenAI accepts "in-memory" (5-10 min default) and "24h" (extended).
  • prompt_cache_key: A routing hint (max 64 chars) to improve cache hit rates when many requests share common prefixes.

These would live on CompletionParams, not on content blocks:

```python
completion = await client.acompletion(
    model="gpt-4o",
    messages=[...],
    prompt_cache_retention="24h",
    prompt_cache_key="user-123",
)
```

The OpenResponses spec defines both prompt_cache_key and prompt_cache_retention on CreateResponseBody. any-llm's ResponsesParams already mirrors these fields. Using the same names on CompletionParams keeps the two APIs consistent and aligned with the spec. OpenAI's Chat Completions endpoint also accepts both parameters, so the mapping is direct: no translation is needed for the primary consumer.

OpenResponses spec: CreateResponseBody schema, PromptCacheRetentionEnum
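
On the implementation side, the two fields could be declared roughly as below. This is a sketch assuming a pydantic-style CompletionParams model (the actual any-llm definition may differ); the literal values mirror the OpenAI/OpenResponses values discussed above.

```python
from typing import Literal

from pydantic import BaseModel, Field


class CompletionParams(BaseModel):
    # ... existing completion fields elided ...

    prompt_cache_key: str | None = Field(
        default=None,
        max_length=64,  # OpenAI caps the routing hint at 64 characters
        description="Routing hint to improve cache hit rates for shared prefixes.",
    )
    prompt_cache_retention: Literal["in-memory", "24h"] | None = Field(
        default=None,
        description="How long cached prefixes stay active.",
    )
```

For OpenAI both fields pass through to the Chat Completions request unchanged; providers without request-level controls would presumably just drop them.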


Obviously these are just suggestions; I researched the topic a bit, but a fresh perspective on the design is super important here. Feel free to share your thoughts on whether you want to support this and how.

Related issues

While researching cache control support across providers, I found that Claude on Azure (Microsoft Foundry) is not supported by any-llm at all. This is tracked as a separate feature request: #804. Once that provider exists, it would be great to have it automatically benefit from any block-level cache control work done here.
