
[Discussion]: Exposing prompt caching controls #805

@bilelomrani1

Today any-llm has no way to express prompt caching intent through the Completion API. This discussion proposes a design for CompletionParams and the message format, and examines what's needed to implement it. ResponsesParams already has prompt_cache_key and prompt_cache_retention, so the Responses API side is partially covered.

Prompt caching across providers

To inform the design, here's how caching works across the providers any-llm supports. There are two distinct kinds of control:

  • Block-level: the user marks specific content blocks as cache breakpoints, telling the provider where to split the cached prefix.
  • Request-level: the user sets parameters on the request itself, controlling how the cache behaves (retention duration, routing hints).
| Provider | Mechanism | Block-level control | Request-level control |
|---|---|---|---|
| Anthropic | Prefix-based. cache_control on content blocks marks breakpoints. Max 4 breakpoints. TTL: 5 min or 1 h. | cache_control on blocks | None |
| Bedrock (Converse API) | Prefix-based, different wire format: a cachePoint block is inserted between content blocks. Max 4. TTL: 5 min or 1 h. | cachePoint blocks | None |
| Vertex AI w/ Anthropic models | Same as Anthropic. Supports 1 h TTL on newer models. | cache_control on blocks | None |
| Azure w/ Anthropic models | Same as Anthropic. Not currently supported by any-llm (see #804). | cache_control on blocks | None |
| OpenAI | Automatic prefix caching for prompts >1024 tokens. | None (automatic) | prompt_cache_retention ("in-memory" / "24h"), prompt_cache_key (routing hint) |
| DeepSeek | Automatic prefix caching (disk-based). | None (automatic) | None |
| xAI (Grok) | Automatic prefix caching (OpenAI-compatible API). 50-75% discount on cached tokens. | None (automatic) | None (works via OpenAI compatibility) |
| Gemini | Separate Context Caching API: pre-created CachedContent resources referenced by name. | N/A (different paradigm) | N/A (different paradigm) |
| Others | No caching mechanism exposed or documented. | N/A | N/A |
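
For concreteness, here is roughly what the two native block-level formats look like on the wire. The variable names are illustrative; the payload shapes follow the provider docs cited under "Sources for native formats" below.

```python
# Anthropic Messages API: cache_control sits directly on the content block.
anthropic_system = [
    {
        "type": "text",
        "text": "You are a helpful assistant.",
        "cache_control": {"type": "ephemeral"},  # add "ttl": "1h" for the 1-hour cache
    }
]

# Bedrock Converse API: the breakpoint is a separate cachePoint block,
# inserted after the content it applies to.
bedrock_system = [
    {"text": "You are a helpful assistant."},
    {"cachePoint": {"type": "default"}},
]
```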


Gemini's context caching is a fundamentally different feature (resource lifecycle management) and probably out of scope here.

Proposed user-facing API

The proposal has two complementary parts, matching the two kinds of control found in the ecosystem.

1. Block-level cache breakpoints

Since any-llm uses OpenAI's Chat Completion format as its canonical representation, caching intent can be expressed as a cache_control key on content block dicts. This composes naturally: content blocks are already dicts, and extra keys are ignored by providers that don't use them.

```python
messages = [
    {"role": "system", "content": [
        {"type": "text", "text": "You are a helpful assistant.", "cache_control": "auto"}
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "< long document >", "cache_control": "extended"},
        {"type": "text", "text": "What does this document say?"}
    ]},
]
```

cache_control takes a semantic enum value rather than a provider-specific setting, similar to how any-llm already handles reasoning_effort. The value expresses caching intent and each provider maps it to its native semantics on a best-effort basis:

| cache_control | Anthropic / Vertex AI | Bedrock (Converse API) | OpenAI / Others |
|---|---|---|---|
| "auto" | {"type": "ephemeral"} (5 min TTL) | {"cachePoint": {"type": "default"}} inserted after the block | Stripped |
| "extended" | {"type": "ephemeral", "ttl": "1h"} | {"cachePoint": {"type": "default", "ttl": "1h"}} inserted after the block | Stripped |

Sources for native formats:

  • Anthropic cache_control: Prompt caching docs. Uses {"type": "ephemeral"} on content blocks, with an optional "ttl": "1h" for extended caching.
  • Bedrock cachePoint: CachePointBlock API reference. Uses {"cachePoint": {"type": "default"}} as a separate content block, with an optional "ttl": "1h".

Block-level hints are stripped for providers that cache automatically (OpenAI, DeepSeek, xAI) since they don't need or accept breakpoint markers.
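
To make the mapping concrete, here is a minimal sketch of the per-block translation. convert_cache_control and the provider identifiers are hypothetical names rather than existing any-llm internals, and the rest of the OpenAI-to-native block conversion is assumed to happen elsewhere.

```python
from typing import Any


def convert_cache_control(block: dict[str, Any], provider: str) -> list[dict[str, Any]]:
    """Translate the semantic cache_control hint on a single content block.

    Returns the block(s) to emit in the provider-native message; for Bedrock the
    breakpoint becomes an extra cachePoint block appended after the original one.
    """
    hint = block.get("cache_control")
    if hint is None:
        return [block]

    # Copy the block without the any-llm-specific hint.
    native = {k: v for k, v in block.items() if k != "cache_control"}

    if provider in ("anthropic", "vertexai"):
        cache_control: dict[str, Any] = {"type": "ephemeral"}
        if hint == "extended":
            cache_control["ttl"] = "1h"
        return [{**native, "cache_control": cache_control}]

    if provider == "bedrock":
        cache_point: dict[str, Any] = {"type": "default"}
        if hint == "extended":
            cache_point["ttl"] = "1h"
        return [native, {"cachePoint": cache_point}]

    # Providers that cache automatically (OpenAI, DeepSeek, xAI, ...) get the hint stripped.
    return [native]
```

For example, a block marked "auto" targeting Bedrock comes back as the original block followed by {"cachePoint": {"type": "default"}}, matching the table above.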

2. Request-level cache parameters

OpenAI exposes request-level caching controls that are orthogonal to block-level breakpoints. They don't tell the provider where to cache, but how to manage the cache:

  • prompt_cache_retention: How long to keep cached prefixes active. OpenAI accepts "in-memory" (5-10 min default) and "24h" (extended).
  • prompt_cache_key: A routing hint (max 64 chars) to improve cache hit rates when many requests share common prefixes.

These would live on CompletionParams, not on content blocks:

```python
completion = await client.acompletion(
    model="gpt-4o",
    messages=[...],
    prompt_cache_retention="24h",
    prompt_cache_key="user-123",
)
```

The OpenResponses spec defines both prompt_cache_key and prompt_cache_retention on CreateResponseBody. any-llm's ResponsesParams already mirrors these fields. Using the same names on CompletionParams keeps the two APIs consistent and aligned with the spec. OpenAI's Chat Completions endpoint also accepts both parameters, so the mapping is direct: no translation is needed for the primary consumer.

OpenResponses spec: CreateResponseBody schema, PromptCacheRetentionEnum
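
On the implementation side, the two fields could be declared roughly as below. This is a sketch assuming a pydantic-style CompletionParams model (the actual any-llm definition may differ); the literal values mirror the OpenAI/OpenResponses values discussed above.

```python
from typing import Literal

from pydantic import BaseModel, Field


class CompletionParams(BaseModel):
    # ... existing completion fields elided ...

    prompt_cache_key: str | None = Field(
        default=None,
        max_length=64,  # OpenAI caps the routing hint at 64 characters
        description="Routing hint to improve cache hit rates for shared prefixes.",
    )
    prompt_cache_retention: Literal["in-memory", "24h"] | None = Field(
        default=None,
        description="How long cached prefixes stay active.",
    )
```

For OpenAI both fields pass through to the Chat Completions request unchanged; providers without request-level controls would presumably just drop them.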


Obviously these are just suggestions; I researched the topic a bit, but a fresh perspective on the design is super important here. Feel free to share your thoughts on whether you want to support this and how.

Related issues

While researching cache control support across providers, I found that Claude on Azure (Microsoft Foundry) is not supported by any-llm at all. This is tracked as a separate feature request: #804. Once that provider exists, it would be great to have it automatically benefit from any block-level cache control work done here.
