Bug Description
When using AgentWorkflow with output_cls (structured output) and cache_idx set on the Anthropic LLM, the API returns:
```
invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.
```
Root Cause
Two issues in llama_index/llms/anthropic/utils.py:
1. blocks_to_anthropic_blocks stamps cache_control on every block with no cap
When a message has cache_control in its additional_kwargs (injected by cache_idx), blocks_to_anthropic_blocks creates a global_cache_control and applies it to every TextBlock, ImageBlock, ToolUseBlock, etc. in that message:
```python
# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it
```

This is fine for typical messages with 1–2 blocks, but `AgentWorkflow.generate_structured_response()` flattens the entire conversation history into many `TextBlock`s in a single `ChatMessage`. In my case this produces ~29 blocks in one message, all stamped with `cache_control`, exceeding Anthropic's limit of 4.
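A minimal, self-contained sketch of the failure mode (standing in for the library code, with plain dicts in place of the Anthropic SDK's block params): every flattened block in the message receives the message-level `cache_control`, so the count of cached blocks grows with the conversation.

```python
def stamp_every_block(texts, cache_control=None):
    """Mimics the current behavior: cache_control is applied per block, with no cap."""
    anthropic_blocks = []
    for text in texts:
        block = {"type": "text", "text": text}
        if cache_control:
            block["cache_control"] = cache_control  # stamped on every block
        anthropic_blocks.append(block)
    return anthropic_blocks


# 29 flattened history blocks in one message, as in the reported failure
blocks = stamp_every_block(
    [f"turn {i}" for i in range(29)],
    cache_control={"type": "ephemeral"},
)
stamped = sum(1 for b in blocks if "cache_control" in b)
print(stamped)  # 29 — well over Anthropic's limit of 4 cache breakpoints
```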
2. System prompt cache_control is silently discarded
In messages_to_anthropic_messages, system messages are extracted as plain strings, discarding any cache_control markers that were set:
```python
# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost
```

So even when `cache_idx` covers the system message, the `cache_control` is set but then thrown away when the system prompt is extracted as a joined string.
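To illustrate the loss concretely, here is a small sketch using hypothetical block dicts in place of the library's `TextBlock` objects: once the texts are joined into one string, there is nowhere left for the `cache_control` marker to live.

```python
# Hypothetical system-message blocks; the first carries a cache marker.
system_blocks = [
    {
        "type": "text",
        "text": "You are a helpful assistant.",
        "cache_control": {"type": "ephemeral"},
    },
    {"type": "text", "text": "Always cite sources."},
]

# The extraction joins only the text fields, dropping everything else.
system_prompt = "\n".join(b["text"] for b in system_blocks)
print(repr(system_prompt))  # plain string — the cache marker is gone
```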
Steps to Reproduce
```python
from llama_index.llms.anthropic import Anthropic
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.tools import FunctionTool
from pydantic import BaseModel


class MyOutput(BaseModel):
    result: str


def my_tool(query: str) -> str:
    """Look something up."""
    return f"answer to {query}"


llm = Anthropic(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    cache_idx=1,  # enable prompt caching
)

agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",
    output_cls=MyOutput,  # triggers generate_structured_response
)

import asyncio


async def run():
    result = await agent.run(user_msg="Look up foo, then bar, then baz")
    print(result)


asyncio.run(run())
```

After a few tool call rounds, `generate_structured_response()` flattens the conversation into many `TextBlock`s in one message. With `cache_idx=1`, all blocks get `cache_control`, and the Anthropic API rejects the request.
Relevant Logs/Tracebacks
```
anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'A maximum of 4 blocks with cache_control may be provided. Found 29.'}}
```
Suggested Fix
In blocks_to_anthropic_blocks, only apply cache_control to the last block in the message (matching Anthropic's recommended pattern for cache breakpoints), rather than every block:
```python
# After building all anthropic_blocks:
if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control
```

For the system prompt issue, `messages_to_anthropic_messages` could return the system prompt as a list of content blocks (preserving `cache_control`) instead of a joined plain string when cache markers are present.
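A hedged sketch of the proposed behavior (illustrative only, not a drop-in patch; plain dicts stand in for the SDK's block params): a single cache breakpoint is placed on the last block of the message, so the count stays constant no matter how many blocks the message is flattened into.

```python
def stamp_last_block_only(texts, cache_control=None):
    """Proposed behavior: one cache breakpoint per message, on the final block."""
    anthropic_blocks = [{"type": "text", "text": t} for t in texts]
    if cache_control and anthropic_blocks:
        # Mark only the last block, matching Anthropic's recommended
        # pattern of placing a breakpoint at the end of the cached prefix.
        anthropic_blocks[-1]["cache_control"] = cache_control
    return anthropic_blocks


blocks = stamp_last_block_only(
    [f"turn {i}" for i in range(29)], {"type": "ephemeral"}
)
stamped = sum(1 for b in blocks if "cache_control" in b)
print(stamped)  # 1 — within the 4-breakpoint limit

# For the system prompt, the Messages API also accepts a list of text
# blocks instead of a plain string, which would let cache_control survive:
system = [
    {
        "type": "text",
        "text": "You are a helpful assistant.",
        "cache_control": {"type": "ephemeral"},
    },
]
```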
Environment
- llama-index-llms-anthropic version: 0.10.10
- Python 3.12
- Anthropic API