Commit e8b3e36

docs: Add comprehensive research report on Claude Code token limit errors

Add deep research report on handling 'Claude's Response Exceeded the 4096 Output Token Maximum' error in Claude Code via Amazon Bedrock. Includes root causes, best practices, mitigation strategies, configuration templates, and monitoring recommendations. Add journal entry linking to new report page.

1 parent 3c0190c · commit e8b3e36

2 files changed: +137 −1 lines changed
journals/2025_11_04.md

Lines changed: 2 additions & 1 deletion

@@ -2,4 +2,5 @@
  - [[Chrome/Extension]] [[Browser/Extension]] #Markdown
  - Evaluating [GitHub - deathau/markdownload: A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.](https://github.com/deathau/markdownload)
  - [[Person/Gordon Pedersen]] created - Software developer and maintainer of markdownload browser extension, known online as death.au
- - [[CursorAI/Feature/Browser Control]] - Created page documenting Cursor's browser control feature through MCP server, allowing AI to interact with web pages directly
+ - [[CursorAI/Feature/Browser Control]] - Created page documenting Cursor's browser control feature through MCP server, allowing AI to interact with web pages directly
+ - [[Claude Code/EnvVar/CLAUDE_CODE_MAX_OUTPUT_TOKENS/Report/Avoid api error token max on bedrock]] - Created comprehensive research report on handling token limit errors in Claude Code via Amazon Bedrock
Lines changed: 135 additions & 0 deletions

@@ -0,0 +1,135 @@
tags:: [[Report]], [[AI Deep Research]], [[Claude Code]]
alias:: [[Claude Code/EnvVar/CLAUDE_CODE_MAX_OUTPUT_TOKENS/Report/Avoid api error token max on bedrock]]
- # Handling "Claude's Response Exceeded the 4096 Output Token Maximum" in Claude Code via Amazon Bedrock: Causes, Best Practices, and Mitigation Strategies
- ## Overview
- The error message:
- ```
  API Error: Claude's response exceeded the 4096 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
  ```
- arises when using Claude Code on Amazon Bedrock, particularly in development environments that combine Docker/devcontainer setups with the Bedrock SDK. This report analyzes the factors behind the error and covers configuration and environment variables, mitigation strategies, regional and model-specific nuances, cost and throttling considerations, and monitoring and observability best practices. Actionable configuration templates, code samples, and links to primary documentation and community threads are included.
- ## Root Causes of Token Overrun Errors
- ### 1. Interaction Between Agent Behaviors and Token Budgets
- **Agent Output Patterns:** Claude Code agents, especially when performing complex tasks such as file generation or extended multi-step reasoning (e.g., chain-of-thought with tool usage), may produce outputs that exceed the default output token cap.
- **Default Output Cap:** Bedrock's standard configuration sets a 4096-token output maximum to manage resource consumption and model stability.
- **Overrun Mechanism:** When the agent's generated response would exceed this threshold, Bedrock rejects the response with the error above rather than truncating or streaming by default.
- ### 2. Bedrock and Region/Model-Specific Token Caps
- **Global Defaults:** As of the latest documentation, the 4096-token output limit applies universally to Claude models accessed via Bedrock, though it may differ per model version or AWS region as new releases occur. Always check the [Bedrock Service Limits][1] for the latest values.
- **Variance:** For newer Claude versions or other models supported in Bedrock, the maximum may be higher or adjustable, but platform-side enforcement remains strict: exceeding the cap results in an immediate API error.
- ### 3. Configuration Mismatch
- **Environment Variables vs. SDK Parameters:** Setting the `CLAUDE_CODE_MAX_OUTPUT_TOKENS` environment variable configures the agent's desired output length, but the Bedrock SDK request must also specify a compatible `maxTokens` value (naming differs across SDKs: `maxTokens`, `max_tokens`, `max_output_tokens`, etc.).
- **Thinking Budget:** Settings such as `MAX_THINKING_TOKENS` or `thinking.budget_tokens` govern extended reasoning steps; if misaligned with the output settings, they can cause unexpected budget breaches.
- ## Best Practices and Configuration Guidelines
- ### 1. Mapping Environment Variables and SDK Parameters
- Ensure environment settings flow through to SDK calls.
- **.env example:**
- ```
  CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
  MAX_THINKING_TOKENS=1024
  ```
- **docker-compose snippet:**
- ```yaml
  services:
    claude-code:
      environment:
        - CLAUDE_CODE_MAX_OUTPUT_TOKENS=${CLAUDE_CODE_MAX_OUTPUT_TOKENS:-4096}
        - MAX_THINKING_TOKENS=${MAX_THINKING_TOKENS:-1024}
        - CLAUDE_CODE_USE_BEDROCK=1
        - AWS_REGION=us-east-1
  ```
- **SDK Alignment:** In your application, explicitly set the max-tokens parameter of the Bedrock SDK request in line with `CLAUDE_CODE_MAX_OUTPUT_TOKENS`.
- ```python
  # Example: boto3 (the claude-v2 text-completion body names this field
  # max_tokens_to_sample; newer messages-API models use max_tokens)
  import json
  import os

  import boto3

  bedrock_client = boto3.client("bedrock-runtime")
  response = bedrock_client.invoke_model(
      modelId="anthropic.claude-v2",
      body=json.dumps({
          # ... other parameters ...
          "max_tokens_to_sample": int(os.environ["CLAUDE_CODE_MAX_OUTPUT_TOKENS"]),
      }),
  )
  ```
- **Consistency:** Failing to align these settings can cause the agent to overrun the hard cap.
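To keep the environment variable and the SDK request from drifting apart, the configured value can be resolved and clamped in one place before it reaches the request body. A minimal sketch — the `resolve_max_tokens` helper and its clamping policy are illustrative, not part of Claude Code or the Bedrock SDK:

```python
import os

def resolve_max_tokens(default: int = 4096, hard_cap: int = 4096) -> int:
    """Read CLAUDE_CODE_MAX_OUTPUT_TOKENS and clamp it to the platform-side
    cap so the SDK request can never ask for more than Bedrock allows."""
    raw = os.environ.get("CLAUDE_CODE_MAX_OUTPUT_TOKENS", str(default))
    try:
        requested = int(raw)
    except ValueError:
        requested = default  # fall back on malformed values
    return max(1, min(requested, hard_cap))
```

Passing the resolved value as the request's max-tokens field guarantees both layers always agree.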
- ### 2. Recommended Token Budgets
- **Default workloads:** Use `4096` tokens for output and `1024` for thinking, per [Anthropic][2] and [AWS][1] guidance.
- **Heavy workloads (file generation, multi-tool):** Raise `CLAUDE_CODE_MAX_OUTPUT_TOKENS` (e.g., `16384`) and `MAX_THINKING_TOKENS` (e.g., `8192`) for workflows that require it, but monitor throughput and costs closely.
- ## Mitigation Strategies
- ### 1. Chunking and Progressive Output
- **Prompt Chaining:** Instruct the agent to generate output in logical chunks (e.g., "Write Part 1 of N…") or section by section (per file or module).
- **Partial Output Consumption:** Use prompt engineering to state the maximum allowed output, guiding the agent to produce responses that fit within the cap.
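The chunking pattern above can be sketched as a loop that requests one part per call, so each response stays under the cap. Here `generate` stands in for any prompt-to-text Bedrock invocation and is an assumption, not a Claude Code API:

```python
def generate_in_chunks(generate, task: str, parts: int) -> str:
    """Request one logical part per call so each response stays under
    the output token cap; `generate` is any prompt -> text callable."""
    chunks = []
    for i in range(1, parts + 1):
        prompt = (f"{task}\n"
                  f"Write part {i} of {parts} only, then stop.")
        chunks.append(generate(prompt))
    return "".join(chunks)

# Stubbed model call for illustration; swap in a real Bedrock invocation.
text = generate_in_chunks(lambda p: p.splitlines()[-1] + "\n", "Generate the module.", 3)
```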
- ### 2. Streaming Output
- **Use Streaming APIs:** Where supported (Bedrock streaming endpoints), consume output as a stream so that partial results can be buffered and the call retried with continuation prompts when the cap is hit. The Bedrock Python SDK and some HTTP endpoints support response streaming [3].
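A minimal consumer for a streamed Bedrock response might look like the following. The per-chunk JSON shape (a `completion` delta, as used by the claude-v2 text-completion stream) is an assumption to verify against the SDK docs for your model:

```python
import json

def collect_stream(event_stream) -> str:
    """Accumulate text deltas from an invoke_model_with_response_stream
    body; each event carries a JSON payload under chunk/bytes."""
    pieces = []
    for event in event_stream:
        payload = json.loads(event["chunk"]["bytes"])
        if payload.get("completion"):
            pieces.append(payload["completion"])
    return "".join(pieces)

# With a real client (sketch):
#   resp = boto3.client("bedrock-runtime").invoke_model_with_response_stream(
#       modelId="anthropic.claude-v2", body=request_body)
#   text = collect_stream(resp["body"])
```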
- ### 3. Programmatic Retries
- **Error Handling Logic:** Catch the specific token-exceeded exception, then automatically retry with a smaller output request, or ask for a summary/condensed output where appropriate.
- ```python
  # call_claude_code and ExceededTokenLimitError are application-level
  # placeholders wrapping the Bedrock SDK call and its error mapping.
  try:
      response = call_claude_code(payload, max_tokens=4096)
  except ExceededTokenLimitError:
      # Fallback: ask for a condensed answer within a smaller budget.
      payload["system_prompt"] = "Please summarize the result in under 800 tokens."
      response = call_claude_code(payload, max_tokens=800)
  ```
- **Context Truncation:** Where possible, reduce the prompt context size to stay under budget.
- ### 4. Agent and Prompt Design
- **System Prompt Constraints:** Add explicit system instructions, e.g., "Limit all responses to N tokens or less. Only output the requested files."
- ## Cost and Throttling Implications
- **Cost Scaling:** Raising token caps significantly increases inference cost per call; every input and output token is billable [1].
- **Throttling/Quotas:** Larger requests can trigger Bedrock burst/budget throttling sooner, especially in high-QPS (queries-per-second) scenarios. Use application-level rate limiting and monitor Bedrock quota usage to avoid disruptions [4].
- **Per-Workflow Tuning:** Avoid setting a high global token cap; instead, provide per-workflow overrides through feature flags or environment config injection only for high-need scenarios.
- ## Monitoring and Observability
- **Key Metrics:**
	- Number of requests hitting token cap errors
	- Requested vs. returned tokens per workflow
	- Model version, region, and user/session context for error analysis
	- Cost tracking: tokens generated and effective cost per workflow
- **Dashboarding:** Implement dashboards (e.g., with CloudWatch, Datadog, or Prometheus/Grafana) to correlate token overruns with workload patterns.
- **Alerting:** Set up alerts for high error rates due to token cap breaches and for cost anomalies when raising token limits.
- ## Production Configuration Templates
- ### Minimal: Baseline
- **.env**
- ```
  CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
  MAX_THINKING_TOKENS=1024
  ```
- **docker-compose.yml**
- ```yaml
  environment:
    - CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
    - MAX_THINKING_TOKENS=1024
    - CLAUDE_CODE_USE_BEDROCK=1
    - AWS_REGION=us-east-1
  ```
- ### Heavy/Multifile Workload
- **.env**
- ```
  CLAUDE_CODE_MAX_OUTPUT_TOKENS=16384
  MAX_THINKING_TOKENS=8192
  ```
- **SDK parameter:**
- ```python
  "maxTokens": 16384
  ```
- ## Community and Documentation References
- **Anthropic Claude Code on Bedrock:** [Claude on Bedrock documentation][2]
- **AWS Bedrock Service Limits:** [Bedrock Service Quotas][1]
- **SDK Parameter Docs:** See the language-specific SDK (Python example: [Boto3 Bedrock Docs][3])
- **Community Threads:**
	- [GitHub: Claude output too large error thread][5]
	- [AWS re:Post: Output token cap discussions][6]
	- [Anthropic Discourse: Output limits & mitigation][7]
- ## Summary of Actionable Recommendations
- Set a sensible default of 4096 output tokens unless workloads frequently exceed it.
- Always ensure that environment configuration (env vars) and SDK request parameters (`maxTokens` and equivalents) are synchronized.
- Use explicit system prompts and chunking for large or complex output.
- Implement programmatic retries and fallback prompts on token cap errors.
- Monitor request/response sizes, error rates, and costs per workflow.
- Raise token caps selectively and watch for cost and throttling issues.
- Consult primary documentation and community forums for SDK-specific behaviors and updates.
- ## Sources
- [1] AWS Bedrock Service Quotas: https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
- [2] Anthropic Claude Code via Amazon Bedrock: https://docs.anthropic.com/claude/docs/amazon-bedrock
- [3] Boto3 AWS SDK for Bedrock - Model Invocation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime.html
- [4] AWS Bedrock Billing and Quotas FAQ: https://aws.amazon.com/bedrock/faqs/
- [5] GitHub Issue: Claude output too large (Token Limit Error): https://github.com/anthropics/claude-code/issues/21
- [6] AWS re:Post Discussion on Output Token Limits: https://repost.aws/questions/QU93whOKytQYOnaCWusbti0w
- [7] Anthropic Community: Output Token Limit Discussion: https://community.anthropic.com/t/how-to-handle-output-token-limits