fix: cap max_output_tokens when using max_tokens fallback #2264
Conversation
Some providers (e.g., OpenRouter) use 'max_tokens' to represent the total
context window rather than the output limit. When litellm model info has
max_tokens=200000 but max_output_tokens=None, the code was falling back
to using max_tokens directly. This caused requests to ask for 200000
output tokens, which exceeds the context limit when combined with any
input tokens.
The fix caps the fallback to DEFAULT_MAX_OUTPUT_TOKENS_CAP (16384), which
is a safe default that works for most models.
Error being fixed:
Hard context reset summarization failed with exception:
litellm.BadRequestError: OpenrouterException -
{"error":{"message":"This endpoint's maximum context length is
200000 tokens. However, you requested about 200648 tokens (648 of text
input, 200000 in the output)."}}
Co-authored-by: openhands <openhands@all-hands.dev>
API breakage checks (Griffe): Failed
Agent server REST API breakage checks (OpenAPI): Passed
Coverage Report
all-hands-bot left a comment:
🟢 Good taste - Clean fix for a real production bug.
# Some providers use 'max_tokens' for the total context window, not output limit.
# This cap prevents requesting output that exceeds the context window.
# 16384 is a safe default that works for most models (GPT-4o: 16k, Claude: 8k).
DEFAULT_MAX_OUTPUT_TOKENS_CAP: Final[int] = 16384
🟢 Pragmatic: 16384 is a sensible default that works for common models (GPT-4o: 16k, Claude: 8k). This solves the real problem where uncapped values cause OpenRouter errors.
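To make the reviewer's point concrete, a quick sanity check with the token counts from the OpenrouterException quoted earlier shows why the cap fixes the request:

```python
# Arithmetic from the error message: "648 of text input, 200000 in the output".
CONTEXT_WINDOW = 200_000
INPUT_TOKENS = 648
UNCAPPED_OUTPUT = 200_000   # the buggy fallback value (max_tokens)
CAPPED_OUTPUT = 16_384      # DEFAULT_MAX_OUTPUT_TOKENS_CAP

# Old behavior: total exceeds the window, so OpenRouter rejects the request.
assert INPUT_TOKENS + UNCAPPED_OUTPUT > CONTEXT_WINDOW

# New behavior: the capped total fits comfortably.
assert INPUT_TOKENS + CAPPED_OUTPUT <= CONTEXT_WINDOW
```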
# 'max_tokens' is ambiguous: some providers use it for total
# context window, not output limit. Cap it to avoid requesting
# output that exceeds the context window.
max_tokens_value = self._model_info.get("max_tokens")
assert isinstance(max_tokens_value, int)  # for type checker
self.max_output_tokens = min(
    max_tokens_value, DEFAULT_MAX_OUTPUT_TOKENS_CAP
)
if max_tokens_value > DEFAULT_MAX_OUTPUT_TOKENS_CAP:
    logger.debug(
        "Capping max_output_tokens from %s to %s for %s "
        "(max_tokens may be context window, not output)",
        max_tokens_value,
        self.max_output_tokens,
        self.model,
    )
🟢 Good Design: The capping logic is clean and preserves backward compatibility. Explicitly set max_output_tokens values are still respected (tested in test_explicit_max_output_tokens_not_overridden), so users can override if needed. Debug logging helps troubleshoot edge cases.
# max_output_tokens Capping Tests


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_capped_when_using_max_tokens_fallback(mock_get_model_info):
    """Test that max_output_tokens is capped when falling back to max_tokens.

    Some providers (e.g., OpenRouter) set max_tokens to the context window size
    rather than the output limit. Without capping, this could request output
    that exceeds the context window.

    See: https://github.com/OpenHands/software-agent-sdk/issues/XXX
    """
    from openhands.sdk.llm.llm import DEFAULT_MAX_OUTPUT_TOKENS_CAP

    # Simulate a model where max_tokens = context window (200k) but
    # max_output_tokens is not set
    mock_get_model_info.return_value = {
        "max_tokens": 200000,  # This is the context window, not output limit
        "max_output_tokens": None,
        "max_input_tokens": 200000,
    }

    llm = LLM(
        model="openrouter/anthropic/claude-3-haiku",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # max_output_tokens should be capped, not set to 200000
    assert llm.max_output_tokens is not None
    assert llm.max_output_tokens == DEFAULT_MAX_OUTPUT_TOKENS_CAP
    assert llm.max_output_tokens < 200000


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_uses_actual_value_when_available(mock_get_model_info):
    """Test that actual max_output_tokens is used when available."""
    # Simulate a model with proper max_output_tokens
    mock_get_model_info.return_value = {
        "max_tokens": 8192,
        "max_output_tokens": 8192,
        "max_input_tokens": 200000,
    }

    llm = LLM(
        model="anthropic/claude-3-5-sonnet-latest",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # Should use the actual max_output_tokens, not capped
    assert llm.max_output_tokens == 8192


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_small_max_tokens_not_capped(mock_get_model_info):
    """Test that small max_tokens fallback is not unnecessarily capped."""
    from openhands.sdk.llm.llm import DEFAULT_MAX_OUTPUT_TOKENS_CAP

    # Simulate a model where max_tokens is small (actual output limit)
    mock_get_model_info.return_value = {
        "max_tokens": 4096,  # This is the actual output limit
        "max_output_tokens": None,
        "max_input_tokens": None,
    }

    llm = LLM(
        model="openrouter/test/small-model",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # Should use the actual value since it's below the cap
    assert llm.max_output_tokens == 4096
    assert llm.max_output_tokens < DEFAULT_MAX_OUTPUT_TOKENS_CAP


def test_explicit_max_output_tokens_not_overridden():
    """Test that explicitly set max_output_tokens is respected."""
    llm = LLM(
        model="gpt-4o",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
        max_output_tokens=32768,  # Explicitly set higher than cap
    )

    # Should respect the explicit value
    assert llm.max_output_tokens == 32768
🟢 Solid Tests: Comprehensive coverage of the capping logic with real behavior tests (not mocks). Tests verify:
- ✅ Capping when fallback > cap
- ✅ Using actual max_output_tokens when available
- ✅ Not capping when fallback < cap
- ✅ Explicit values override the cap
These tests would catch regressions.
enyst left a comment:
LLM Land is terrible 😅
Thank you!
… test: The test was added in OpenHands#2264; replace the invalid issues/XXX URL with the PR link. (Made-with: Cursor)
Summary
Fixes a production error where context reset summarization fails with litellm.BadRequestError: OpenrouterException due to requesting more output tokens than the context window allows.

Root Cause:
Some providers (e.g., OpenRouter) use max_tokens in their model info to represent the total context window rather than the output limit. When litellm model info has max_tokens=200000 but max_output_tokens=None, the code was falling back to using max_tokens directly. This caused requests to ask for 200,000 output tokens, which exceeds the context limit when combined with any input tokens.

Example problematic model: openrouter/anthropic/claude-3-haiku
- max_tokens: 200,000 (context window)
- max_output_tokens: None
- max_input_tokens: None

Error being fixed:

The Fix:
- Adds a DEFAULT_MAX_OUTPUT_TOKENS_CAP = 16384 constant, a safe default that works for most models (GPT-4o: 16k, Claude: 8k)
- Caps the max_tokens fallback to this value
- Explicitly set max_output_tokens values are respected and not overridden

Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:65ed3fe-python

Run
All tags pushed for this build

About Multi-Architecture Support
- The versioned tag (65ed3fe-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (65ed3fe-python-amd64) are also available if needed