
fix: cap max_output_tokens when using max_tokens fallback#2264

Merged
csmith49 merged 2 commits into main from fix/cap-max-output-tokens-fallback
Mar 2, 2026

Conversation

@csmith49 csmith49 (Collaborator) commented Mar 2, 2026

Summary

Fixes a production error where context reset summarization fails with litellm.BadRequestError: OpenrouterException due to requesting more output tokens than the context window allows.

Root Cause:
Some providers (e.g., OpenRouter) use max_tokens in their model info to represent the total context window rather than the output limit. When litellm model info has max_tokens=200000 but max_output_tokens=None, the code was falling back to using max_tokens directly. This caused requests to ask for 200,000 output tokens, which exceeds the context limit when combined with any input tokens.

Example problematic model: openrouter/anthropic/claude-3-haiku

  • max_tokens: 200,000 (context window)
  • max_output_tokens: None
  • max_input_tokens: None

Error being fixed:

Hard context reset summarization failed with exception: litellm.BadRequestError: 
OpenrouterException - {"error":{"message":"This endpoint's maximum context length is 
200000 tokens. However, you requested about 200648 tokens (648 of text input, 200000 
in the output)."}}
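The numbers in the error message make the overflow concrete. A quick check (values taken directly from the message above):

```python
# Reproducing the arithmetic from the error above: when the uncapped
# max_tokens fallback requests the full context window as output, any
# input tokens at all push the request past the context limit.
context_window = 200_000
input_tokens = 648            # "648 of text input"
requested_output = 200_000    # uncapped fallback = full context window

total_requested = input_tokens + requested_output
print(total_requested)                   # 200648, matching the error
print(total_requested > context_window)  # True: the provider rejects it
```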

The Fix:

  1. Added a DEFAULT_MAX_OUTPUT_TOKENS_CAP = 16384 constant, a safe default that works for most models (GPT-4o: 16k, Claude: 8k)
  2. Modified the fallback logic to cap the max_tokens fallback to this value
  3. Added debug logging when capping occurs
  4. Ensured that explicitly set max_output_tokens values are respected and not overridden

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image                                  Docs / Tags
java     amd64, arm64   eclipse-temurin:17-jdk                      Link
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22  Link
golang   amd64, arm64   golang:1.21-bookworm                        Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:65ed3fe-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-65ed3fe-python \
  ghcr.io/openhands/agent-server:65ed3fe-python

All tags pushed for this build

ghcr.io/openhands/agent-server:65ed3fe-golang-amd64
ghcr.io/openhands/agent-server:65ed3fe-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:65ed3fe-golang-arm64
ghcr.io/openhands/agent-server:65ed3fe-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:65ed3fe-java-amd64
ghcr.io/openhands/agent-server:65ed3fe-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:65ed3fe-java-arm64
ghcr.io/openhands/agent-server:65ed3fe-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:65ed3fe-python-amd64
ghcr.io/openhands/agent-server:65ed3fe-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:65ed3fe-python-arm64
ghcr.io/openhands/agent-server:65ed3fe-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:65ed3fe-golang
ghcr.io/openhands/agent-server:65ed3fe-java
ghcr.io/openhands/agent-server:65ed3fe-python

About Multi-Architecture Support

  • Each variant tag (e.g., 65ed3fe-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 65ed3fe-python-amd64) are also available if needed

Some providers (e.g., OpenRouter) use 'max_tokens' to represent the total
context window rather than the output limit. When litellm model info has
max_tokens=200000 but max_output_tokens=None, the code was falling back
to using max_tokens directly. This caused requests to ask for 200000
output tokens, which exceeds the context limit when combined with any
input tokens.

The fix caps the fallback to DEFAULT_MAX_OUTPUT_TOKENS_CAP (16384), which
is a safe default that works for most models.

Error being fixed:
  Hard context reset summarization failed with exception:
  litellm.BadRequestError: OpenrouterException -
  {"error":{"message":"This endpoint's maximum context length is
  200000 tokens. However, you requested about 200648 tokens (648 of text
  input, 200000 in the output)."}}

Co-authored-by: openhands <openhands@all-hands.dev>
github-actions bot (Contributor) commented Mar 2, 2026

API breakage checks (Griffe)

Result: Failed

Log excerpt (first 1000 characters)

============================================================
Checking openhands-sdk (openhands.sdk)
============================================================
Comparing openhands-sdk 1.11.5 against 1.11.4
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): load_public_skills
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): temperature
::warning file=openhands-sdk/openhands/sdk/llm/llm.py,line=196,title=LLM.top_p::Attribute value was changed: `Field(default=1.0, ge=0, le=1)` -> `Field(default=None, ge=0, le=1, description='Nucleus sampling parameter. Defaults to None (uses provider default). Set to a value between 0 and 1 to control diversity of outputs.')`
::error title=SemVer::Breaking changes detected (1); require at least minor version bump from 1.11.x, but new is 1.11.5

============================================================
Checking openhands-workspace (openhands.workspace)
============================

Action log

github-actions bot (Contributor) commented Mar 2, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

github-actions bot (Contributor) commented Mar 2, 2026

Coverage

Coverage Report

File                               Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/llm
   llm.py                          480    77    83%    430, 483, 704, 810, 812–813, 841, 887, 898–900, 904–908, 916–918, 928–930, 933–934, 938, 940–941, 943, 1141–1142, 1339–1340, 1349, 1362, 1364–1369, 1371–1388, 1391–1395, 1397–1398, 1404–1413, 1464, 1466
TOTAL                              19058  5688  70%

@csmith49 csmith49 marked this pull request as ready for review March 2, 2026 19:07
@all-hands-bot all-hands-bot (Collaborator) left a comment


🟢 Good taste - Clean fix for a real production bug.

⚠️ Eval Risk Flag: This changes max_output_tokens behavior for models using the max_tokens fallback. Should run lightweight evals before merge to confirm no unexpected impact on agent performance.

# Some providers use 'max_tokens' for the total context window, not output limit.
# This cap prevents requesting output that exceeds the context window.
# 16384 is a safe default that works for most models (GPT-4o: 16k, Claude: 8k).
DEFAULT_MAX_OUTPUT_TOKENS_CAP: Final[int] = 16384

🟢 Pragmatic: 16384 is a sensible default that works for common models (GPT-4o: 16k, Claude: 8k). This solves the real problem where uncapped values cause OpenRouter errors.

Comment on lines +1121 to +1136
# 'max_tokens' is ambiguous: some providers use it for total
# context window, not output limit. Cap it to avoid requesting
# output that exceeds the context window.
max_tokens_value = self._model_info.get("max_tokens")
assert isinstance(max_tokens_value, int)  # for type checker
self.max_output_tokens = min(
    max_tokens_value, DEFAULT_MAX_OUTPUT_TOKENS_CAP
)
if max_tokens_value > DEFAULT_MAX_OUTPUT_TOKENS_CAP:
    logger.debug(
        "Capping max_output_tokens from %s to %s for %s "
        "(max_tokens may be context window, not output)",
        max_tokens_value,
        self.max_output_tokens,
        self.model,
    )

🟢 Good Design: The capping logic is clean and preserves backward compatibility. Explicitly set max_output_tokens values are still respected (tested in test_explicit_max_output_tokens_not_overridden), so users can override if needed. Debug logging helps troubleshoot edge cases.

Comment on lines +1073 to +1163
# max_output_tokens Capping Tests


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_capped_when_using_max_tokens_fallback(mock_get_model_info):
    """Test that max_output_tokens is capped when falling back to max_tokens.

    Some providers (e.g., OpenRouter) set max_tokens to the context window size
    rather than the output limit. Without capping, this could request output
    that exceeds the context window.

    See: https://github.com/OpenHands/software-agent-sdk/issues/XXX
    """
    from openhands.sdk.llm.llm import DEFAULT_MAX_OUTPUT_TOKENS_CAP

    # Simulate a model where max_tokens = context window (200k) but
    # max_output_tokens is not set
    mock_get_model_info.return_value = {
        "max_tokens": 200000,  # This is the context window, not output limit
        "max_output_tokens": None,
        "max_input_tokens": 200000,
    }

    llm = LLM(
        model="openrouter/anthropic/claude-3-haiku",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # max_output_tokens should be capped, not set to 200000
    assert llm.max_output_tokens is not None
    assert llm.max_output_tokens == DEFAULT_MAX_OUTPUT_TOKENS_CAP
    assert llm.max_output_tokens < 200000


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_uses_actual_value_when_available(mock_get_model_info):
    """Test that actual max_output_tokens is used when available."""
    # Simulate a model with proper max_output_tokens
    mock_get_model_info.return_value = {
        "max_tokens": 8192,
        "max_output_tokens": 8192,
        "max_input_tokens": 200000,
    }

    llm = LLM(
        model="anthropic/claude-3-5-sonnet-latest",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # Should use the actual max_output_tokens, not capped
    assert llm.max_output_tokens == 8192


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_small_max_tokens_not_capped(mock_get_model_info):
    """Test that small max_tokens fallback is not unnecessarily capped."""
    from openhands.sdk.llm.llm import DEFAULT_MAX_OUTPUT_TOKENS_CAP

    # Simulate a model where max_tokens is small (actual output limit)
    mock_get_model_info.return_value = {
        "max_tokens": 4096,  # This is the actual output limit
        "max_output_tokens": None,
        "max_input_tokens": None,
    }

    llm = LLM(
        model="openrouter/test/small-model",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # Should use the actual value since it's below the cap
    assert llm.max_output_tokens == 4096
    assert llm.max_output_tokens < DEFAULT_MAX_OUTPUT_TOKENS_CAP


def test_explicit_max_output_tokens_not_overridden():
    """Test that explicitly set max_output_tokens is respected."""
    llm = LLM(
        model="gpt-4o",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
        max_output_tokens=32768,  # Explicitly set higher than cap
    )

    # Should respect the explicit value
    assert llm.max_output_tokens == 32768



🟢 Solid Tests: Comprehensive coverage of the capping logic with real behavior tests (not mocks). Tests verify:

  1. ✅ Capping when fallback > cap
  2. ✅ Using actual max_output_tokens when available
  3. ✅ Not capping when fallback < cap
  4. ✅ Explicit values override the cap

These tests would catch regressions.

@enyst enyst (Collaborator) left a comment


LLM Land is terrible 😅

Thank you!

@csmith49 csmith49 merged commit ffb6d60 into main Mar 2, 2026
25 of 26 checks passed
@csmith49 csmith49 deleted the fix/cap-max-output-tokens-fallback branch March 2, 2026 21:25
zparnold added a commit to zparnold/software-agent-sdk that referenced this pull request Mar 5, 2026
kushalsai-01 added a commit to kushalsai-01/software-agent-sdk that referenced this pull request Apr 7, 2026
… test

The test was added in OpenHands#2264; replace the invalid issues/XXX URL with the PR link.

Made-with: Cursor

4 participants