fix: cap max_output_tokens when using max_tokens fallback #2264
Conversation
Some providers (e.g., OpenRouter) use 'max_tokens' to represent the total
context window rather than the output limit. When litellm model info has
max_tokens=200000 but max_output_tokens=None, the code was falling back
to using max_tokens directly. This caused requests to ask for 200000
output tokens, which exceeds the context limit when combined with any
input tokens.
The fix caps the fallback to DEFAULT_MAX_OUTPUT_TOKENS_CAP (16384), which
is a safe default that works for most models.
Error being fixed:
Hard context reset summarization failed with exception:
litellm.BadRequestError: OpenrouterException -
{"error":{"message":"This endpoint's maximum context length is
200000 tokens. However, you requested about 200648 tokens (648 of text
input, 200000 in the output)."}}
Co-authored-by: openhands <openhands@all-hands.dev>
API breakage checks (Griffe): Failed
Agent server REST API breakage checks (OpenAPI): Passed
Coverage Report
all-hands-bot left a comment:
🟢 Good taste - Clean fix for a real production bug.
# Some providers use 'max_tokens' for the total context window, not output limit.
# This cap prevents requesting output that exceeds the context window.
# 16384 is a safe default that works for most models (GPT-4o: 16k, Claude: 8k).
DEFAULT_MAX_OUTPUT_TOKENS_CAP: Final[int] = 16384
🟢 Pragmatic: 16384 is a sensible default that works for common models (GPT-4o: 16k, Claude: 8k). This solves the real problem where uncapped values cause OpenRouter errors.
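To make the reviewer's point concrete, a quick sanity check with the token counts from the OpenrouterException quoted earlier shows why the cap fixes the request:

```python
# Arithmetic from the error message: "648 of text input, 200000 in the output".
CONTEXT_WINDOW = 200_000
INPUT_TOKENS = 648
UNCAPPED_OUTPUT = 200_000   # the buggy fallback value (max_tokens)
CAPPED_OUTPUT = 16_384      # DEFAULT_MAX_OUTPUT_TOKENS_CAP

# Old behavior: total exceeds the window, so OpenRouter rejects the request.
assert INPUT_TOKENS + UNCAPPED_OUTPUT > CONTEXT_WINDOW

# New behavior: the capped total fits comfortably.
assert INPUT_TOKENS + CAPPED_OUTPUT <= CONTEXT_WINDOW
```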
# 'max_tokens' is ambiguous: some providers use it for total
# context window, not output limit. Cap it to avoid requesting
# output that exceeds the context window.
max_tokens_value = self._model_info.get("max_tokens")
assert isinstance(max_tokens_value, int)  # for type checker
self.max_output_tokens = min(
    max_tokens_value, DEFAULT_MAX_OUTPUT_TOKENS_CAP
)
if max_tokens_value > DEFAULT_MAX_OUTPUT_TOKENS_CAP:
    logger.debug(
        "Capping max_output_tokens from %s to %s for %s "
        "(max_tokens may be context window, not output)",
        max_tokens_value,
        self.max_output_tokens,
        self.model,
    )
🟢 Good Design: The capping logic is clean and preserves backward compatibility. Explicitly set max_output_tokens values are still respected (tested in test_explicit_max_output_tokens_not_overridden), so users can override if needed. Debug logging helps troubleshoot edge cases.
# max_output_tokens Capping Tests


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_capped_when_using_max_tokens_fallback(mock_get_model_info):
    """Test that max_output_tokens is capped when falling back to max_tokens.

    Some providers (e.g., OpenRouter) set max_tokens to the context window size
    rather than the output limit. Without capping, this could request output
    that exceeds the context window.

    See: https://github.com/OpenHands/software-agent-sdk/issues/XXX
    """
    from openhands.sdk.llm.llm import DEFAULT_MAX_OUTPUT_TOKENS_CAP

    # Simulate a model where max_tokens = context window (200k) but
    # max_output_tokens is not set
    mock_get_model_info.return_value = {
        "max_tokens": 200000,  # This is the context window, not output limit
        "max_output_tokens": None,
        "max_input_tokens": 200000,
    }

    llm = LLM(
        model="openrouter/anthropic/claude-3-haiku",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # max_output_tokens should be capped, not set to 200000
    assert llm.max_output_tokens is not None
    assert llm.max_output_tokens == DEFAULT_MAX_OUTPUT_TOKENS_CAP
    assert llm.max_output_tokens < 200000


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_uses_actual_value_when_available(mock_get_model_info):
    """Test that actual max_output_tokens is used when available."""
    # Simulate a model with proper max_output_tokens
    mock_get_model_info.return_value = {
        "max_tokens": 8192,
        "max_output_tokens": 8192,
        "max_input_tokens": 200000,
    }

    llm = LLM(
        model="anthropic/claude-3-5-sonnet-latest",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # Should use the actual max_output_tokens, not capped
    assert llm.max_output_tokens == 8192


@patch("openhands.sdk.llm.llm.get_litellm_model_info")
def test_max_output_tokens_small_max_tokens_not_capped(mock_get_model_info):
    """Test that small max_tokens fallback is not unnecessarily capped."""
    from openhands.sdk.llm.llm import DEFAULT_MAX_OUTPUT_TOKENS_CAP

    # Simulate a model where max_tokens is small (actual output limit)
    mock_get_model_info.return_value = {
        "max_tokens": 4096,  # This is the actual output limit
        "max_output_tokens": None,
        "max_input_tokens": None,
    }

    llm = LLM(
        model="openrouter/test/small-model",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
    )

    # Should use the actual value since it's below the cap
    assert llm.max_output_tokens == 4096
    assert llm.max_output_tokens < DEFAULT_MAX_OUTPUT_TOKENS_CAP


def test_explicit_max_output_tokens_not_overridden():
    """Test that explicitly set max_output_tokens is respected."""
    llm = LLM(
        model="gpt-4o",
        api_key=SecretStr("test-key"),
        usage_id="test-llm",
        max_output_tokens=32768,  # Explicitly set higher than cap
    )

    # Should respect the explicit value
    assert llm.max_output_tokens == 32768
🟢 Solid Tests: Comprehensive coverage of the capping logic with real behavior tests (not mocks). Tests verify:
- ✅ Capping when fallback > cap
- ✅ Using actual max_output_tokens when available
- ✅ Not capping when fallback < cap
- ✅ Explicit values override the cap
These tests would catch regressions.
enyst left a comment:
LLM Land is terrible 😅
Thank you!
… test: The test was added in OpenHands#2264; replace the invalid issues/XXX URL with the PR link. (Made-with: Cursor)
Summary
Fixes a production error where context reset summarization fails with litellm.BadRequestError: OpenrouterException due to requesting more output tokens than the context window allows.

Root Cause:
Some providers (e.g., OpenRouter) use max_tokens in their model info to represent the total context window rather than the output limit. When litellm model info has max_tokens=200000 but max_output_tokens=None, the code was falling back to using max_tokens directly. This caused requests to ask for 200,000 output tokens, which exceeds the context limit when combined with any input tokens.

Example problematic model: openrouter/anthropic/claude-3-haiku
- max_tokens: 200,000 (context window)
- max_output_tokens: None
- max_input_tokens: None

Error being fixed:

The Fix:
- Adds a DEFAULT_MAX_OUTPUT_TOKENS_CAP = 16384 constant, a safe default that works for most models (GPT-4o: 16k, Claude: 8k)
- Caps the max_tokens fallback to this value
- Explicitly set max_output_tokens values are respected and not overridden

Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:65ed3fe-python

Run
All tags pushed for this build

About Multi-Architecture Support
- The versioned tag (65ed3fe-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (65ed3fe-python-amd64) are also available if needed