
fix(llm): cap auto-detected max_output_tokens when it fills the entire context window #2747

Merged
csmith49 merged 4 commits into main from fix/nemotron-max-output-tokens-headroom on Apr 7, 2026

Conversation

@csmith49
Collaborator

csmith49 commented Apr 7, 2026

What/Why

When litellm's model registry reports max_output_tokens >= max_input_tokens (e.g. Nemotron: both 262144), the SDK would request the entire context window for output, leaving zero tokens for input. Every provider call was rejected, the condenser misinterpreted this as context overflow, and crashed on the near-empty history with NoCondensationAvailableException.

Cap auto-detected max_output_tokens to half the context window when it would otherwise consume the full window. Explicitly user-set values are not affected.

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3d9e6da-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-3d9e6da-python \
  ghcr.io/openhands/agent-server:3d9e6da-python

All tags pushed for this build

ghcr.io/openhands/agent-server:3d9e6da-golang-amd64
ghcr.io/openhands/agent-server:3d9e6da-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:3d9e6da-golang-arm64
ghcr.io/openhands/agent-server:3d9e6da-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:3d9e6da-java-amd64
ghcr.io/openhands/agent-server:3d9e6da-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:3d9e6da-java-arm64
ghcr.io/openhands/agent-server:3d9e6da-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:3d9e6da-python-amd64
ghcr.io/openhands/agent-server:3d9e6da-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:3d9e6da-python-arm64
ghcr.io/openhands/agent-server:3d9e6da-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:3d9e6da-golang
ghcr.io/openhands/agent-server:3d9e6da-java
ghcr.io/openhands/agent-server:3d9e6da-python

About Multi-Architecture Support

  • Each variant tag (e.g., 3d9e6da-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 3d9e6da-python-amd64) are also available if needed

…e context window

When litellm's model registry reports max_output_tokens >= max_input_tokens
(e.g. Nemotron: both 262144), the SDK would request the entire context window
for output, leaving zero tokens for input. Every provider call was rejected,
the condenser misinterpreted this as context overflow, and crashed on the
near-empty history with NoCondensationAvailableException.

Cap auto-detected max_output_tokens to half the context window when it would
otherwise consume the full window. Explicitly user-set values are not affected.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

Collaborator

@all-hands-bot left a comment


🟢 Good taste - Pragmatic fix for broken model registry data.

Analysis:

This solves a real problem: when model registry reports max_output_tokens >= max_input_tokens (Nemotron: both 262144), every LLM call fails because the entire context window is reserved for output, leaving zero room for input.

The fix is minimal and pragmatic: cap auto-detected values to half the context window. This is consistent with existing max_tokens handling (line 1227 already does // 2).

Verdict: ✅ Worth merging - solves a real bug without over-engineering.

Important: This PR affects LLM call behavior and condenser behavior (mentioned in description), which puts it in the eval risk category. A human maintainer should verify via lightweight evals before merging. Using COMMENT review per repo guidelines rather than APPROVE.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-sdk/openhands/sdk/llm/llm.py | 516 | 78 | 84% | 466, 485, 541, 797, 903, 905–906, 934, 980, 991–993, 997–1001, 1009–1011, 1021–1023, 1026–1027, 1031, 1033–1034, 1036, 1260–1261, 1458–1459, 1468, 1481, 1483–1488, 1490–1507, 1510–1514, 1516–1517, 1523–1532, 1587, 1589 |
| TOTAL | 21977 | 6318 | 71% | |

csmith49 and others added 2 commits April 7, 2026 10:30
…e context window

@juanmichelini
Collaborator

and self.max_output_tokens is not None
and self.max_output_tokens >= context_window
):
capped = self.max_output_tokens // 2
Collaborator


I think that's why we sometimes used 4096 or something like that; output tokens are typically not all that much in a single call. This works though! 🤔

It just means the history will be smaller when it hits a context error than if we set a value like 4096, because reserving half leaves less of the window for input.

Collaborator Author


Does setting the max like that encourage models to generate more? Honestly I'm not sure. I'd expect we'll end up with very similarly-sized events as if we had set it at 4096.

Collaborator


🤷 I don't know, I'm thinking about the reverse: setting half means the LLM API provider will error sooner, because I think it adds that value to the input token count at the time of the request.

At least, I'm pretty sure Anthropic and OpenAI do that, and I thought the error message suggested it... I could be wrong though

Collaborator Author


The relevant error message here is:

You passed 2468 input characters and requested 262144 output tokens.
However, the model's context length is only 262144 tokens, resulting in
a maximum input length of 0 tokens (at most 0 characters). Please reduce
the length of the input prompt.

Maybe the "requested" suggests that behavior? In which case this is probably worth escalating to LiteLLM, considering it's their registry that sets the output tokens the way it is.
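The arithmetic behind that error is simple if the provider does reserve the requested output tokens up front, as the message implies. A minimal sketch, assuming that behavior (the function name is illustrative):

```python
def available_input_tokens(context_window: int, requested_output: int) -> int:
    """If the provider reserves the requested output tokens up front,
    the input budget is whatever remains of the context window."""
    return max(context_window - requested_output, 0)
```

With the registry's Nemotron values (window 262144, requested output 262144), this yields an input budget of exactly 0 tokens, matching the provider's "maximum input length of 0 tokens" message.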

Collaborator

@juanmichelini left a comment


Tested and LGTM

csmith49 merged commit f5fcef8 into main Apr 7, 2026
31 of 32 checks passed
csmith49 deleted the fix/nemotron-max-output-tokens-headroom branch April 7, 2026 19:20