
fix(llm): cap auto-detected max_output_tokens when it fills the entire context window #2747

Merged
csmith49 merged 4 commits into main from fix/nemotron-max-output-tokens-headroom on Apr 7, 2026

Conversation

@csmith49
Collaborator

csmith49 commented Apr 7, 2026

What/Why

When litellm's model registry reports max_output_tokens >= max_input_tokens (e.g. Nemotron: both 262144), the SDK would request the entire context window for output, leaving zero tokens for input. Every provider call was rejected, the condenser misinterpreted this as context overflow, and crashed on the near-empty history with NoCondensationAvailableException.

Cap auto-detected max_output_tokens to half the context window when it would otherwise consume the full window. Explicitly user-set values are not affected.

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3d9e6da-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-3d9e6da-python \
  ghcr.io/openhands/agent-server:3d9e6da-python

All tags pushed for this build

ghcr.io/openhands/agent-server:3d9e6da-golang-amd64
ghcr.io/openhands/agent-server:3d9e6da-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:3d9e6da-golang-arm64
ghcr.io/openhands/agent-server:3d9e6da-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:3d9e6da-java-amd64
ghcr.io/openhands/agent-server:3d9e6da-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:3d9e6da-java-arm64
ghcr.io/openhands/agent-server:3d9e6da-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:3d9e6da-python-amd64
ghcr.io/openhands/agent-server:3d9e6da-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:3d9e6da-python-arm64
ghcr.io/openhands/agent-server:3d9e6da-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:3d9e6da-golang
ghcr.io/openhands/agent-server:3d9e6da-java
ghcr.io/openhands/agent-server:3d9e6da-python

About Multi-Architecture Support

  • Each variant tag (e.g., 3d9e6da-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 3d9e6da-python-amd64) are also available if needed

…e context window

When litellm's model registry reports max_output_tokens >= max_input_tokens
(e.g. Nemotron: both 262144), the SDK would request the entire context window
for output, leaving zero tokens for input. Every provider call was rejected,
the condenser misinterpreted this as context overflow, and crashed on the
near-empty history with NoCondensationAvailableException.

Cap auto-detected max_output_tokens to half the context window when it would
otherwise consume the full window. Explicitly user-set values are not affected.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

Collaborator

@all-hands-bot left a comment


🟢 Good taste - Pragmatic fix for broken model registry data.

Analysis:

This solves a real problem: when model registry reports max_output_tokens >= max_input_tokens (Nemotron: both 262144), every LLM call fails because the entire context window is reserved for output, leaving zero room for input.

The fix is minimal and pragmatic: cap auto-detected values to half the context window. This is consistent with existing max_tokens handling (line 1227 already does // 2).

Verdict: ✅ Worth merging - solves a real bug without over-engineering.

Important: This PR affects LLM call behavior and condenser behavior (mentioned in description), which puts it in the eval risk category. A human maintainer should verify via lightweight evals before merging. Using COMMENT review per repo guidelines rather than APPROVE.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-sdk/openhands/sdk/llm/llm.py | 516 | 78 | 84% | 466, 485, 541, 797, 903, 905–906, 934, 980, 991–993, 997–1001, 1009–1011, 1021–1023, 1026–1027, 1031, 1033–1034, 1036, 1260–1261, 1458–1459, 1468, 1481, 1483–1488, 1490–1507, 1510–1514, 1516–1517, 1523–1532, 1587, 1589 |
| TOTAL | 21977 | 6318 | 71% | |

csmith49 and others added 2 commits April 7, 2026 10:30
…e context window

@juanmichelini
Collaborator

and self.max_output_tokens is not None
and self.max_output_tokens >= context_window
):
capped = self.max_output_tokens // 2
Collaborator


I think that's why we sometimes used 4096 or something like that; output tokens are typically not all that much in a single call. This works though! 🤔

It just means the history will be smaller when it hits a context error than if we set a value like 4096, because reserving half leaves less of the window for input.

Collaborator Author


Does setting the max like that encourage models to generate more? Honestly I'm not sure. I'd expect we'll end up with very similarly-sized events as if we had set it at 4096.

Collaborator


🤷 I don't know, I'm thinking about the reverse: setting half means the LLM API provider will error sooner, because I think it adds that value to the input token count at the time of the request.

At least, I'm pretty sure Anthropic and OpenAI do that, and I thought the error message suggested it... I could be wrong though

Collaborator Author


The relevant error message here is:

You passed 2468 input characters and requested 262144 output tokens.
However, the model's context length is only 262144 tokens, resulting in
a maximum input length of 0 tokens (at most 0 characters). Please reduce
the length of the input prompt.

Maybe the "requested" suggests that behavior? In which case this is probably worth escalating to LiteLLM, considering it's their registry that sets the output tokens the way it is.
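The arithmetic behind that error is simple if the provider does reserve the requested output tokens up front, as the message implies. A minimal sketch, assuming that behavior (the function name is illustrative):

```python
def available_input_tokens(context_window: int, requested_output: int) -> int:
    """If the provider reserves the requested output tokens up front,
    the input budget is whatever remains of the context window."""
    return max(context_window - requested_output, 0)
```

With the registry's Nemotron values (window 262144, requested output 262144), this yields an input budget of exactly 0 tokens, matching the provider's "maximum input length of 0 tokens" message.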

Collaborator

@juanmichelini left a comment


Tested and LGTM

csmith49 merged commit f5fcef8 into main Apr 7, 2026
31 of 32 checks passed
csmith49 deleted the fix/nemotron-max-output-tokens-headroom branch April 7, 2026 19:20