Conversation
…e context window When litellm's model registry reports max_output_tokens >= max_input_tokens (e.g. Nemotron: both 262144), the SDK would request the entire context window for output, leaving zero tokens for input. Every provider call was rejected, the condenser misinterpreted this as context overflow, and crashed on the near-empty history with NoCondensationAvailableException. Cap auto-detected max_output_tokens to half the context window when it would otherwise consume the full window. Explicitly user-set values are not affected. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSED
REST API breakage checks (OpenAPI) — ✅ PASSED
all-hands-bot left a comment
🟢 Good taste - Pragmatic fix for broken model registry data.
Analysis:
This solves a real problem: when model registry reports max_output_tokens >= max_input_tokens (Nemotron: both 262144), every LLM call fails because the entire context window is reserved for output, leaving zero room for input.
The fix is minimal and pragmatic: cap auto-detected values to half the context window. This is consistent with existing max_tokens handling (line 1227 already does // 2).
Verdict: ✅ Worth merging - solves a real bug without over-engineering.
Important: This PR affects LLM call behavior and condenser behavior (mentioned in description), which puts it in the eval risk category. A human maintainer should verify via lightweight evals before merging. Using COMMENT review per repo guidelines rather than APPROVE.
Coverage Report
```python
    and self.max_output_tokens is not None
    and self.max_output_tokens >= context_window
):
    capped = self.max_output_tokens // 2
```
I think that's why we sometimes had 4096 or something like that: output tokens typically aren't all that much in a single call. This works though! 🤔
It just means the history will be smaller when it hits a context error than if we set some value like 4096, because half the window is more.
Does setting the max like that encourage models to generate more? Honestly I'm not sure. I'd expect we'll end up with very similarly-sized events as if we had set it at 4096.
🤷 I don't know. I'm thinking about the reverse: setting it to half means the LLM API provider will error sooner, because I think it adds that value to the prompt, i.e. to the input tokens at the time of the request.
At least, I'm pretty sure Anthropic and OpenAI do that, and I thought the error message suggested it... I could be wrong though.
The relevant error message here is:

```
You passed 2468 input characters and requested 262144 output tokens.
However, the model's context length is only 262144 tokens, resulting in
a maximum input length of 0 tokens (at most 0 characters). Please reduce
the length of the input prompt.
```
Maybe the "requested" suggests that behavior? In which case this is probably worth escalating to LiteLLM, considering it's their registry that sets the output tokens the way it is.
What/Why
When litellm's model registry reports max_output_tokens >= max_input_tokens (e.g. Nemotron: both 262144), the SDK would request the entire context window for output, leaving zero tokens for input. Every provider call was rejected, the condenser misinterpreted this as context overflow, and crashed on the near-empty history with NoCondensationAvailableException.
Cap auto-detected max_output_tokens to half the context window when it would otherwise consume the full window. Explicitly user-set values are not affected.
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.13-nodejs22-slim`
- `golang:1.21-bookworm`

Pull (multi-arch manifest):

```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3d9e6da-python
```
All tags pushed for this build
About Multi-Architecture Support
Versioned tags (e.g. `3d9e6da-python`) are multi-arch manifests supporting both amd64 and arm64; architecture-specific tags (e.g. `3d9e6da-python-amd64`) are also available if needed.