feat: Support a dynamic default max_tokens for VLLM backend #4156
base: main
Conversation
👋 Hi flpanbin! Thank you for contributing to ai-dynamo/dynamo.
82f817a to 54735f3 (Compare)
Walkthrough: Added an optional model_max_len parameter used to compute a dynamic default for max_tokens when a request omits it.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/src/dynamo/vllm/handlers.py (7 hunks), components/src/dynamo/vllm/main.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
components/src/dynamo/vllm/handlers.py
📚 Learning: 2025-06-08T08:28:20.100Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1409
File: examples/router_standalone/router.py:113-118
Timestamp: 2025-06-08T08:28:20.100Z
Learning: In vLLM, TokensPrompt objects support dictionary-style access (e.g., prompt["prompt_token_ids"]) rather than attribute access (e.g., prompt.prompt_token_ids). The dictionary-style access is the correct way to extract prompt_token_ids from TokensPrompt objects. Attempting to use attribute access (prompt.prompt_token_ids) will result in an error.
Applied to files:
components/src/dynamo/vllm/handlers.py
🪛 Ruff (0.14.3)
components/src/dynamo/vllm/handlers.py
65-66: try-except-pass detected, consider logging the exception
(S110)
65-65: Do not catch blind exception: Exception
(BLE001)
```python
# If max_tokens wasn't provided (None or missing), compute a dynamic default
try:
    provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
    token_ids = request.get("token_ids", [])
    input_length = len(token_ids)
    if (
        model_max_len is not None
        and (provided_max_tokens is None)
    ):
        # Ensure at least 1 token generation by default when possible
        dynamic_default = max(1, model_max_len - input_length)
        sampling_params.max_tokens = dynamic_default
except Exception:
    pass
```
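The Ruff findings above (S110, BLE001) flag the silent broad `except`. A minimal sketch of one way to narrow and log it, assuming a module-level logger (the logger setup and the specific exception types caught here are assumptions, not part of the diff):

```python
import logging

logger = logging.getLogger(__name__)  # assumed; handlers.py may already define a logger

try:
    provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
    input_length = len(request.get("token_ids", []))
    if model_max_len is not None and provided_max_tokens is None:
        # Ensure at least 1 token can be generated by default when possible
        sampling_params.max_tokens = max(1, model_max_len - input_length)
except (TypeError, AttributeError) as exc:
    # Log instead of swallowing silently, and avoid catching a blind Exception
    logger.warning("Failed to compute dynamic default max_tokens: %s", exc)
```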
Default should not overwrite explicit max_tokens.
The new defaulting path only checks stop_conditions["max_tokens"]. If a caller sets max_tokens through sampling_options (which this helper already supports), provided_max_tokens stays None and we overwrite the caller’s explicit value, contrary to the comment “If max_tokens wasn't provided.” Please gate the dynamic fallback on both stop_conditions and sampling_options, or on the fact that sampling_params.max_tokens is still None, before setting the derived default. That keeps the new behavior from clobbering intentional overrides.
🧰 Tools
🪛 Ruff (0.14.3)
65-65: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In components/src/dynamo/vllm/handlers.py around lines 53 to 65, the dynamic
default for max_tokens only checks stop_conditions["max_tokens"] and can
overwrite an explicit max_tokens set via sampling_options; change the guard so
we only compute and assign dynamic_default when sampling_params.max_tokens is
still None (and stop_conditions doesn't provide one) — i.e., check both
request.get("stop_conditions",{}).get("max_tokens") is None AND
sampling_params.max_tokens is None before setting sampling_params.max_tokens to
max(1, model_max_len - input_length); keep the existing
input_length/model_max_len logic and preserve current exception handling.
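A minimal sketch of the guard this suggestion describes, written as it would sit inside build_sampling_params (it assumes sampling_params.max_tokens may already have been set earlier in the helper from sampling_options or defaults):

```python
provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
input_length = len(request.get("token_ids", []))

# Only fall back to the dynamic default when neither stop_conditions nor an
# earlier assignment on sampling_params supplied max_tokens.
if (
    model_max_len is not None
    and provided_max_tokens is None
    and sampling_params.max_tokens is None
):
    sampling_params.max_tokens = max(1, model_max_len - input_length)
```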
Based on my observation, when a user sets max_tokens, it is only added to stop_conditions (StopConditions) and not to sampling_options (SamplingOptions). Therefore, I think we only need to check stop_conditions["max_tokens"] to determine whether max_tokens was explicitly provided, not sampling_options. This should prevent the default from overwriting an explicit user setting.
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Fetch the Rust source file to examine the struct definitions
cat lib/llm/src/protocols/common.rs | head -n 350 | tail -n +200
```
Length of output: 5792
@flpanbin, you're absolutely right! I've verified the Rust struct definitions in lib/llm/src/protocols/common.rs, and I can confirm:
- StopConditions includes max_tokens as a field
- SamplingOptions does not include max_tokens (it contains fields like temperature, top_p, top_k, presence_penalty, etc.)
Your implementation is correct. Since max_tokens only comes through stop_conditions in the protocol definition, checking request.get("stop_conditions", {}).get("max_tokens", None) is the proper way to determine whether a user explicitly provided this parameter. The dynamic default will only apply when the user hasn't set max_tokens, which is exactly the intended behavior.
Thank you for the clarification and for pointing to the source definitions! 🐰
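For illustration, a hypothetical request payload shaped the way the discussion above describes (the field values are made up): max_tokens travels in stop_conditions, while sampling_options carries only the sampling knobs.

```python
request = {
    "token_ids": [101, 2023, 2003, 1037, 3231, 102],          # hypothetical prompt tokens
    "sampling_options": {"temperature": 0.7, "top_p": 0.9},   # no max_tokens here
    "stop_conditions": {"max_tokens": None},                  # omitted -> dynamic default applies
}
```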
🧬 Code Graph Analysis Results
- lib/llm/src/kv_router/prefill_router.rs (lines 207-256)

```rust
async fn generate(
    &self,
    request: SingleIn,
    next: ServerStreamingEngine<PreprocessedRequest, Annotated>,
) -> Result<ManyOut<Annotated>> {
    // Extract request data while preserving context
    let (req, context) = request.into_parts();
    let request_id = context.id().to_string();

    // Save original max_tokens for decode
    let original_max_tokens = req.stop_conditions.max_tokens;

    // Prepare prefill request with max_tokens = 1
    let mut prefill_req = req.clone();
    prefill_req.stop_conditions.max_tokens = Some(1);
    let prefill_context = Context::with_id(prefill_req, request_id.clone());

    // Link the prefill context as a child so that kill signals propagate
    context.controller().link_child(prefill_context.context());
    let prefill_request = prefill_context;

    // Attempt prefill and handle results
    match self.call_prefill(prefill_request).await {
        Ok(disaggregated_params) => {
            tracing::debug!("Prefill succeeded, using disaggregated params for decode");

            // Update request with disaggregated_params and router config
            let mut decode_req = req;
            decode_req.disaggregated_params = Some(disaggregated_params);

            // Restore original max_tokens for decode
            decode_req.stop_conditions.max_tokens = original_max_tokens;

            // Set router_config_override for decode: overlap_score_weight = 0
            let existing_override = decode_req.router_config_override.take();
            decode_req.router_config_override = Some(RouterConfigOverride {
                overlap_score_weight: Some(0.0),
                ..existing_override.unwrap_or_default()
            });

            // Map the modified request through with preserved context
            let decode_request = context.map(|_| decode_req);
            next.generate(decode_request).await
        }
        Err(e) => {
            tracing::debug!(error = %e, "Remote prefill failed, falling back to decode-only");
            next.generate(context.map(|_| req)).await
        }
    }
}
```
- components/src/dynamo/vllm/handlers.py (summary; a condensed sketch follows this list)
- Function: build_sampling_params(request: Dict[str, Any], default_sampling_params: Dict[str, Any], model_max_len: int | None = None) -> SamplingParams
- Purpose: Construct a SamplingParams object from a PreprocessedRequest, applying user-provided sampling_options and stop_conditions.
- Key behavior:
- Initialize SamplingParams from default_sampling_params and set detokenize to False.
- Apply sampling_options: for each key/value in request["sampling_options"], if value is not None and SamplingParams has that attribute, set it.
- Apply stop_conditions: for each key/value in request["stop_conditions"], if value is not None and SamplingParams has that attribute, set it; explicitly skip the key "stop" (comment notes that stop conditions are handled directly by Dynamo).
- Dynamic default for max_tokens: if model_max_len is provided and request["stop_conditions"].get("max_tokens") is None, calculate a dynamic default as max(1, model_max_len - input_length) where input_length is len(request.get("token_ids", [])); set sampling_params.max_tokens to this dynamic_default.
- Returns: A SamplingParams object configured from the request.
- Notes: Includes a try/except to swallow any exceptions during the dynamic-default calculation.
- File: components/src/dynamo/vllm/engine_monitor.py (context)
- While not a direct code change for max_tokens, this file is relevant for understanding how the runtime monitors engine health and may influence configuration flow, which is tangentially related to how generation parameters (including max_tokens) are validated and used at runtime.
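Putting the handlers.py summary above together, a condensed sketch of what build_sampling_params is described as doing. This is not the exact upstream implementation: the construction of SamplingParams from default_sampling_params and the exact attribute handling are assumptions based on the summary.

```python
from typing import Any, Dict

from vllm import SamplingParams  # vLLM's sampling parameters class


def build_sampling_params(
    request: Dict[str, Any],
    default_sampling_params: Dict[str, Any],
    model_max_len: int | None = None,
) -> SamplingParams:
    sampling_params = SamplingParams(**default_sampling_params)
    sampling_params.detokenize = False

    # Apply user-provided sampling options when SamplingParams has the field
    for key, value in request.get("sampling_options", {}).items():
        if value is not None and hasattr(sampling_params, key):
            setattr(sampling_params, key, value)

    # Apply stop conditions, except "stop" itself (handled directly by Dynamo)
    for key, value in request.get("stop_conditions", {}).items():
        if key == "stop":
            continue
        if value is not None and hasattr(sampling_params, key):
            setattr(sampling_params, key, value)

    # Dynamic default for max_tokens when the request omitted it
    try:
        if (
            model_max_len is not None
            and request.get("stop_conditions", {}).get("max_tokens") is None
        ):
            input_length = len(request.get("token_ids", []))
            sampling_params.max_tokens = max(1, model_max_len - input_length)
    except Exception:
        pass  # the summary notes exceptions here are swallowed

    return sampling_params
```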
🧠 Learnings used
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
3996f18 to 0b2e0cb (Compare)
Signed-off-by: bin <[email protected]>
0b2e0cb to 5cb6890 (Compare)
/ok to test 5cb6890
Overview:
This PR introduces a dynamic default max_tokens for the vLLM backend when the request doesn't specify a value.
The new default is calculated as max(1, model_max_len - input_length), ensuring token generation respects the model's maximum sequence length while accounting for the current input length.
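For example (the numbers below are purely illustrative): with a model_max_len of 4096 and a 1000-token prompt, the default becomes 3096, and the formula keeps a floor of 1 token even when the prompt already fills the context window.

```python
>>> model_max_len, input_length = 4096, 1000   # illustrative values
>>> max(1, model_max_len - input_length)
3096
>>> max(1, 4096 - 5000)   # prompt exceeds the window -> floor of 1
1
```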
Details:
Added logic to compute max_tokens dynamically using the model's maximum sequence length and the current input length when the request omits max_tokens
Modified build_sampling_params to accept model_max_len and implement the dynamic default logic
Updated handler constructors to pass the model’s maximum length configuration
Where should the reviewer start?
components/src/dynamo/vllm/handlers.py: Focus on build_sampling_params changes and handler modifications
components/src/dynamo/vllm/main.py: Check how model_max_len is passed to handler constructors
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)