Conversation


@flpanbin flpanbin commented Nov 6, 2025

Overview:

This PR introduces a dynamic default max_tokens for the VLLM backend when the request doesn't specify a value.

The new default is calculated as max(1, model_max_len - input_length), ensuring token generation respects the model's maximum sequence length while accounting for the current input length.
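
For illustration, a minimal sketch of the defaulting rule (the helper name and standalone form are hypothetical; the actual change lives inside build_sampling_params, quoted later in the review):

def default_max_tokens(model_max_len: int, input_token_ids: list[int]) -> int:
    """Hypothetical helper: dynamic default when the request omits max_tokens."""
    input_length = len(input_token_ids)
    # Leave room for the prompt, but always allow at least one generated token.
    return max(1, model_max_len - input_length)

# Example: a 4096-token model with a 4000-token prompt defaults to 96 output tokens.
assert default_max_tokens(4096, [0] * 4000) == 96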

Details:

  • Added logic to compute max_tokens dynamically using the model's maximum sequence length and the current input length when the request omits max_tokens

  • Modified build_sampling_params to accept model_max_len and implement the dynamic default logic

  • Updated handler constructors to pass the model’s maximum length configuration

Where should the reviewer start?

  • components/src/dynamo/vllm/handlers.py: Focus on build_sampling_params changes and handler modifications

  • components/src/dynamo/vllm/main.py: Check how model_max_len is passed to handler constructors

Related Issues: Closes #3876

Summary by CodeRabbit

  • Improvements
    • Output token limits are now computed dynamically from model capacity and input length, improving token allocation and adherence to model constraints.
  • Reliability
    • Sampling parameter computation is more robust—fallback logic prevents failures during default calculation, reducing run-time errors and unexpected truncation.

@flpanbin flpanbin requested review from a team as code owners November 6, 2025 14:53

copy-pr-bot bot commented Nov 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions bot commented Nov 6, 2025

👋 Hi flpanbin! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution (Pull request is from an external contributor) label Nov 6, 2025
@flpanbin flpanbin force-pushed the optimize-max-tokens branch from 82f817a to 54735f3 on November 6, 2025 at 14:56
@flpanbin flpanbin changed the title Support a dynamic default max_tokens for VLLM backend feat: support a dynamic default max_tokens for VLLM backend Nov 6, 2025
@github-actions github-actions bot added the feat label Nov 6, 2025

coderabbitai bot commented Nov 6, 2025

Walkthrough

Added an optional model_max_len parameter to build_sampling_params and propagated it through BaseWorkerHandler, DecodeWorkerHandler, and PrefillWorkerHandler constructors; main.py now passes the model max length into those handlers.

Changes

  • Handler implementation updates (components/src/dynamo/vllm/handlers.py): Added an optional model_max_len argument to build_sampling_params and used it to compute a guarded dynamic default for SamplingParams.max_tokens (ensuring a minimum of 1); extended BaseWorkerHandler.__init__ to accept and store model_max_len; updated the DecodeWorkerHandler and PrefillWorkerHandler constructors to accept and forward model_max_len; updated all local build_sampling_params(...) call sites to pass the new parameter.
  • Handler instantiation (components/src/dynamo/vllm/main.py): Updated the calls that initialize PrefillWorkerHandler and DecodeWorkerHandler to pass vllm_config.model_config.max_model_len (as max_model_len / model_max_len) into the handler constructors.
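
As a rough sketch of that plumbing (the handler class is stubbed here; only the model_max_len / max_model_len hand-off reflects the summary above, and the real constructors take additional arguments):

class DecodeWorkerHandler:  # stub standing in for the real handler class
    def __init__(self, model_max_len: int | None = None, **kwargs):
        # Stored so build_sampling_params can later derive the default max_tokens.
        self.model_max_len = model_max_len

class ModelConfig:  # stand-in for vllm_config.model_config
    max_model_len = 4096

handler = DecodeWorkerHandler(model_max_len=ModelConfig.max_model_len)
assert handler.model_max_len == 4096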

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas to inspect closely:
    • The guarded dynamic-default computation in build_sampling_params (try/except behavior and correctness when model_max_len or input length are missing or malformed).
    • All updated call sites to ensure the correct value is passed and named consistently (max_model_len vs model_max_len).
    • Tests or runtime paths where model_max_len is None to verify the fallback behavior (ensuring max_tokens remains valid).

Poem

🐰 I nibble at tokens, count them with care,
A model's full length now floats in the air.
If inputs grow long, I trim with delight,
Ensuring at least one hop through the night. ✨

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly and concisely describes the main change: adding support for a dynamic default max_tokens in the VLLM backend.
  • Description check: ✅ Passed. The description follows the template structure with all required sections completed: Overview explains the feature, Details describe the changes, "Where should the reviewer start?" provides specific files, and Related Issues references the closed GitHub issue #3876.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@flpanbin flpanbin changed the title feat: support a dynamic default max_tokens for VLLM backend feat: Support a dynamic default max_tokens for VLLM backend Nov 6, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a629b86 and 54735f3.

📒 Files selected for processing (2)
  • components/src/dynamo/vllm/handlers.py (7 hunks)
  • components/src/dynamo/vllm/main.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

  • components/src/dynamo/vllm/handlers.py
📚 Learning: 2025-06-08T08:28:20.100Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1409
File: examples/router_standalone/router.py:113-118
Timestamp: 2025-06-08T08:28:20.100Z
Learning: In vLLM, TokensPrompt objects support dictionary-style access (e.g., prompt["prompt_token_ids"]) rather than attribute access (e.g., prompt.prompt_token_ids). The dictionary-style access is the correct way to extract prompt_token_ids from TokensPrompt objects.

Applied to files:

  • components/src/dynamo/vllm/handlers.py
📚 Learning: 2025-06-08T08:28:20.100Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1409
File: examples/router_standalone/router.py:113-118
Timestamp: 2025-06-08T08:28:20.100Z
Learning: In vLLM, TokensPrompt objects support dictionary-style access (e.g., prompt["prompt_token_ids"]) rather than attribute access (e.g., prompt.prompt_token_ids). The dictionary-style access is the correct way to extract prompt_token_ids from TokensPrompt objects. Attempting to use attribute access (prompt.prompt_token_ids) will result in an error.

Applied to files:

  • components/src/dynamo/vllm/handlers.py
🪛 Ruff (0.14.3)
components/src/dynamo/vllm/handlers.py

65-66: try-except-pass detected, consider logging the exception

(S110)


65-65: Do not catch blind exception: Exception

(BLE001)

Comment on lines 53 to 72
# If max_tokens wasn't provided (None or missing), compute a dynamic default
try:
    provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
    token_ids = request.get("token_ids", [])
    input_length = len(token_ids)
    if (
        model_max_len is not None
        and (provided_max_tokens is None)
    ):
        # Ensure at least 1 token generation by default when possible
        dynamic_default = max(1, model_max_len - input_length)
        sampling_params.max_tokens = dynamic_default
except Exception:

@coderabbitai coderabbitai bot commented Nov 6, 2025

⚠️ Potential issue | 🟠 Major

Default should not overwrite explicit max_tokens.

The new defaulting path only checks stop_conditions["max_tokens"]. If a caller sets max_tokens through sampling_options (which this helper already supports), provided_max_tokens stays None and we overwrite the caller’s explicit value, contrary to the comment “If max_tokens wasn't provided.” Please gate the dynamic fallback on both stop_conditions and sampling_options, or on the fact that sampling_params.max_tokens is still None, before setting the derived default. That keeps the new behavior from clobbering intentional overrides.
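
A sketch of the gating the review suggests, expressed as a standalone function for clarity (this is not the merged code; as the author's reply below argues, the extra SamplingParams check may be unnecessary given the protocol definitions):

from typing import Any

def apply_dynamic_max_tokens(sampling_params: Any, request: dict, model_max_len: int | None) -> None:
    """Only apply the dynamic default when no explicit max_tokens is present anywhere."""
    provided = request.get("stop_conditions", {}).get("max_tokens")
    if model_max_len is None or provided is not None:
        return
    # Extra guard proposed in the review: respect a value already set on SamplingParams
    # (for example via sampling_options) instead of overwriting it.
    if sampling_params.max_tokens is not None:
        return
    input_length = len(request.get("token_ids", []))
    sampling_params.max_tokens = max(1, model_max_len - input_length)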

🧰 Tools
🪛 Ruff (0.14.3)

65-65: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In components/src/dynamo/vllm/handlers.py around lines 53 to 65, the dynamic
default for max_tokens only checks stop_conditions["max_tokens"] and can
overwrite an explicit max_tokens set via sampling_options; change the guard so
we only compute and assign dynamic_default when sampling_params.max_tokens is
still None (and stop_conditions doesn't provide one) — i.e., check both
request.get("stop_conditions",{}).get("max_tokens") is None AND
sampling_params.max_tokens is None before setting sampling_params.max_tokens to
max(1, model_max_len - input_length); keep the existing
input_length/model_max_len logic and preserve current exception handling.

@flpanbin (Author) commented:

Based on my observation, when a user sets max_tokens, it is only added to stop_conditions (StopConditions) and not to sampling_options (SamplingOptions). Therefore, I think we only need to check stop_conditions["max_tokens"] to determine whether max_tokens was explicitly provided, rather than sampling_options. This should prevent the default from overwriting an explicit user setting.

@coderabbitai coderabbitai bot (Contributor) replied:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the Rust source file to examine the struct definitions
cat lib/llm/src/protocols/common.rs | head -n 350 | tail -n +200

Length of output: 5792


@flpanbin, you're absolutely right! I've verified the Rust struct definitions in lib/llm/src/protocols/common.rs, and I can confirm:

  • StopConditions includes max_tokens as a field
  • SamplingOptions does not include max_tokens (it contains fields like temperature, top_p, top_k, presence_penalty, etc.)

Your implementation is correct. Since max_tokens only comes through stop_conditions in the protocol definition, checking request.get("stop_conditions", {}).get("max_tokens", None) is the proper way to determine whether a user explicitly provided this parameter. The dynamic default will only apply when the user hasn't set max_tokens, which is exactly the intended behavior.

Thank you for the clarification and for pointing to the source definitions! 🐰


🧬 Code Graph Analysis Results
  • lib/llm/src/kv_router/prefill_router.rs (lines 207-256)
    async fn generate(
        &self,
        request: SingleIn,
        next: ServerStreamingEngine<PreprocessedRequest, Annotated>,
    ) -> Result<ManyOut<Annotated>> {
        // Extract request data while preserving context
        let (req, context) = request.into_parts();
        let request_id = context.id().to_string();

        // Save original max_tokens for decode
        let original_max_tokens = req.stop_conditions.max_tokens;

        // Prepare prefill request with max_tokens = 1
        let mut prefill_req = req.clone();
        prefill_req.stop_conditions.max_tokens = Some(1);
        let prefill_context = Context::with_id(prefill_req, request_id.clone());

        // Link the prefill context as a child so that kill signals propagate
        context.controller().link_child(prefill_context.context());

        let prefill_request = prefill_context;

        // Attempt prefill and handle results
        match self.call_prefill(prefill_request).await {
            Ok(disaggregated_params) => {
                tracing::debug!("Prefill succeeded, using disaggregated params for decode");

                // Update request with disaggregated_params and router config
                let mut decode_req = req;
                decode_req.disaggregated_params = Some(disaggregated_params);
                // Restore original max_tokens for decode
                decode_req.stop_conditions.max_tokens = original_max_tokens;

                // Set router_config_override for decode: overlap_score_weight = 0
                let existing_override = decode_req.router_config_override.take();
                decode_req.router_config_override = Some(RouterConfigOverride {
                    overlap_score_weight: Some(0.0),
                    ..existing_override.unwrap_or_default()
                });

                // Map the modified request through with preserved context
                let decode_request = context.map(|_| decode_req);
                next.generate(decode_request).await
            }
            Err(e) => {
                tracing::debug!(error = %e, "Remote prefill failed, falling back to decode-only");
                next.generate(context.map(|_| req)).await
            }
        }
    }

  • components/src/dynamo/vllm/handlers.py (summary)

    • Function: build_sampling_params(request: Dict[str, Any], default_sampling_params: Dict[str, Any], model_max_len: int | None = None) -> SamplingParams
    • Purpose: Construct a SamplingParams object from a PreprocessedRequest, applying user-provided sampling_options and stop_conditions.
    • Key behavior:
      • Initialize SamplingParams from default_sampling_params and set detokenize to False.
      • Apply sampling_options: for each key/value in request["sampling_options"], if value is not None and SamplingParams has that attribute, set it.
      • Apply stop_conditions: for each key/value in request["stop_conditions"], if value is not None and SamplingParams has that attribute, set it; explicitly skip the key "stop" (comment notes that stop conditions are handled directly by Dynamo). Both application loops are sketched after this list.
      • Dynamic default for max_tokens: if model_max_len is provided and request["stop_conditions"].get("max_tokens") is None, calculate a dynamic default as max(1, model_max_len - input_length) where input_length is len(request.get("token_ids", [])); set sampling_params.max_tokens to this dynamic_default.
    • Returns: A SamplingParams object configured from the request.
    • Notes: Includes a try/except to swallow any exceptions during the dynamic-default calculation.
  • File: components/src/dynamo/vllm/engine_monitor.py (context)

    • While not a direct code change for max_tokens, this file is relevant for understanding how the runtime monitors engine health and may influence configuration flow, which is tangentially related to how generation parameters (including max_tokens) are validated and used at runtime.
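
As referenced in the handlers.py summary above, the two application loops it describes amount to roughly the following sketch (attribute names and structure are taken from the summary, not copied from the file):

from typing import Any

def apply_request_overrides(sampling_params: Any, request: dict) -> None:
    """Sketch of the per-key application of sampling_options and stop_conditions."""
    for key, value in (request.get("sampling_options") or {}).items():
        if value is not None and hasattr(sampling_params, key):
            setattr(sampling_params, key, value)
    for key, value in (request.get("stop_conditions") or {}).items():
        if key == "stop":  # stop sequences are handled directly by Dynamo
            continue
        if value is not None and hasattr(sampling_params, key):
            setattr(sampling_params, key, value)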

🧠 Learnings used
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

@flpanbin flpanbin force-pushed the optimize-max-tokens branch from 0b2e0cb to 5cb6890 on November 7, 2025 at 02:40

rmccorm4 commented Nov 7, 2025

/ok to test 5cb6890


Labels

external-contribution (Pull request is from an external contributor), feat, size/M
