feat: Support a dynamic default max_tokens for VLLM backend #4156
base: main
Conversation
👋 Hi flpanbin! Thank you for contributing to ai-dynamo/dynamo.
82f817a to 54735f3 (Compare)
Walkthrough: Added an optional model_max_len parameter used to compute a dynamic default for max_tokens when a request omits it.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/src/dynamo/vllm/handlers.py (7 hunks), components/src/dynamo/vllm/main.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
components/src/dynamo/vllm/handlers.py
📚 Learning: 2025-06-08T08:28:20.100Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1409
File: examples/router_standalone/router.py:113-118
Timestamp: 2025-06-08T08:28:20.100Z
Learning: In vLLM, TokensPrompt objects support dictionary-style access (e.g., prompt["prompt_token_ids"]) rather than attribute access (e.g., prompt.prompt_token_ids). The dictionary-style access is the correct way to extract prompt_token_ids from TokensPrompt objects. Attempting to use attribute access (prompt.prompt_token_ids) will result in an error.
Applied to files:
components/src/dynamo/vllm/handlers.py
🪛 Ruff (0.14.3)
components/src/dynamo/vllm/handlers.py
65-66: try-except-pass detected, consider logging the exception
(S110)
65-65: Do not catch blind exception: Exception
(BLE001)
```python
# If max_tokens wasn't provided (None or missing), compute a dynamic default
try:
    provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
    token_ids = request.get("token_ids", [])
    input_length = len(token_ids)
    if (
        model_max_len is not None
        and (provided_max_tokens is None)
    ):
        # Ensure at least 1 token generation by default when possible
        dynamic_default = max(1, model_max_len - input_length)
        sampling_params.max_tokens = dynamic_default
except Exception:
    pass
```
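The Ruff findings above (S110, BLE001) flag the silent broad `except`. A minimal sketch of one way to narrow and log it, assuming a module-level logger (the logger setup and the specific exception types caught here are assumptions, not part of the diff):

```python
import logging

logger = logging.getLogger(__name__)  # assumed; handlers.py may already define a logger

try:
    provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
    input_length = len(request.get("token_ids", []))
    if model_max_len is not None and provided_max_tokens is None:
        # Ensure at least 1 token can be generated by default when possible
        sampling_params.max_tokens = max(1, model_max_len - input_length)
except (TypeError, AttributeError) as exc:
    # Log instead of swallowing silently, and avoid catching a blind Exception
    logger.warning("Failed to compute dynamic default max_tokens: %s", exc)
```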
Default should not overwrite explicit max_tokens.
The new defaulting path only checks stop_conditions["max_tokens"]. If a caller sets max_tokens through sampling_options (which this helper already supports), provided_max_tokens stays None and we overwrite the caller’s explicit value, contrary to the comment “If max_tokens wasn't provided.” Please gate the dynamic fallback on both stop_conditions and sampling_options, or on the fact that sampling_params.max_tokens is still None, before setting the derived default. That keeps the new behavior from clobbering intentional overrides.
🧰 Tools
🪛 Ruff (0.14.3)
65-65: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In components/src/dynamo/vllm/handlers.py around lines 53 to 65, the dynamic
default for max_tokens only checks stop_conditions["max_tokens"] and can
overwrite an explicit max_tokens set via sampling_options; change the guard so
we only compute and assign dynamic_default when sampling_params.max_tokens is
still None (and stop_conditions doesn't provide one) — i.e., check both
request.get("stop_conditions",{}).get("max_tokens") is None AND
sampling_params.max_tokens is None before setting sampling_params.max_tokens to
max(1, model_max_len - input_length); keep the existing
input_length/model_max_len logic and preserve current exception handling.
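A minimal sketch of the guard this suggestion describes, written as it would sit inside build_sampling_params (it assumes sampling_params.max_tokens may already have been set earlier in the helper from sampling_options or defaults):

```python
provided_max_tokens = request.get("stop_conditions", {}).get("max_tokens", None)
input_length = len(request.get("token_ids", []))

# Only fall back to the dynamic default when neither stop_conditions nor an
# earlier assignment on sampling_params supplied max_tokens.
if (
    model_max_len is not None
    and provided_max_tokens is None
    and sampling_params.max_tokens is None
):
    sampling_params.max_tokens = max(1, model_max_len - input_length)
```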
Based on my observation, when a user sets max_tokens, it is only added to stop_conditions (StopConditions) and not to sampling_options (SamplingOptions). Therefore, I think we only need to check stop_conditions["max_tokens"] to determine whether max_tokens was explicitly provided, not sampling_options. This should prevent the default from overwriting an explicit user setting.
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Fetch the Rust source file to examine the struct definitions
cat lib/llm/src/protocols/common.rs | head -n 350 | tail -n +200
```
Length of output: 5792
@flpanbin, you're absolutely right! I've verified the Rust struct definitions in lib/llm/src/protocols/common.rs, and I can confirm:
- StopConditions includes max_tokens as a field
- SamplingOptions does not include max_tokens (it contains fields like temperature, top_p, top_k, presence_penalty, etc.)
Your implementation is correct. Since max_tokens only comes through stop_conditions in the protocol definition, checking request.get("stop_conditions", {}).get("max_tokens", None) is the proper way to determine whether a user explicitly provided this parameter. The dynamic default will only apply when the user hasn't set max_tokens, which is exactly the intended behavior.
Thank you for the clarification and for pointing to the source definitions! 🐰
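For illustration, a hypothetical request payload shaped the way the discussion above describes (the field values are made up): max_tokens travels in stop_conditions, while sampling_options carries only the sampling knobs.

```python
request = {
    "token_ids": [101, 2023, 2003, 1037, 3231, 102],          # hypothetical prompt tokens
    "sampling_options": {"temperature": 0.7, "top_p": 0.9},   # no max_tokens here
    "stop_conditions": {"max_tokens": None},                  # omitted -> dynamic default applies
}
```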
🧬 Code Graph Analysis Results
- lib/llm/src/kv_router/prefill_router.rs (lines 207-256)

```rust
async fn generate(
    &self,
    request: SingleIn,
    next: ServerStreamingEngine<PreprocessedRequest, Annotated>,
) -> Result<ManyOut<Annotated>> {
    // Extract request data while preserving context
    let (req, context) = request.into_parts();
    let request_id = context.id().to_string();

    // Save original max_tokens for decode
    let original_max_tokens = req.stop_conditions.max_tokens;

    // Prepare prefill request with max_tokens = 1
    let mut prefill_req = req.clone();
    prefill_req.stop_conditions.max_tokens = Some(1);
    let prefill_context = Context::with_id(prefill_req, request_id.clone());

    // Link the prefill context as a child so that kill signals propagate
    context.controller().link_child(prefill_context.context());
    let prefill_request = prefill_context;

    // Attempt prefill and handle results
    match self.call_prefill(prefill_request).await {
        Ok(disaggregated_params) => {
            tracing::debug!("Prefill succeeded, using disaggregated params for decode");

            // Update request with disaggregated_params and router config
            let mut decode_req = req;
            decode_req.disaggregated_params = Some(disaggregated_params);

            // Restore original max_tokens for decode
            decode_req.stop_conditions.max_tokens = original_max_tokens;

            // Set router_config_override for decode: overlap_score_weight = 0
            let existing_override = decode_req.router_config_override.take();
            decode_req.router_config_override = Some(RouterConfigOverride {
                overlap_score_weight: Some(0.0),
                ..existing_override.unwrap_or_default()
            });

            // Map the modified request through with preserved context
            let decode_request = context.map(|_| decode_req);
            next.generate(decode_request).await
        }
        Err(e) => {
            tracing::debug!(error = %e, "Remote prefill failed, falling back to decode-only");
            next.generate(context.map(|_| req)).await
        }
    }
}
```
- components/src/dynamo/vllm/handlers.py (summary; a condensed sketch follows this list)
- Function: build_sampling_params(request: Dict[str, Any], default_sampling_params: Dict[str, Any], model_max_len: int | None = None) -> SamplingParams
- Purpose: Construct a SamplingParams object from a PreprocessedRequest, applying user-provided sampling_options and stop_conditions.
- Key behavior:
- Initialize SamplingParams from default_sampling_params and set detokenize to False.
- Apply sampling_options: for each key/value in request["sampling_options"], if value is not None and SamplingParams has that attribute, set it.
- Apply stop_conditions: for each key/value in request["stop_conditions"], if value is not None and SamplingParams has that attribute, set it; explicitly skip the key "stop" (comment notes that stop conditions are handled directly by Dynamo).
- Dynamic default for max_tokens: if model_max_len is provided and request["stop_conditions"].get("max_tokens") is None, calculate a dynamic default as max(1, model_max_len - input_length) where input_length is len(request.get("token_ids", [])); set sampling_params.max_tokens to this dynamic_default.
- Returns: A SamplingParams object configured from the request.
- Notes: Includes a try/except to swallow any exceptions during the dynamic-default calculation.
- File: components/src/dynamo/vllm/engine_monitor.py (context)
- While not a direct code change for max_tokens, this file is relevant for understanding how the runtime monitors engine health and may influence configuration flow, which is tangentially related to how generation parameters (including max_tokens) are validated and used at runtime.
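Putting the handlers.py summary above together, a condensed sketch of what build_sampling_params is described as doing. This is not the exact upstream implementation: the construction of SamplingParams from default_sampling_params and the exact attribute handling are assumptions based on the summary.

```python
from typing import Any, Dict

from vllm import SamplingParams  # vLLM's sampling parameters class


def build_sampling_params(
    request: Dict[str, Any],
    default_sampling_params: Dict[str, Any],
    model_max_len: int | None = None,
) -> SamplingParams:
    sampling_params = SamplingParams(**default_sampling_params)
    sampling_params.detokenize = False

    # Apply user-provided sampling options when SamplingParams has the field
    for key, value in request.get("sampling_options", {}).items():
        if value is not None and hasattr(sampling_params, key):
            setattr(sampling_params, key, value)

    # Apply stop conditions, except "stop" itself (handled directly by Dynamo)
    for key, value in request.get("stop_conditions", {}).items():
        if key == "stop":
            continue
        if value is not None and hasattr(sampling_params, key):
            setattr(sampling_params, key, value)

    # Dynamic default for max_tokens when the request omitted it
    try:
        if (
            model_max_len is not None
            and request.get("stop_conditions", {}).get("max_tokens") is None
        ):
            input_length = len(request.get("token_ids", []))
            sampling_params.max_tokens = max(1, model_max_len - input_length)
    except Exception:
        pass  # the summary notes exceptions here are swallowed

    return sampling_params
```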
🧠 Learnings used
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
3996f18 to 0b2e0cb (Compare)
Signed-off-by: bin <[email protected]>
0b2e0cb to 5cb6890 (Compare)
/ok to test 5cb6890
Overview:
This PR introduces a dynamic default max_tokens for the vLLM backend when the request doesn't specify a value.
The new default is calculated as max(1, model_max_len - input_length), ensuring token generation respects the model's maximum sequence length while accounting for the current input length.
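For example (the numbers below are purely illustrative): with a model_max_len of 4096 and a 1000-token prompt, the default becomes 3096, and the formula keeps a floor of 1 token even when the prompt already fills the context window.

```python
>>> model_max_len, input_length = 4096, 1000   # illustrative values
>>> max(1, model_max_len - input_length)
3096
>>> max(1, 4096 - 5000)   # prompt exceeds the window -> floor of 1
1
```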
Details:
Added logic to compute max_tokens dynamically using the model's maximum sequence length and the current input length when the request omits max_tokens
Modified build_sampling_params to accept model_max_len and implement the dynamic default logic
Updated handler constructors to pass the model’s maximum length configuration
Where should the reviewer start?
components/src/dynamo/vllm/handlers.py: Focus on build_sampling_params changes and handler modifications
components/src/dynamo/vllm/main.py: Check how model_max_len is passed to handler constructors
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)