feat: Add support for skip_special_tokens parameter in v1/completions and v1/chat/completions endpoints #4175
base: main
Conversation
@2ez4bz can you give me an example model + curl request that would reproduce this?

In the PR description I started a Qwen3 model with a reasoning parser, but didn't see any special tokens in the responses, so I don't have an easily noticeable effect to demonstrate when setting skip_special_tokens to true: `python -m dynamo.vllm --model Qwen/Qwen3-0.6B --dyn-reasoning-parser qwen3 --enforce-eager --connector none`

I was, however, able to see it working when set to true (due to your other bug about add_generation_prompt) on a tool-call example with a Llama model like so: `python -m dynamo.vllm --model meta-llama/Llama-3.1-8B-Instruct --dyn-tool-call-parser llama3_json --custom-jinja-template ~/dynamo/tool_calling/llama3/tool_chat_template_llama3.1_json.jinja --connector none --enforce-eager`

With the add_generation_prompt bug, this would output special tokens in the response text. However, I think those special tokens shouldn't show up in the output at all after #4168 is merged, which is why I'm looking for a different test case that doesn't rely on a bug to observe.
Walkthrough: A new optional skip_special_tokens field is added to the completions and chat completions request protocols and plumbed through to the tokenizer decoder.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- lib/llm/src/backend.rs (2 hunks)
- lib/llm/src/protocols/common.rs (0 hunks)
- lib/llm/src/protocols/openai/chat_completions.rs (3 hunks)
- lib/llm/src/protocols/openai/common_ext.rs (7 hunks)
- lib/llm/src/protocols/openai/completions.rs (3 hunks)
- lib/llm/tests/tokenizers.rs (1 hunk)
💤 Files with no reviewable changes (1)
- lib/llm/src/protocols/common.rs
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The skip_tokenizer_init=True path in SGLang backend bypasses tokenization but has array slicing overhead in _process_token_stream that creates O(n) memory copying on every stream chunk, potentially causing quadratic behavior for long sequences.
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/protocols/openai/common_ext.rs
📚 Learning: 2025-09-08T21:18:43.478Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2936
File: lib/parsers/src/reasoning/granite_parser.rs:42-46
Timestamp: 2025-09-08T21:18:43.478Z
Learning: In GraniteReasoningParser in lib/parsers/src/reasoning/granite_parser.rs, the think_start_tokens and think_end_tokens are hardcoded in the constructor with fixed values, so unwrap() calls on these vectors are safe and won't panic.
Applied to files:
lib/llm/src/protocols/openai/common_ext.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
lib/llm/src/protocols/openai/chat_completions.rslib/llm/src/protocols/openai/completions.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
lib/llm/src/protocols/openai/chat_completions.rs
📚 Learning: 2025-09-19T07:32:44.210Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The skip_tokenizer_init=True path in SGLang backend bypasses tokenization but has array slicing overhead in _process_token_stream that creates O(n) memory copying on every stream chunk, potentially causing quadratic behavior for long sequences.
Applied to files:
lib/llm/src/backend.rs
🧬 Code graph analysis (5)
lib/llm/tests/tokenizers.rs (2)
- lib/llm/src/tokenizers.rs (5): tokenizer (340-342), from_file (83-85), text (348-353), token_ids (38-43), token_ids (344-346)
- lib/llm/src/tokenizers/hf.rs (1): from_file (16-21)
lib/llm/src/protocols/openai/common_ext.rs (3)
- lib/llm/src/protocols/openai/completions.rs (2): get_skip_special_tokens (193-195), get_skip_special_tokens (370-372)
- lib/llm/src/protocols/openai/chat_completions.rs (2): get_skip_special_tokens (202-204), get_skip_special_tokens (269-271)
- lib/llm/src/protocols/openai.rs (1): get_skip_special_tokens (85-85)
lib/llm/src/protocols/openai/chat_completions.rs (2)
- lib/llm/src/protocols/openai/completions.rs (5): get_skip_special_tokens (193-195), get_skip_special_tokens (370-372), test_skip_special_tokens_none (422-438), test_skip_special_tokens_true_propagates (441-456), test_skip_special_tokens_false_propagates (459-474)
- lib/llm/src/protocols/openai/common_ext.rs (1): get_skip_special_tokens (112-112)
lib/llm/src/backend.rs (1)
- lib/llm/src/tokenizers.rs (4): tokenizer (340-342), new (176-191), new (268-275), new (495-503)
lib/llm/src/protocols/openai/completions.rs (3)
- lib/llm/src/protocols/openai/chat_completions.rs (5): get_skip_special_tokens (202-204), get_skip_special_tokens (269-271), test_skip_special_tokens_none (331-349), test_skip_special_tokens_true_propagates (352-369), test_skip_special_tokens_false_propagates (372-389)
- lib/llm/src/protocols/openai/common_ext.rs (1): get_skip_special_tokens (112-112)
- lib/llm/src/protocols/openai.rs (1): get_skip_special_tokens (85-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
lib/llm/tests/tokenizers.rs (1)
182-255: LGTM! Comprehensive test for skip_special_tokens. The test properly validates both decoding modes (with and without special tokens) and includes appropriate assertions for token markers, content preservation, and length comparison.
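For readers who want the gist of what the new test exercises without the project's wrapper types, here is a minimal standalone sketch assuming the Hugging Face `tokenizers` crate (roughly 0.15+, where `decode` takes `&[u32]`); the tokenizer file path and input text are placeholders, not taken from the PR.

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; any tokenizer.json that defines special tokens works.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode with add_special_tokens = true so markers like BOS/EOS are present.
    let encoding = tokenizer.encode("Hello, world!", true)?;
    let ids = encoding.get_ids();

    // skip_special_tokens = false keeps the markers in the decoded text.
    let with_special = tokenizer.decode(ids, false)?;
    // skip_special_tokens = true strips them.
    let without_special = tokenizer.decode(ids, true)?;

    // The stripped text should never be longer, and the content should survive.
    assert!(with_special.len() >= without_special.len());
    assert!(without_special.contains("Hello"));
    println!("with: {with_special}\nwithout: {without_special}");
    Ok(())
}
```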
lib/llm/src/protocols/openai/common_ext.rs (2)
111-112: LGTM! The trait method signature is consistent with other methods in CommonExtProvider.
238-284: LGTM! Comprehensive test coverage for skip_special_tokens. The tests properly validate field assignment, serialization/deserialization, and the skip_serializing_if behavior for None values.
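As an illustration of the serialization behavior these tests check (a stripped-down stand-in, not the actual request struct), an Option<bool> field with the same serde attributes behaves like this:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in type; only the relevant field is kept.
#[derive(Serialize, Deserialize)]
struct Ext {
    #[serde(default, skip_serializing_if = "Option::is_none")]
    skip_special_tokens: Option<bool>,
}

fn main() -> Result<(), serde_json::Error> {
    // Absent in the JSON -> deserializes to None thanks to `default`.
    let none: Ext = serde_json::from_str("{}")?;
    assert_eq!(none.skip_special_tokens, None);
    // None is omitted on the wire thanks to `skip_serializing_if`.
    assert_eq!(serde_json::to_string(&none)?, "{}");

    // Explicit values round-trip.
    let set: Ext = serde_json::from_str(r#"{"skip_special_tokens": true}"#)?;
    assert_eq!(set.skip_special_tokens, Some(true));
    assert_eq!(serde_json::to_string(&set)?, r#"{"skip_special_tokens":true}"#);

    // A non-bool value is rejected by serde's type checking.
    assert!(serde_json::from_str::<Ext>(r#"{"skip_special_tokens": "yes"}"#).is_err());
    Ok(())
}
```

The last assertion also shows why the existing request validation only needs the type to be a bool: serde rejects anything else during deserialization.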
lib/llm/src/backend.rs (2)
92-112: LGTM! skip_special_tokens properly plumbed to decoder. The parameter is correctly added to the decoder signature and passed through to the tokenizer's decode_stream.
133-144: LGTM! Default behavior is intentionally conservative. The TODO comment appropriately documents the future consideration to change the default to true (matching vLLM/TRTLLM). The current default of false preserves existing behavior to prevent breaking changes.
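To make the conservative default concrete, here is a hedged sketch of the resolution logic; the struct and function names are illustrative, not the actual backend.rs symbols.

```rust
// Illustrative only: mirrors how an optional per-request flag can be resolved
// to a concrete bool before it reaches the detokenizer.
struct OutputOptions {
    skip_special_tokens: Option<bool>,
}

fn resolve_skip_special_tokens(opts: &OutputOptions) -> bool {
    // TODO (mirroring the PR's comment): consider defaulting to true to match
    // vLLM/TensorRT-LLM once it is verified that nothing breaks.
    opts.skip_special_tokens.unwrap_or(false)
}

fn main() {
    // Unset -> false, preserving today's behavior.
    assert!(!resolve_skip_special_tokens(&OutputOptions { skip_special_tokens: None }));
    // Explicit opt-in -> special tokens are stripped downstream.
    assert!(resolve_skip_special_tokens(&OutputOptions { skip_special_tokens: Some(true) }));
}
```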
lib/llm/src/protocols/openai/chat_completions.rs (2)
202-204: LGTM! Accessor correctly returns the field value. Implementation is consistent with other CommonExtProvider methods.
269-271: LGTM! Delegation and test coverage are appropriate. The OpenAIOutputOptionsProvider correctly delegates to CommonExtProvider, and the tests comprehensively validate propagation of None, Some(true), and Some(false) through the extract_output_options pipeline.
Also applies to: 324-390
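The delegation being reviewed follows a simple pattern. A schematic sketch is below; the trait names match those mentioned in the review, but the definitions, request type, and method set are simplified assumptions rather than the real code.

```rust
// Schematic sketch of the delegation pattern, not the real trait definitions.
trait CommonExtProvider {
    fn get_skip_special_tokens(&self) -> Option<bool>;
}

trait OpenAIOutputOptionsProvider {
    fn get_skip_special_tokens(&self) -> Option<bool>;
}

struct ChatCompletionRequest {
    skip_special_tokens: Option<bool>,
}

impl CommonExtProvider for ChatCompletionRequest {
    fn get_skip_special_tokens(&self) -> Option<bool> {
        // Accessor simply returns the optional field.
        self.skip_special_tokens
    }
}

impl OpenAIOutputOptionsProvider for ChatCompletionRequest {
    fn get_skip_special_tokens(&self) -> Option<bool> {
        // Delegate to the CommonExt accessor so both views agree.
        CommonExtProvider::get_skip_special_tokens(self)
    }
}

fn main() {
    let req = ChatCompletionRequest { skip_special_tokens: Some(false) };
    assert_eq!(OpenAIOutputOptionsProvider::get_skip_special_tokens(&req), Some(false));
}
```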
lib/llm/src/protocols/openai/completions.rs (2)
193-195: LGTM! Accessor implementation is correct. Consistent with the chat_completions.rs implementation.
370-372: LGTM! Implementation and tests are comprehensive. The delegation pattern is correct, and tests thoroughly validate skip_special_tokens propagation for the completions endpoint, mirroring the coverage in chat_completions.rs.
Also applies to: 415-475
```rust
/// Whether to skip special tokens in the decoded output.
/// When true, special tokens (like EOS, BOS, PAD) are removed from the output text.
/// When false, special tokens are included in the output text.
/// Defaults to true if not specified (matching vLLM/TensorRT-LLM behavior).
#[serde(default, skip_serializing_if = "Option::is_none")]
#[builder(default, setter(strip_option))]
pub skip_special_tokens: Option<bool>,
```
Fix documentation inconsistency with actual default behavior.
The documentation states "Defaults to true if not specified (matching vLLM/TensorRT-LLM behavior)" but the actual implementation in lib/llm/src/backend.rs Line 134 uses unwrap_or(false), defaulting to false. This creates confusion about the expected behavior.
Update the documentation to reflect the current default:
```diff
- /// Whether to skip special tokens in the decoded output.
- /// When true, special tokens (like EOS, BOS, PAD) are removed from the output text.
- /// When false, special tokens are included in the output text.
- /// Defaults to true if not specified (matching vLLM/TensorRT-LLM behavior).
+ /// Whether to skip special tokens in the decoded output.
+ /// When true, special tokens (like EOS, BOS, PAD) are removed from the output text.
+ /// When false, special tokens are included in the output text.
+ /// Defaults to false if not specified. A future change may update this to true to match vLLM/TensorRT-LLM behavior.
```

🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/common_ext.rs around lines 76 to 82, the
docstring incorrectly states the default is true while the implementation uses
unwrap_or(false); update the documentation to state the default is false (i.e.,
"Defaults to false if not specified"), remove the parenthetical about matching
vLLM/TensorRT-LLM or change it to reflect the actual behavior, and ensure
serde/builder attributes remain unchanged.
Overview:
- The new skip_special_tokens field defaults to false when not provided, to avoid any unintended breakages.
- There is a TODO to switch the default to true to match other frameworks like vLLM/TRTLLM, but this needs to be further verified that nothing breaks before updating.

Details:
Example request setting it to true on a Qwen3 model with reasoning parser enabled:
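The original example payload is not preserved in this capture; the following is a hedged sketch of what such a request body could look like, built with serde_json. The message content and max_tokens value are illustrative, and the model name assumes the Qwen3 launch command shown earlier; POST the body to /v1/chat/completions.

```rust
use serde_json::json;

fn main() {
    // Illustrative chat completions body with skip_special_tokens enabled.
    let body = json!({
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"}
        ],
        "max_tokens": 64,
        "skip_special_tokens": true
    });
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}
```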
Existing validation applies to this new field to make sure it is a bool.

Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
- Added a skip_special_tokens parameter to both chat completions and completions endpoints. Users can now optionally exclude special tokens from decoded output, with the default behavior unchanged.

Tests

- Added test coverage for skip_special_tokens decoding and its propagation through both endpoints.