Conversation

@rmccorm4
Contributor

@rmccorm4 rmccorm4 commented Nov 7, 2025

Overview:

  • Adds support for a skip_special_tokens bool parameter in completions and chat_completions requests
  • The default remains false when not provided, to avoid any unintended breakage
  • A TODO is left in the code for a future follow-up to consider changing the default to true to match other frameworks like vLLM/TRTLLM, but it needs to be verified that nothing breaks before updating.

Details:

Example request setting it to true on a Qwen3 model with reasoning parser enabled:

(dynamo) rmccormick@ced35d0-lcedt:~/dynamo/dynamo$ curl localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '
{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "What is a good name for a dog?"}],
  "stream": false,
  "max_tokens": 1024,
  "skip_special_tokens": true
}' | jq

{
  "id": "chatcmpl-54ec61b1-8dfb-4eff-aac5-b7b5c620ba31",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "A good name for a dog could be something that reflects its traits, is easy to pronounce, and has a positive meaning. Here are a few suggestions:1. **Charlie** – Known for its friendly and loyal nature.2. **Buddy** – Symbolizes companionship and loyalty.3. **Max** – A classic and simple name.4. **Whiskers** – A playful and affectionate name.5. **Poodle** – A unique and popular choice for a dog.Choose a name that resonates with your dog’s personality and ensures it’s not taken by another dog. Let me know if you'd like more options!",
        "role": "assistant",
        "reasoning_content": "\nOkay, the user is asking for a good name for a dog. Let me start by thinking about common dog names. They might be looking for something catchy and memorable. Maybe something that's easy to spell and pronounce. Also, considering cultural aspects, some names are more popular in certain regions.\n\nI should think about the meaning behind the name. Maybe something that represents the dog's traits, like loyalty or companionship. Words like \"Buddy\" or \"Charlie\" come to mind. But I need to make sure the name is unique and not taken by another dog. \n\nWait, the user might be a dog owner looking for a name for their dog, so the name should be appropriate. Maybe include a word that means something positive, like \"Happy\" or \"Lucky.\" Also, considering different languages, maybe some names are more common in English, like \"Max\" or \"Buddy.\" \n\nI should also think about the name's length. Maybe two or three syllables. Let me brainstorm a few options: \"Buddy,\" \"Charlie,\" \"Max,\" \"Whiskers,\" \"Poodle.\" Are there any that I'm missing? Maybe \"Sphynx\" as a unique name, but that's more for a cat. \n\nI need to check if these names are commonly used. For example, \"Charlie\" is a popular name, and \"Buddy\" is also widely used. \"Max\" is another common choice. Maybe suggest a few options and explain why they're good. Also, mention that it's best to pick something that's not taken. \n\nWait, the user might not know what's popular, so providing a list of options with explanations would be helpful. Also, maybe include a note about considering the name's meaning and uniqueness. That should cover their needs.\n"
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1762482887,
  "model": "Qwen/Qwen3-0.6B",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 510,
    "total_tokens": 527
  }
}

Existing request validation applies to this new field, ensuring it is a bool:

(dynamo) rmccormick@ced35d0-lcedt:~/dynamo/dynamo$ curl localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '
{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "What is a good name for a dog?"}],
  "stream": false,
  "max_tokens": 1024,
  "skip_special_tokens": 123   
}' | jq

{
  "message": "Failed to deserialize the JSON body into the target type: invalid type: integer `123`, expected a boolean at line 9 column 1",
  "type": "Bad Request",
  "code": 400
}
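
For context, this error comes from serde's strict typing on the optional field rather than any hand-written validation. A minimal, self-contained sketch of the behavior (the struct and field layout here are illustrative, not the actual CommonExt definition):

use serde::{Deserialize, Serialize};

// Illustrative stand-in for the request extension; only the relevant field is shown.
#[derive(Debug, Serialize, Deserialize)]
struct ExampleRequest {
    model: String,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    skip_special_tokens: Option<bool>,
}

fn main() {
    // A boolean deserializes fine; an absent field becomes None.
    let ok: ExampleRequest =
        serde_json::from_str(r#"{"model": "m", "skip_special_tokens": true}"#).unwrap();
    assert_eq!(ok.skip_special_tokens, Some(true));

    // An integer is rejected with "invalid type: integer `123`, expected a boolean".
    let err = serde_json::from_str::<ExampleRequest>(
        r#"{"model": "m", "skip_special_tokens": 123}"#,
    )
    .unwrap_err();
    println!("{err}");
}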

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added skip_special_tokens parameter to both chat completions and completions endpoints. Users can now optionally exclude special tokens from decoded output, with the default behavior unchanged.
  • Tests

    • Added tests verifying special token filtering during decoding.

@rmccorm4
Contributor Author

rmccorm4 commented Nov 7, 2025

This has an impact on user flows. For example, with reasoning parsers turned on, the content field of the response might contain content="<|im_end|>" instead of content="\n". This puts the onus on clients to do model-specific postprocessing of outputs.

@2ez4bz can you give me an example model + curl request that would reproduce this <|im_end|> in the output so I can verify the scenario you described?

In the PR description I started a Qwen3 model with a reasoning parser, but didn't see any special tokens in the responses, so I don't have an easily noticeable effect to demonstrate when setting skip_special_tokens to true.

python -m dynamo.vllm --model Qwen/Qwen3-0.6B --dyn-reasoning-parser qwen3 --enforce-eager --connector none

However, I was able to see it working when set to true (due to your other bug about add_generation_prompt) on a tool-call example with a Llama model, like so:

python -m dynamo.vllm --model meta-llama/Llama-3.1-8B-Instruct --dyn-tool-call-parser llama3_json --custom-jinja-template ~/dynamo/tool_calling/llama3/tool_chat_template_llama3.1_json.jinja --connector none --enforce-eager

With the add_generation_prompt bug, the output included special tokens like content='<|start_header_id|>assistant<|end_header_id|>; with skip_special_tokens=True set after the changes in this PR, it ends up as content='assistant instead.

However, I think those special tokens shouldn't show up in the output at all after #4168 is merged, which is why I'm looking for a different test case that doesn't rely on a bug to observe.

@rmccorm4 rmccorm4 marked this pull request as ready for review November 7, 2025 06:24
@rmccorm4 rmccorm4 requested a review from a team as a code owner November 7, 2025 06:24
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

A new skip_special_tokens feature is added to control token decoding behavior. The optional boolean field is introduced to the CommonExt struct, propagated through OpenAI protocol providers for chat and completions endpoints, and plumbed into the backend decoder function.
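
A rough sketch of that flow, with simplified stand-in types (the real definitions live in common_ext.rs, chat_completions.rs, completions.rs, and backend.rs; signatures and field names below are assumptions, not the exact code):

#[derive(Default)]
pub struct CommonExt {
    // Optional per-request override; None means "use the server default".
    pub skip_special_tokens: Option<bool>,
}

pub trait CommonExtProvider {
    fn get_skip_special_tokens(&self) -> Option<bool>;
}

// Stand-in for a request type such as NvCreateChatCompletionRequest, which
// delegates the accessor to its embedded CommonExt.
pub struct ChatRequest {
    pub common: CommonExt,
}

impl CommonExtProvider for ChatRequest {
    fn get_skip_special_tokens(&self) -> Option<bool> {
        self.common.skip_special_tokens
    }
}

// On the backend side the option is resolved to a concrete bool before it is
// handed to the decoder; the PR keeps the conservative default of false.
pub fn resolve_skip_special_tokens(req: &impl CommonExtProvider) -> bool {
    req.get_skip_special_tokens().unwrap_or(false)
}

fn main() {
    let explicit = ChatRequest { common: CommonExt { skip_special_tokens: Some(true) } };
    assert!(resolve_skip_special_tokens(&explicit));

    let unset = ChatRequest { common: CommonExt::default() };
    assert!(!resolve_skip_special_tokens(&unset)); // unset requests keep today's behavior
}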

Changes

  • Core Backend Integration (lib/llm/src/backend.rs): Added a skip_special_tokens parameter to the Backend::decoder function signature; updated the generate flow to compute skip_special_tokens from request.output_options and pass it to the decoder.
  • Output Options Protocol Definition (lib/llm/src/protocols/openai/common_ext.rs): Added a new optional skip_special_tokens: Option<bool> field to the CommonExt struct with serde and builder annotations; introduced a get_skip_special_tokens(&self) -> Option<bool> accessor method on the CommonExtProvider trait.
  • Documentation Updates (lib/llm/src/protocols/common.rs): Removed inline documentation describing the "spaces_between_special_tokens" field from the OutputOptions struct.
  • OpenAI Chat Completions Integration (lib/llm/src/protocols/openai/chat_completions.rs): Added a get_skip_special_tokens accessor to the CommonExtProvider implementation for NvCreateChatCompletionRequest; updated OpenAIOutputOptionsProvider to delegate get_skip_special_tokens to the accessor; added unit tests covering None, Some(true), and Some(false) propagation scenarios.
  • OpenAI Completions Integration (lib/llm/src/protocols/openai/completions.rs): Added a get_skip_special_tokens accessor to the CommonExtProvider implementation for NvCreateCompletionRequest; updated OpenAIOutputOptionsProvider to delegate get_skip_special_tokens; added unit tests verifying propagation for None, Some(true), and Some(false) cases.
  • Integration Testing (lib/llm/tests/tokenizers.rs): Added a test_decode_with_skip_special_tokens integration test verifying that tokens are decoded with and without special token markers based on the flag setting.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify skip_special_tokens is correctly threaded from OpenAI request objects through providers to the backend decoder function.
  • Confirm the CommonExt struct field is properly serialized/deserialized by serde, and that builder annotations work correctly.
  • Validate that both NvCreateChatCompletionRequest and NvCreateCompletionRequest implementations are consistent and complete.
  • Review test coverage across all three propagation scenarios (None, Some(true), Some(false)) in both chat and completions modules.
  • Verify the integration test correctly asserts behavior differences when decoding with and without special tokens.

Poem

🐰 A flag hops through the code with care,
Special tokens skip through the air!
From CommonExt to decoder's call,
Skipping markers, one and all.
Tests confirm the journey's true! 🎉

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title accurately describes the main feature: adding skip_special_tokens parameter support to both v1/completions and v1/chat/completions endpoints.
  • Description check: ✅ Passed. The description covers the key sections with details about the feature, example requests/responses, validation behavior, and related issue reference.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


@rmccorm4
Contributor Author

rmccorm4 commented Nov 7, 2025

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and f498b0c.

📒 Files selected for processing (6)
  • lib/llm/src/backend.rs (2 hunks)
  • lib/llm/src/protocols/common.rs (0 hunks)
  • lib/llm/src/protocols/openai/chat_completions.rs (3 hunks)
  • lib/llm/src/protocols/openai/common_ext.rs (7 hunks)
  • lib/llm/src/protocols/openai/completions.rs (3 hunks)
  • lib/llm/tests/tokenizers.rs (1 hunks)
💤 Files with no reviewable changes (1)
  • lib/llm/src/protocols/common.rs
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The skip_tokenizer_init=True path in SGLang backend bypasses tokenization but has array slicing overhead in _process_token_stream that creates O(n) memory copying on every stream chunk, potentially causing quadratic behavior for long sequences.
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/protocols/openai/common_ext.rs
📚 Learning: 2025-09-08T21:18:43.478Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2936
File: lib/parsers/src/reasoning/granite_parser.rs:42-46
Timestamp: 2025-09-08T21:18:43.478Z
Learning: In GraniteReasoningParser in lib/parsers/src/reasoning/granite_parser.rs, the think_start_tokens and think_end_tokens are hardcoded in the constructor with fixed values, so unwrap() calls on these vectors are safe and won't panic.

Applied to files:

  • lib/llm/src/protocols/openai/common_ext.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions.rs
  • lib/llm/src/protocols/openai/completions.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions.rs
📚 Learning: 2025-09-19T07:32:44.210Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The skip_tokenizer_init=True path in SGLang backend bypasses tokenization but has array slicing overhead in _process_token_stream that creates O(n) memory copying on every stream chunk, potentially causing quadratic behavior for long sequences.

Applied to files:

  • lib/llm/src/backend.rs
🧬 Code graph analysis (5)
lib/llm/tests/tokenizers.rs (2)
lib/llm/src/tokenizers.rs (5)
  • tokenizer (340-342)
  • from_file (83-85)
  • text (348-353)
  • token_ids (38-43)
  • token_ids (344-346)
lib/llm/src/tokenizers/hf.rs (1)
  • from_file (16-21)
lib/llm/src/protocols/openai/common_ext.rs (3)
lib/llm/src/protocols/openai/completions.rs (2)
  • get_skip_special_tokens (193-195)
  • get_skip_special_tokens (370-372)
lib/llm/src/protocols/openai/chat_completions.rs (2)
  • get_skip_special_tokens (202-204)
  • get_skip_special_tokens (269-271)
lib/llm/src/protocols/openai.rs (1)
  • get_skip_special_tokens (85-85)
lib/llm/src/protocols/openai/chat_completions.rs (2)
lib/llm/src/protocols/openai/completions.rs (5)
  • get_skip_special_tokens (193-195)
  • get_skip_special_tokens (370-372)
  • test_skip_special_tokens_none (422-438)
  • test_skip_special_tokens_true_propagates (441-456)
  • test_skip_special_tokens_false_propagates (459-474)
lib/llm/src/protocols/openai/common_ext.rs (1)
  • get_skip_special_tokens (112-112)
lib/llm/src/backend.rs (1)
lib/llm/src/tokenizers.rs (4)
  • tokenizer (340-342)
  • new (176-191)
  • new (268-275)
  • new (495-503)
lib/llm/src/protocols/openai/completions.rs (3)
lib/llm/src/protocols/openai/chat_completions.rs (5)
  • get_skip_special_tokens (202-204)
  • get_skip_special_tokens (269-271)
  • test_skip_special_tokens_none (331-349)
  • test_skip_special_tokens_true_propagates (352-369)
  • test_skip_special_tokens_false_propagates (372-389)
lib/llm/src/protocols/openai/common_ext.rs (1)
  • get_skip_special_tokens (112-112)
lib/llm/src/protocols/openai.rs (1)
  • get_skip_special_tokens (85-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
lib/llm/tests/tokenizers.rs (1)

182-255: LGTM! Comprehensive test for skip_special_tokens.

The test properly validates both decoding modes (with and without special tokens) and includes appropriate assertions for token markers, content preservation, and length comparison.
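
The shape of such a test can be sketched against the upstream tokenizers crate directly (the repository wraps its own tokenizer types, so the file path, input, and exact assertions below are assumptions rather than the actual test):

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Hypothetical tokenizer.json path; the real test loads its own fixture.
    let tokenizer = Tokenizer::from_file("tests/data/tokenizer.json")?;

    let encoding = tokenizer.encode("What is a good name for a dog?", true)?;
    let ids = encoding.get_ids();

    // skip_special_tokens = false keeps markers such as BOS/EOS in the text.
    let with_special = tokenizer.decode(ids, false)?;
    // skip_special_tokens = true strips them, leaving only the plain text.
    let without_special = tokenizer.decode(ids, true)?;

    assert!(without_special.contains("good name for a dog"));
    assert!(with_special.len() >= without_special.len());
    Ok(())
}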

lib/llm/src/protocols/openai/common_ext.rs (2)

111-112: LGTM!

The trait method signature is consistent with other methods in CommonExtProvider.


238-284: LGTM! Comprehensive test coverage for skip_special_tokens.

The tests properly validate field assignment, serialization/deserialization, and the skip_serializing_if behavior for None values.

lib/llm/src/backend.rs (2)

92-112: LGTM! skip_special_tokens properly plumbed to decoder.

The parameter is correctly added to the decoder signature and passed through to the tokenizer's decode_stream.
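
For streaming decode, the mode is typically fixed once when the stream is created. A rough sketch assuming the upstream tokenizers crate's DecodeStream API (recent versions only; the repo's Backend::decoder wraps its own types, so this is an approximation, not the actual implementation):

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let tokenizer = Tokenizer::from_file("tests/data/tokenizer.json")?;

    // The resolved request option picks the decoding mode up front and then
    // applies to every token that flows through the stream.
    let skip_special_tokens = true;
    let mut stream = tokenizer.decode_stream(skip_special_tokens);

    // Hypothetical token ids; step() yields Some(text) once a well-formed
    // chunk can be emitted and None while it is still buffering.
    for &token_id in &[9906u32, 11, 1917] {
        if let Some(chunk) = stream.step(token_id)? {
            print!("{chunk}");
        }
    }
    Ok(())
}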


133-144: LGTM! Default behavior is intentionally conservative.

The TODO comment appropriately documents the future consideration to change the default to true (matching vLLM/TRTLLM). The current default of false preserves existing behavior to prevent breaking changes.

lib/llm/src/protocols/openai/chat_completions.rs (2)

202-204: LGTM! Accessor correctly returns the field value.

Implementation is consistent with other CommonExtProvider methods.


269-271: LGTM! Delegation and test coverage are appropriate.

The OpenAIOutputOptionsProvider correctly delegates to CommonExtProvider, and the tests comprehensively validate propagation of None, Some(true), and Some(false) through the extract_output_options pipeline.

Also applies to: 324-390

lib/llm/src/protocols/openai/completions.rs (2)

193-195: LGTM! Accessor implementation is correct.

Consistent with the chat_completions.rs implementation.


370-372: LGTM! Implementation and tests are comprehensive.

The delegation pattern is correct, and tests thoroughly validate skip_special_tokens propagation for the completions endpoint, mirroring the coverage in chat_completions.rs.

Also applies to: 415-475

Comment on lines +76 to +82
/// Whether to skip special tokens in the decoded output.
/// When true, special tokens (like EOS, BOS, PAD) are removed from the output text.
/// When false, special tokens are included in the output text.
/// Defaults to true if not specified (matching vLLM/TensorRT-LLM behavior).
#[serde(default, skip_serializing_if = "Option::is_none")]
#[builder(default, setter(strip_option))]
pub skip_special_tokens: Option<bool>,

⚠️ Potential issue | 🟠 Major

Fix documentation inconsistency with actual default behavior.

The documentation states "Defaults to true if not specified (matching vLLM/TensorRT-LLM behavior)" but the actual implementation in lib/llm/src/backend.rs Line 134 uses unwrap_or(false), defaulting to false. This creates confusion about the expected behavior.

Update the documentation to reflect the current default:

-    /// Whether to skip special tokens in the decoded output.
-    /// When true, special tokens (like EOS, BOS, PAD) are removed from the output text.
-    /// When false, special tokens are included in the output text.
-    /// Defaults to true if not specified (matching vLLM/TensorRT-LLM behavior).
+    /// Whether to skip special tokens in the decoded output.
+    /// When true, special tokens (like EOS, BOS, PAD) are removed from the output text.
+    /// When false, special tokens are included in the output text.
+    /// Defaults to false if not specified. A future change may update this to true to match vLLM/TensorRT-LLM behavior.
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/common_ext.rs around lines 76 to 82, the
docstring incorrectly states the default is true while the implementation uses
unwrap_or(false); update the documentation to state the default is false (i.e.,
"Defaults to false if not specified"), remove the parenthetical about matching
vLLM/TensorRT-LLM or change it to reflect the actual behavior, and ensure
serde/builder attributes remain unchanged.

