
[GOLD][vLLM] On-policy generation decodes prompts with skip_special_tokens=True (chat template stripped) — intentional? #5241

@siyan-zhao

Description

Hi TRL team, thanks for GOLD and the vLLM integration.

I’m trying to understand the intended behavior of GOLD’s on-policy generation with vLLM. The behavior in question was originally identified by @simran135 and discussed in https://github.com/siyan-zhao/OPSD/issues/3.

In v0.29.0, the comment on this line reads:

https://github.com/huggingface/trl/blob/v0.29.0/trl/experimental/gold/gold_trainer.py#L1680

"Decode prompts for vLLM (without special tokens - vLLM expects clean text)"

In _generate_on_policy_outputs_vllm, prompts are decoded with skip_special_tokens=True before being sent to vLLM:

prompts_text_for_vllm = self.processing_class.batch_decode(
    inputs["prompts"],
    skip_special_tokens=True,  # strips <|im_start|>, <|im_end|>, etc.
)

This strips chat-template special tokens (e.g., <|im_start|>, <|im_end|>, role markers) before vLLM receives the prompt. As a result, vLLM receives raw text and re-tokenizes it itself, rather than receiving the original templated prompt IDs.
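To make the concern concrete, here is a minimal toy sketch (invented five-token vocabulary, not TRL or vLLM code) of why decoding with skip_special_tokens=True is lossy: re-tokenizing the decoded text can no longer recover the original templated prompt IDs, while a round trip that keeps special tokens can.

```python
# Toy tokenizer with a made-up vocabulary, purely for illustration.
SPECIALS = {0: "<|im_start|>", 1: "<|im_end|>"}
VOCAB = {2: "user", 3: "\n", 4: "hello"}

def decode(ids, skip_special_tokens):
    """Mimic batch_decode for a single sequence: optionally drop special tokens."""
    pieces = []
    for i in ids:
        if i in SPECIALS:
            if not skip_special_tokens:
                pieces.append(SPECIALS[i])
        else:
            pieces.append(VOCAB[i])
    return "".join(pieces)

def encode(text):
    """Greedy longest-match tokenizer over the combined vocabulary,
    standing in for vLLM re-tokenizing the prompt text it receives."""
    table = {**{v: k for k, v in SPECIALS.items()},
             **{v: k for k, v in VOCAB.items()}}
    ids, rest = [], text
    while rest:
        for tok in sorted(table, key=len, reverse=True):
            if rest.startswith(tok):
                ids.append(table[tok])
                rest = rest[len(tok):]
                break
        else:
            raise ValueError(f"cannot tokenize: {rest!r}")
    return ids

# Templated prompt: <|im_start|>user\nhello<|im_end|>
templated_ids = [0, 2, 3, 4, 1]

# skip_special_tokens=True: template markers are gone, round trip is lossy.
clean = decode(templated_ids, skip_special_tokens=True)
assert clean == "user\nhello"
assert encode(clean) == [2, 3, 4]  # != templated_ids

# skip_special_tokens=False: the round trip is lossless.
kept = decode(templated_ids, skip_special_tokens=False)
assert encode(kept) == templated_ids
```

In the real setting the mismatch is the same shape: vLLM would generate from the "clean" IDs on the left, not from the chat-templated IDs the trainer originally built.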

For chat-template-driven models (e.g., Qwen3), this may change rollout behavior. In particular, if thinking mode is controlled at template time (e.g., via template kwargs such as Qwen3's enable_thinking), it may not be activated during on-policy vLLM generation.

Questions

  1. Is using skip_special_tokens=True here intentional?
  2. What is the reason for requiring “clean text” instead of preserving templated prompt IDs/text?

Looking forward to your response. Thanks!
