Hi TRL team, thanks for GOLD and the vLLM integration.
I’m trying to understand the intended behavior of GOLD’s on-policy generation with vLLM. The issue was originally identified by @simran135 and discussed in https://github.com/siyan-zhao/OPSD/issues/3.
In v0.29.0, this line says:
https://github.com/huggingface/trl/blob/v0.29.0/trl/experimental/gold/gold_trainer.py#L1680
"Decode prompts for vLLM (without special tokens - vLLM expects clean text)"
In `_generate_on_policy_outputs_vllm`, prompts are decoded with `skip_special_tokens=True` before being sent to vLLM:

```python
prompts_text_for_vllm = self.processing_class.batch_decode(
    inputs["prompts"],
    skip_special_tokens=True,  # strips <|im_start|>, <|im_end|>, etc.
)
```
This strips chat-template special tokens (e.g., `<|im_start|>`, `<|im_end|>`, role markers) before vLLM ever sees the prompt. vLLM then receives raw text and re-tokenizes it, rather than receiving the original templated prompt IDs, so the chat-template structure cannot be recovered on the vLLM side.
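To make the information loss concrete, here is a minimal self-contained sketch (a toy stand-in, not the real tokenizer) of why decoding with special tokens stripped and then re-tokenizing cannot round-trip back to the templated prompt:

```python
# Toy illustration only: SPECIAL_TOKENS and toy_decode are stand-ins for the
# tokenizer's special-token handling, not real transformers/vLLM APIs.
SPECIAL_TOKENS = {"<|im_start|>", "<|im_end|>"}

def toy_decode(tokens, skip_special_tokens):
    """Join token strings, optionally dropping special tokens (as
    skip_special_tokens=True does in a real batch_decode call)."""
    kept = [t for t in tokens
            if not (skip_special_tokens and t in SPECIAL_TOKENS)]
    return "".join(kept)

# A hypothetical chat-templated prompt, token by token:
templated = ["<|im_start|>", "user\n", "Hello", "<|im_end|>", "\n",
             "<|im_start|>", "assistant\n"]

clean = toy_decode(templated, skip_special_tokens=True)
full = toy_decode(templated, skip_special_tokens=False)

print(clean)  # "user\nHello\nassistant\n" -- role markers are gone
print(full)   # templated text preserved, including <|im_start|>/<|im_end|>
```

Once the role markers are dropped, no downstream tokenization of `clean` can reconstruct the original template boundaries.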
For chat-template-driven models (e.g., Qwen3), this may change rollout behavior. In particular, if thinking mode depends on template-time controls (such as the `enable_thinking` template kwarg), it may not be activated during on-policy vLLM generation.
Questions
- Is using `skip_special_tokens=True` here intentional?
- What is the reason for requiring “clean text” here, instead of preserving the templated prompt IDs/text?
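For context, one alternative I had in mind (a sketch, assuming vLLM’s token-IDs prompt format; `prompt_ids` is a placeholder for `inputs["prompts"]`) would be to hand vLLM the original IDs directly, bypassing decode/re-tokenize entirely:

```python
# Hedged sketch: pass the already-templated token IDs to vLLM instead of
# decoded text. The IDs below are placeholders for illustration only.
prompt_ids = [[151644, 872, 198]]

# vLLM accepts prompts given as token IDs via this dict shape:
token_prompts = [{"prompt_token_ids": ids} for ids in prompt_ids]

# An engine could then consume these without re-tokenizing, e.g.:
# outputs = llm.generate(token_prompts, sampling_params)
```

This would keep the chat-template special tokens intact during on-policy generation, if that is the desired behavior.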
Looking forward to your response. Thanks!