Commit bd7a4a1

Merge branch 'main' into VESPO
2 parents 76d1d1f + 370eee7 commit bd7a4a1

22 files changed: +733 −259 lines changed

docs/source/grpo_trainer.md

Lines changed: 2 additions & 1 deletion
@@ -695,7 +695,8 @@ trainer.train()
 
 Tested with:
 
-- **Qwen3** — e.g., `Qwen/Qwen3-0.6B`
+- [**Qwen3**](https://huggingface.co/collections/Qwen/qwen3) — e.g., `Qwen/Qwen3-0.6B`
+- [**Qwen3.5**](https://huggingface.co/collections/Qwen/qwen35) — e.g., `Qwen/Qwen3.5-2B`
 
 > [!TIP]
 > Compatibility with all LLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.

docs/source/vllm_integration.md

Lines changed: 13 additions & 27 deletions
@@ -3,7 +3,7 @@
 This document will guide you through the process of using vLLM with TRL for faster generation in online methods like GRPO and Online DPO. We first summarize a tl;dr on how to use vLLM with TRL, and then we will go into the details of how it works under the hood.
 
 > [!WARNING]
-> TRL currently only supports vLLM versions `0.10.2`, `0.11.0`, `0.11.1`, `0.11.2` and `0.12.0`. Please ensure you have one of these versions installed to avoid compatibility issues.
+> TRL currently only supports vLLM versions from `0.10.2` to `0.14.1`. Please ensure you have a version in this range installed to avoid compatibility issues.
 
 > [!TIP]
 > The following trainers currently support generation with vLLM:
@@ -31,12 +31,12 @@ pip install "trl[vllm]"
 Then run the server on specific GPUs (e.g., GPUs 0-3):
 
 ```sh
-CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2
+CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 4
 ```
 
 Once the server is running, you can use it to generate completions for training. In the example below, we use the different supported trainers with the vLLM server for generation. The `--tensor-parallel-size` and `--data-parallel-size` arguments control how the model and data are sharded across GPUs.
 
-In this example, we are sharding two copies of the model across 4 GPUs. Increasing data parallelism increases throughput, while increasing tensor parallelism allows for serving larger models. Then, run the training script on different GPUs (e.g., GPUs 4-7) by passing `use_vllm=True` in the training arguments as follows:
+In this example, we shard one model across 4 GPUs with tensor parallelism. Then, run the training script on different GPUs (e.g., GPUs 4-7) by passing `use_vllm=True` in the training arguments as follows:
 
 Sample of a simple `train.py` script:
@@ -166,19 +166,15 @@ If you've ever done autoregressive decoder training, you know all the input toke
 When you run for example
 
 ```sh
-CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 1 --data-parallel-size 4
+CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 4
 ```
 
-the following happens:
+1. vLLM first spawns multiple workers to handle incoming requests in parallel. The number of workers is determined by multiplying the `--tensor-parallel-size` and `--data-parallel-size` values. In this example, it spawns 4 workers (4 × 1).
+   Each worker operates independently and processes a chunk of the incoming requests — which are basically the prompts sent to the server for generation.
 
-![vllm](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/vllm-doc.png)
+2. Once the incoming requests (prompts) are distributed across the workers, the model starts generating completions. Internally, the model’s weights are split across multiple GPUs based on the `--tensor-parallel-size` argument — this is how tensor parallelism is handled.
 
-1. vLLM first spawns multiple workers to handle incoming requests in parallel. The number of workers is determined by multiplying the `--tensor-parallel-size` and `--data-parallel-size` values. In this example, it spawns 4 workers (1 × 4).
-   Each worker operates independently and processes a chunk of the incoming requests — which are basically the prompts sent to the server for generation. A key point to understand is that these 4 workers are running in parallel, and each one is responsible for handling a subset of the total incoming load.
-
-2. Once the incoming requests (prompts) are distributed across the workers, the model starts generating completions. Internally, the model’s weights are split across multiple GPUs based on the `--tensor-parallel-size` argument — this is how tensor parallelism is handled. Meanwhile, data parallelism (controlled by `--data-parallel-size`) ensures that different sets of requests are processed independently across the workers. In short: tensor parallelism splits the model across GPUs, and data parallelism splits the batch of requests across different model replicas.
-
-3. Although the GPUs process requests independently and in parallel, they still need to communicate with each other. Remember that each GPU handles only a slice of the incoming prompts (for example, with 4 GPUs and 8 prompts using `--data-parallel-size=4`, each GPU processes 2 prompts).
+3. Although the GPUs process requests in parallel, they still need to communicate with each other. Remember that with `--tensor-parallel-size=4`, each GPU holds only a shard of the model's weights, so all 4 GPUs cooperate to serve every prompt.
    This GPU-to-GPU communication is managed efficiently by NVIDIA’s NCCL library. The communication mainly ensures that each GPU gets its correct portion of the incoming requests — it’s lightweight and doesn’t interfere with generation itself.
    Separately, the number of completions to generate per prompt is controlled by the `num_generations` setting in the GRPO config. For instance, if you set `num_generations=2`, each prompt will have 2 completions. So, with 8 prompts and `num_generations=2`, you would end up with 16 completions total — regardless of the number of GPUs or parallelism settings.

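The worker and completion counts in the steps above reduce to simple arithmetic. A minimal sketch (the helper names are ours, not TRL's):

```python
def num_workers(tensor_parallel_size: int, data_parallel_size: int) -> int:
    # vLLM spawns tensor_parallel_size * data_parallel_size workers in total
    return tensor_parallel_size * data_parallel_size


def total_completions(num_prompts: int, num_generations: int) -> int:
    # Completion count is independent of GPU count and parallelism settings
    return num_prompts * num_generations


print(num_workers(4, 1))        # 4 workers, as in the example above
print(total_completions(8, 2))  # 16 completions
```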
@@ -224,7 +220,9 @@ options:
   --tensor_parallel_size TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE
                         Number of tensor parallel workers to use. (default: 1)
   --data_parallel_size DATA_PARALLEL_SIZE, --data-parallel-size DATA_PARALLEL_SIZE
-                        Number of data parallel workers to use. (default: 1)
+                        Number of data parallel workers to use. For dense models, keep this at 1. Starting from vLLM `0.14.0`,
+                        setting this above `1` for dense models is no longer supported and will error out (see vLLM PR #30739).
+                        (default: 1)
   --host HOST           Host address to run the server on. (default: 0.0.0.0)
   --port PORT           Port to run the server on. (default: 8000)
   --gpu_memory_utilization GPU_MEMORY_UTILIZATION, --gpu-memory-utilization GPU_MEMORY_UTILIZATION
@@ -259,20 +257,8 @@ options:
 ![tp dp throughput 8 gpus](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tp_dp_throughput_8_gpus.png)
 ![tp dp throughput 4 gpus](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tp_dp_throughput_4_gpus.png)
 
-First and foremost, always remember that the optimal setup depends on:
-
-- The model size
-- The number of GPUs you have
-- The GPU memory size
-- The batch size you are using
-- The number of requests you are sending to the server (prompts)
-- The `max_model_len` you are using (this is the max length of the input sequence that the model can process, a.k.a. the context window size)
-- The number of completions you are generating for each request (`num_generations`)
-
-Given these factors, our experiments on the Qwen model family (3B, 7B, 14B, 32B) using 8 H100 GPUs show that:
-
-- For reasonable-sized models (3B–14B) and a moderate context window (`max_len < 8k`), using full capacity for data parallelism gives better throughput. The setup `(tp=1, dp=8)` yields the best results.
-- For larger models (32B) and longer context windows (`max_len > 8k`), a smaller DP size combined with some model-side parallelism performs better. For example, `(tp=2, dp=4)` is a good setup for 32B models with a larger context window.
+> [!WARNING]
+> The benchmark plots above were collected with older vLLM versions. Starting with [vLLM PR #30739](https://github.com/vllm-project/vllm/pull/30739) (released in `0.14.0`), offline data parallel scaling for non-MoE (dense) models is no longer supported. To follow the latest recommendations, do not scale DP for non-MoE models.
 
 ### vLLM with Transformers Backend

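Given the warning above, the parallelism choice for dense models collapses to a single rule. A minimal sketch (the helper name is ours):

```python
def dense_parallelism(num_gpus: int) -> dict:
    # Post-vLLM-0.14.0 guidance for dense (non-MoE) models: put all GPUs
    # into tensor parallelism and leave data parallelism at 1.
    return {"tensor_parallel_size": num_gpus, "data_parallel_size": 1}


print(dense_parallelism(4))  # {'tensor_parallel_size': 4, 'data_parallel_size': 1}
```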
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ test = [
     "pytest"
 ]
 vllm = [
-    "vllm>=0.10.2,<0.13.0",
+    "vllm>=0.10.2,<=0.14.1",
     "fastapi",
     "pydantic",
     "requests",
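The widened bound above can be sanity-checked before launching the server. A stdlib-only sketch of the `>=0.10.2,<=0.14.1` constraint (the `supported` helper is ours; a naive parser, adequate only for plain `X.Y.Z` release tags):

```python
def parse(version: str) -> tuple:
    # Naive version parser: fine for plain "X.Y.Z" release tags like vLLM's
    return tuple(int(part) for part in version.split("."))


MIN_VLLM, MAX_VLLM = parse("0.10.2"), parse("0.14.1")


def supported(version: str) -> bool:
    # Mirrors the pyproject constraint vllm>=0.10.2,<=0.14.1
    return MIN_VLLM <= parse(version) <= MAX_VLLM


print(supported("0.12.0"))  # True
print(supported("0.15.0"))  # False
```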

scripts/generate_tiny_models.py

Lines changed: 27 additions & 1 deletion
@@ -71,6 +71,8 @@
     Qwen2ForSequenceClassification,
     Qwen2VLConfig,
     Qwen2VLForConditionalGeneration,
+    Qwen3_5Config,
+    Qwen3_5ForConditionalGeneration,
     Qwen3Config,
     Qwen3ForCausalLM,
     Qwen3ForSequenceClassification,
@@ -325,9 +327,10 @@ def init_weights_tiny_model(model):
     ("Qwen/Qwen2-VL-2B-Instruct", Qwen2VLForConditionalGeneration, torch.bfloat16),
     ("Qwen/Qwen2.5-VL-3B-Instruct", Qwen2_5_VLForConditionalGeneration, torch.bfloat16),
     ("Qwen/Qwen3-VL-2B-Instruct", Qwen3VLForConditionalGeneration, torch.bfloat16),
+    ("Qwen/Qwen3.5-0.8B", Qwen3_5ForConditionalGeneration, torch.bfloat16),
 ]:
     processor = AutoProcessor.from_pretrained(model_id)
-    generation_config = GenerationConfig.from_pretrained(model_id)
+    generation_config = GenerationConfig.from_pretrained(model_id) if model_id != "Qwen/Qwen3.5-0.8B" else None
 
     text_config = {
         "num_hidden_layers": 2,
@@ -371,13 +374,36 @@ def init_weights_tiny_model(model):
         vision_config["depth"] = 2
         vision_config["out_hidden_size"] = 16
 
+    if issubclass(model_class.config_class, Qwen3_5Config):
+        # For tiny layer counts, default `layer_types` can end up with no full-attention layers (e.g. 2 layers and
+        # default interval 4), which breaks Qwen3.5 dynamic cache logic. Keep one full-attention layer at the end.
+        text_config["layer_types"] = ["linear_attention", "full_attention"]
+        text_config["full_attention_interval"] = 2
+        # Qwen3.5-VL vision config expects `depth`/`num_heads`, not `num_hidden_layers`/`num_attention_heads`.
+        vision_config.pop("num_hidden_layers", None)
+        vision_config.pop("num_attention_heads", None)
+        vision_config.pop("num_key_value_heads", None)
+        vision_config.pop("embed_dim", None)
+        vision_config["depth"] = 2
+        vision_config["num_heads"] = 4
+        vision_config["intermediate_size"] = 32
+        vision_config["out_hidden_size"] = 16
+
     if model_id == "llava-hf/llava-v1.6-mistral-7b-hf":
         # Hotfix: llava-hf/llava-v1.6-mistral-7b-hf mistakenly sets text_config.dtype to "bfloat16".
         # See https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/discussions/46
         text_config["dtype"] = None
 
     config = AutoConfig.from_pretrained(model_id, text_config=text_config, vision_config=vision_config, **kwargs)
     model = model_class(config).to(dtype=dtype)
+
+    if issubclass(model_class.config_class, Qwen3_5Config):
+        # Qwen3.5 models have some weights in float32; to mirror this in the tiny model we convert them back to float32 manually.
+        for layer in model.model.language_model.layers:
+            if hasattr(layer, "linear_attn"):  # applies to linear attention layers only
+                layer.linear_attn.A_log.data = layer.linear_attn.A_log.data.float()
+                layer.linear_attn.norm.weight.data = layer.linear_attn.norm.weight.data.float()
+
     push_to_hub(model, processor, generation_config, "tiny")
 
 # PEFT models
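The `layer_types` workaround in the diff above guards against a degenerate pattern: with only 2 layers and a full-attention interval of 4, no layer would be full attention. A hypothetical sketch of that interval pattern (the function is ours, not the actual transformers implementation):

```python
def interval_layer_types(num_layers: int, full_attention_interval: int = 4) -> list:
    # Every `full_attention_interval`-th layer is full attention, the rest
    # linear; this mirrors the pattern the diff's comment describes.
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


print(interval_layer_types(2))                             # no full-attention layer
print(interval_layer_types(2, full_attention_interval=2))  # keeps one at the end
```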

tests/test_chat_template_utils.py

Lines changed: 29 additions & 16 deletions
@@ -114,6 +114,7 @@ def test_clone_with_sequence_classification_model(self):
     "tokenizer_name",
     [
         pytest.param("trl-internal-testing/tiny-Qwen3MoeForSequenceClassification", id="qwen3"),
+        pytest.param("trl-internal-testing/tiny-Qwen3_5ForConditionalGeneration", id="qwen35"),
     ],
 )
 @pytest.mark.xfail(
@@ -217,6 +218,14 @@ def test_non_prefix_preserving_template(self):
     "tokenizer_name",
     [
         pytest.param("trl-internal-testing/tiny-Qwen3MoeForSequenceClassification", id="qwen3"),
+        pytest.param(
+            "trl-internal-testing/tiny-Qwen3_5ForConditionalGeneration",
+            id="qwen35",
+            marks=pytest.mark.skipif(
+                Version(transformers.__version__) < Version("5.0.0"),
+                reason="Qwen3.5 tokenizer requires transformers>=5.0.0",
+            ),
+        ),
     ],
 )
 class TestGetTrainingChatTemplate:
@@ -302,21 +311,6 @@ def test_behavior_unchanged_assistant_with_tool_calls(self, tokenizer_name):
         after = tokenizer.apply_chat_template(messages, tokenize=False, chat_template=new_chat_template)
         assert before == after
 
-    def test_behavior_unchanged_assistant_with_tool_calls_with_string_arguments(self, tokenizer_name):
-        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
-        messages = [
-            {"role": "user", "content": "Multiply 3 by 4."},
-            {
-                "role": "assistant",
-                "content": "I will call a tool.",
-                "tool_calls": [{"name": "multiply", "arguments": '{"a": 3, "b": 4}'}],
-            },
-        ]
-        before = tokenizer.apply_chat_template(messages, tokenize=False)
-        new_chat_template = get_training_chat_template(tokenizer)
-        after = tokenizer.apply_chat_template(messages, tokenize=False, chat_template=new_chat_template)
-        assert before == after
-
     def test_behavior_unchanged_with_tools_with_and_without_system_message(self, tokenizer_name):
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
         tools = [
@@ -388,6 +382,7 @@ def test_behavior_unchanged_generation_prompt_with_enable_thinking_false(self, t
     "tokenizer_name",
     [
         pytest.param("trl-internal-testing/tiny-Qwen3MoeForSequenceClassification", id="qwen3"),
+        pytest.param("trl-internal-testing/tiny-Qwen3_5ForConditionalGeneration", id="qwen35"),
     ],
 )
 @pytest.mark.xfail(
@@ -417,7 +412,11 @@ def test_parse_response_with_reasoning_content(self, tokenizer_name):
             {"role": "user", "content": "What is 3*4?"},
             {"role": "assistant", "reasoning_content": "Hmmm.", "content": "12"},
         ]
-        prefix = tokenizer.apply_chat_template(messages[:1], add_generation_prompt=True).input_ids
+        # enable_thinking=True is required here because for Qwen3.5, thinking is disabled by default for the
+        # generation prompt.
+        prefix = tokenizer.apply_chat_template(
+            messages[:1], add_generation_prompt=True, enable_thinking=True
+        ).input_ids
         text = tokenizer.apply_chat_template(messages).input_ids
         response = text[len(prefix) :]
         parsed = parse_response(tokenizer, response)
@@ -451,6 +450,20 @@ def test_parse_response_tool_call_with_content(self, tokenizer_name):
         parsed = parse_response(tokenizer, response)
         assert parsed == messages[-1]
 
+    def test_parse_response_tool_call_without_arguments(self, tokenizer_name):
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+        tokenizer = add_response_schema(tokenizer)
+        tool_calls = [{"type": "function", "function": {"name": "ping", "arguments": {}}}]
+        messages = [
+            {"role": "user", "content": "Ping the service."},
+            {"role": "assistant", "tool_calls": tool_calls},
+        ]
+        prefix = tokenizer.apply_chat_template(messages[:1], add_generation_prompt=True).input_ids
+        text = tokenizer.apply_chat_template(messages).input_ids
+        response = text[len(prefix) :]
+        parsed = parse_response(tokenizer, response)
+        assert parsed == {"role": "assistant", "content": "", "tool_calls": tool_calls}
+
     def test_parse_response_multiple_tool_calls(self, tokenizer_name):
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
         tokenizer = add_response_schema(tokenizer)
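The `parse_response` tests above all rely on the same prefix-slicing trick: render the conversation with and without the assistant turn, then take the token suffix as the response. A tokenizer-free sketch of that slicing (names and token ids are ours, purely illustrative):

```python
def response_tokens(full: list, prefix: list) -> list:
    # The assistant response is whatever follows the rendered prompt prefix;
    # this only works when the prompt rendering is a true prefix of the full rendering.
    assert full[: len(prefix)] == prefix, "prompt rendering must be a prefix of the full rendering"
    return full[len(prefix):]


prompt_ids = [101, 5, 6]        # hypothetical token ids: prompt + generation prompt
full_ids = [101, 5, 6, 42, 43]  # hypothetical token ids: whole conversation
print(response_tokens(full_ids, prompt_ids))  # [42, 43]
```

This is also why the Qwen3.5 test passes `enable_thinking=True`: if the generation prompt were rendered differently from the full conversation, the prefix assumption would break.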

tests/test_data_utils.py

Lines changed: 7 additions & 0 deletions
@@ -481,6 +481,13 @@ class TestApplyChatTemplate(TrlTestCase):
     "trl-internal-testing/tiny-Phi3ForCausalLM",
     "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
     "trl-internal-testing/tiny-Qwen3ForCausalLM",
+    pytest.param(
+        "trl-internal-testing/tiny-Qwen3_5ForConditionalGeneration",
+        marks=pytest.mark.skipif(
+            Version(transformers.__version__) < Version("5.0.0"),
+            reason="Qwen3.5 tokenizer requires transformers>=5.0.0",
+        ),
+    ),
 ]
 
 conversational_examples = [

tests/test_dpo_trainer.py

Lines changed: 7 additions & 0 deletions
@@ -1009,6 +1009,13 @@ def test_tag_added_peft(self):
             ),
         ],
     ),
+    pytest.param(
+        "trl-internal-testing/tiny-Qwen3_5ForConditionalGeneration",
+        marks=pytest.mark.skipif(
+            Version(transformers.__version__) < Version("5.2.0"),
+            reason="Qwen3.5 models were introduced in transformers-5.2.0",
+        ),
+    ),
 ],
 )
 @require_vision
