
Commit 5147625

Merge branch 'main' into fix-retokenization-tool-loop
2 parents f3f0f8d + c0eabc4 commit 5147625

File tree

11 files changed: +94 -77 lines


MIGRATION.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Migrating from TRL v0 to v1
+
+This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.
+
+## Changed defaults
+
+| Config | Parameter | v0 default | v1 default | Action needed |
+| --- | --- | --- | --- | --- |
+| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
+| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
+
+## Renamed options
+
+| Config | Parameter | v0 value | v1 value | Action needed |
+| --- | --- | --- | --- | --- |
+| `SFTConfig` | `packing` | `"bfd-requeue"` | `"bfd_split"` | Replace `packing="bfd-requeue"` with `packing="bfd_split"`. The old value will still be accepted for a few versions but will be removed in a future release. |
+
+## Migrating from an earlier version
+
+Depending on which version you're migrating from, refer to the [release notes](https://github.com/huggingface/trl/releases) for v0.29 and earlier for version-specific changes.
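
As a quick illustration of the defaults change above, pinning the mode explicitly preserves the v0 behavior. A minimal sketch, assuming TRL v1 with `GRPOConfig` importable from `trl`:

```python
from trl import GRPOConfig

# In v1, use_vllm=True falls back to vllm_mode="colocate".
# Set vllm_mode="server" explicitly to keep the v0 server behavior.
training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="server",
)
```

The same applies to `RLOOConfig`; only the config class name changes.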

docs/source/grpo_trainer.md

Lines changed: 16 additions & 16 deletions
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocate mode**.
 > [!TIP]
 > By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling)
-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode
 
 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 
@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```
 
 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
 
-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import GRPOConfig
-
-training_args = GRPOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
 >
@@ -349,6 +348,7 @@ def main():
     training_args = GRPOConfig(
         per_device_train_batch_size=4,
         use_vllm=True,
+        vllm_mode="server",
         vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
     )
 
docs/source/rloo_trainer.md

Lines changed: 16 additions & 16 deletions
@@ -161,7 +161,20 @@ pip install trl[vllm]
 
 We support two ways of using vLLM during training: **server mode** and **colocate mode**.
 
-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import RLOOConfig
+
+training_args = RLOOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode
 
 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 
@@ -179,27 +192,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 training_args = RLOOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```
 
 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
 
-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import RLOOConfig
-
-training_args = RLOOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
 >
@@ -278,6 +277,7 @@ def main():
         per_device_train_batch_size=4,
         bf16=True,
         use_vllm=True,
+        vllm_mode="server",
         vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
     )
 
docs/source/speeding_up_training.md

Lines changed: 3 additions & 3 deletions
@@ -27,7 +27,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 ```python
 from trl.experimental.online_dpo import OnlineDPOConfig
 
-training_args = OnlineDPOConfig(..., use_vllm=True)
+training_args = OnlineDPOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 </hfoption>
@@ -44,7 +44,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 ```python
 from trl import GRPOConfig
 
-training_args = GRPOConfig(..., use_vllm=True)
+training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
@@ -78,7 +78,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 ```python
 from trl import RLOOConfig
 
-training_args = RLOOConfig(..., use_vllm=True)
+training_args = RLOOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
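
Server mode assumes a vLLM server is already running before training starts. A sketch of a typical launch, based on the `trl vllm-serve` invocation that appears in the vllm_integration.md diff in this commit (the GPU index and model name are illustrative):

```shell
# Pin the server to a GPU the trainer will not use, then start serving.
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-7B
```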

docs/source/vllm_integration.md

Lines changed: 22 additions & 27 deletions
@@ -52,7 +52,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = GRPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=GRPOConfig(use_vllm=True),
+    args=GRPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -72,7 +72,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = OnlineDPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=OnlineDPOConfig(use_vllm=True),
+    args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -92,7 +92,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = NashMDTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=NashMDConfig(use_vllm=True),
+    args=NashMDConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -112,7 +112,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = XPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=XPOConfig(use_vllm=True),
+    args=XPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -132,7 +132,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = RLOOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=RLOOConfig(use_vllm=True),
+    args=RLOOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -276,12 +276,12 @@ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/
 
 ### Modes of Using vLLM During Training
 
-TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
+TRL supports **two modes** for integrating vLLM during training: **colocate mode** (default) and **server mode**.
 
-#### Server Mode
+#### Colocate Mode
 
-In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
-This setup is ideal if you have GPUs dedicated to inference.
+In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
+This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
 
 Example configuration:
 
@@ -293,8 +293,7 @@ from trl import GRPOConfig
 
 training_args = GRPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -306,8 +305,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
 
 training_args = OnlineDPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -319,8 +317,7 @@ from trl.experimental.nash_md import NashMDConfig
 
 training_args = NashMDConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -332,8 +329,7 @@ from trl.experimental.xpo import XPOConfig
 
 training_args = XPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -345,18 +341,17 @@ from trl import RLOOConfig
 
 training_args = RLOOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
 </hfoption>
 </hfoptions>
 
-#### Colocate Mode
+#### Server Mode
 
-In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
-This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
+In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
+This setup is ideal if you have GPUs dedicated to inference.
 
 Example configuration:
 
@@ -369,7 +364,7 @@ from trl import GRPOConfig
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -382,7 +377,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
 training_args = OnlineDPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -395,7 +390,7 @@ from trl.experimental.nash_md import NashMDConfig
 training_args = NashMDConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -408,7 +403,7 @@ from trl.experimental.xpo import XPOConfig
 training_args = XPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -421,7 +416,7 @@ from trl import RLOOConfig
 training_args = RLOOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 

tests/experimental/test_online_dpo_trainer.py

Lines changed: 3 additions & 2 deletions
@@ -241,7 +241,7 @@ def test_training_with_judge(self, config_name):
     @require_torch_accelerator
     @require_vllm
     @pytest.mark.slow
-    def test_training_with_vllm(self, config_name):
+    def test_training_with_vllm_server(self, config_name):
         def cleanup_vllm_communicator(trainer):
             """Clean up vLLM communicator to avoid conflicts between test runs"""
             try:
@@ -258,6 +258,7 @@ def cleanup_vllm_communicator(trainer):
         training_args = OnlineDPOConfig(
             output_dir=self.tmp_dir,
             use_vllm=True,
+            vllm_mode="server",
            vllm_gpu_memory_utilization=0.2,
             report_to="none",
         )
@@ -351,7 +352,7 @@ def test_vllm_config_validation(self):
 
         # Test default values
         config = OnlineDPOConfig()
-        assert config.vllm_mode == "server"
+        assert config.vllm_mode == "colocate"
         assert config.vllm_server_base_url is None
         assert config.vllm_server_host == "0.0.0.0"
         assert config.vllm_server_port == 8000
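
The default-value check in the test above can be mirrored with a plain dataclass for readers without TRL installed. `VLLMSettings` below is a hypothetical stand-in for the vLLM-related fields of `OnlineDPOConfig` after this commit, not the real class:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in mirroring OnlineDPOConfig's vLLM-related defaults
# after this commit; the real class lives in trl.experimental.online_dpo.
@dataclass
class VLLMSettings:
    vllm_mode: str = "colocate"  # changed from "server" in this commit
    vllm_server_base_url: Optional[str] = None
    vllm_server_host: str = "0.0.0.0"
    vllm_server_port: int = 8000

config = VLLMSettings()
assert config.vllm_mode == "colocate"
assert config.vllm_server_base_url is None
```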

trl/experimental/gold/gold_config.py

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@ class GOLDConfig(SFTConfig):
             Whether to skip EOS token for teacher in ULD loss computation.
         use_vllm (`bool`, *optional*, defaults to `False`):
             Whether to use vLLM for generating completions from the student model. Requires `vllm` to be installed.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode for student vLLM integration. Either `"server"` (connect to a running TRL vLLM server) or `"colocate"`
             (run vLLM in the same process).
         vllm_server_host (`str`, *optional*, defaults to `"0.0.0.0"`):
@@ -274,7 +274,7 @@ class GOLDConfig(SFTConfig):
         metadata={"help": "Whether to use vLLM for generating completions. Requires `vllm` to be installed."},
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": 'Mode for vLLM integration. Either "server" (connect to a running TRL vLLM server) or "colocate" (run vLLM in the same process).'
         },

trl/experimental/online_dpo/online_dpo_config.py

Lines changed: 2 additions & 2 deletions
@@ -101,7 +101,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
             Model implementation to use for vLLM. Must be one of `"transformers"` or `"vllm"`. `"transformers"`: Use
             the `transformers` backend for model implementation. `"vllm"`: Use the `vllm` library for model
             implementation.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `"server"` or
             `"colocate"`.
 
@@ -303,7 +303,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
         },
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": "Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `'server'` or "
             "`'colocate'`. `'server'`: The trainer will send generation requests to a separate vLLM server. Make sure "
