10 changes: 10 additions & 0 deletions MIGRATION.md
Member


I'm not sure I fully understand this: are we creating a migration guide for changes that were already released in the previous minor version?

Member Author


Yeah, I was a bit hesitant about including those pre-0.29 changes. I realize now that it could be misleading, since they're already in effect, so I'll take them out and keep the guide focused on what's truly new.

Member


Thanks for your clarifying explanation: that makes sense to me.

That said, I also understand the original motivation. I think having a place summarizing the recent breaking changes can be useful for users upgrading after skipping a few versions.

What about this compromise?

  • Keep MIGRATION.md focused only on the v0 to v1 migration
  • Document the v0.29 breaking changes elsewhere
    • For now, we document these in the release notes: is this sufficient?
    • What about introducing a CHANGELOG.md to track changes across versions?
    • Even though we will try to avoid breaking changes in the v1.x series, maybe we could have a clear place to document them in case any become unavoidable
  • We could optionally add a short note in the migration guide pointing users to the v0.29 release notes in case they are upgrading from an earlier version (sketched below)
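
For that last point, the pointer could be as small as this (wording is just a sketch):

```md
> [!NOTE]
> Upgrading from a version earlier than v0.29? See the v0.29 release notes for the breaking changes that shipped there.
```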

Member


+1 on the CHANGELOG.md idea.

I usually think of a migration guide as covering the move from one major version to another when there are breaking changes. So I think this file is more of a "nice to have" for users than something mandatory for now?

@@ -0,0 +1,10 @@
# Migrating from TRL v0 to v1

This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.

## Changed defaults

| Config | Parameter | v0 default | v1 default | Action needed |
| --- | --- | --- | --- | --- |
| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
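
For example, a minimal sketch of pinning the previous behavior (shown for `GRPOConfig`; the same applies to `RLOOConfig`):

```python
from trl import GRPOConfig

# v0 implied vllm_mode="server" whenever use_vllm=True.
# In v1, set it explicitly to keep connecting to a separate vLLM server.
training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="server",  # the v1 default is "colocate"
)
```
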
32 changes: 16 additions & 16 deletions docs/source/grpo_trainer.md
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocate mode**.
> [!TIP]
> By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling).

#### 🔌 Option 1: Server mode
#### Option 1: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True, # vllm_mode="colocate" by default
)
```

#### Option 2: Server mode

In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
vllm_mode="server",
)
```

> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
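
For example, a minimal sketch of splitting GPUs between the two processes (the GPU indices and the training entrypoint `train.py` are illustrative):

```sh
# Terminal 1: vLLM server on GPU 0
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-7B

# Terminal 2: training on the remaining GPUs
CUDA_VISIBLE_DEVICES=1,2,3 accelerate launch train.py
```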

#### 🧩 Option 2: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```

> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
>
@@ -349,6 +348,7 @@ def main():
training_args = GRPOConfig(
per_device_train_batch_size=4,
use_vllm=True,
vllm_mode="server",
vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."), # from ip-X-X-X-X to X.X.X.X
)

32 changes: 16 additions & 16 deletions docs/source/rloo_trainer.md
@@ -161,7 +161,20 @@ pip install trl[vllm]

We support two ways of using vLLM during training: **server mode** and **colocate mode**.

#### 🔌 Option 1: Server mode
#### Option 1: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

```python
from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True, # vllm_mode="colocate" by default
)
```

#### Option 2: Server mode

In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -179,27 +192,13 @@ In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
vllm_mode="server",
)
```

> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.

#### 🧩 Option 2: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.

```python
from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```

> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
>
@@ -278,6 +277,7 @@ def main():
per_device_train_batch_size=4,
bf16=True,
use_vllm=True,
vllm_mode="server",
vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."), # from ip-X-X-X-X to X.X.X.X
)

6 changes: 3 additions & 3 deletions docs/source/speeding_up_training.md
@@ -27,7 +27,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(..., use_vllm=True)
training_args = OnlineDPOConfig(..., use_vllm=True, vllm_mode="server")
```

</hfoption>
@@ -44,7 +44,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., use_vllm=True)
training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
```

You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
@@ -78,7 +78,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl import RLOOConfig

training_args = RLOOConfig(..., use_vllm=True)
training_args = RLOOConfig(..., use_vllm=True, vllm_mode="server")
```

You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
49 changes: 22 additions & 27 deletions docs/source/vllm_integration.md
@@ -52,7 +52,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
model="Qwen/Qwen2.5-7B",
args=GRPOConfig(use_vllm=True),
args=GRPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -72,7 +72,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = OnlineDPOTrainer(
model="Qwen/Qwen2.5-7B",
args=OnlineDPOConfig(use_vllm=True),
args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -92,7 +92,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = NashMDTrainer(
model="Qwen/Qwen2.5-7B",
args=NashMDConfig(use_vllm=True),
args=NashMDConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -112,7 +112,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = XPOTrainer(
model="Qwen/Qwen2.5-7B",
args=XPOConfig(use_vllm=True),
args=XPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -132,7 +132,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = RLOOTrainer(
model="Qwen/Qwen2.5-7B",
args=RLOOConfig(use_vllm=True),
args=RLOOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -276,12 +276,12 @@ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/

### Modes of Using vLLM During Training

TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
TRL supports **two modes** for integrating vLLM during training: **colocate mode** (default) and **server mode**.

#### Server Mode
#### Colocate Mode

In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.
In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

Example configuration:

@@ -293,8 +293,7 @@ from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -306,8 +305,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -319,8 +317,7 @@ from trl.experimental.nash_md import NashMDConfig

training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -332,8 +329,7 @@ from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -345,18 +341,17 @@ from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

</hfoption>
</hfoptions>

#### Colocate Mode
#### Server Mode

In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.

Example configuration:

@@ -369,7 +364,7 @@ from trl import GRPOConfig
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -382,7 +377,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -395,7 +390,7 @@ from trl.experimental.nash_md import NashMDConfig
training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -408,7 +403,7 @@ from trl.experimental.xpo import XPOConfig
training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -421,7 +416,7 @@ from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

5 changes: 3 additions & 2 deletions tests/experimental/test_online_dpo_trainer.py
@@ -241,7 +241,7 @@ def test_training_with_judge(self, config_name):
@require_torch_accelerator
@require_vllm
@pytest.mark.slow
def test_training_with_vllm(self, config_name):
def test_training_with_vllm_server(self, config_name):
def cleanup_vllm_communicator(trainer):
"""Clean up vLLM communicator to avoid conflicts between test runs"""
try:
@@ -258,6 +258,7 @@ def cleanup_vllm_communicator(trainer):
training_args = OnlineDPOConfig(
output_dir=self.tmp_dir,
use_vllm=True,
vllm_mode="server",
vllm_gpu_memory_utilization=0.2,
report_to="none",
)
@@ -351,7 +352,7 @@ def test_vllm_config_validation(self):

# Test default values
config = OnlineDPOConfig()
assert config.vllm_mode == "server"
assert config.vllm_mode == "colocate"
assert config.vllm_server_base_url is None
assert config.vllm_server_host == "0.0.0.0"
assert config.vllm_server_port == 8000
4 changes: 2 additions & 2 deletions trl/experimental/gold/gold_config.py
@@ -70,7 +70,7 @@ class GOLDConfig(SFTConfig):
Whether to skip EOS token for teacher in ULD loss computation.
use_vllm (`bool`, *optional*, defaults to `False`):
Whether to use vLLM for generating completions from the student model. Requires `vllm` to be installed.
vllm_mode (`str`, *optional*, defaults to `"server"`):
vllm_mode (`str`, *optional*, defaults to `"colocate"`):
Mode for student vLLM integration. Either `"server"` (connect to a running TRL vLLM server) or `"colocate"`
(run vLLM in the same process).
vllm_server_host (`str`, *optional*, defaults to `"0.0.0.0"`):
@@ -276,7 +276,7 @@ class GOLDConfig(SFTConfig):
metadata={"help": "Whether to use vLLM for generating completions. Requires `vllm` to be installed."},
)
vllm_mode: str = field(
default="server",
default="colocate",
metadata={
"help": 'Mode for vLLM integration. Either "server" (connect to a running TRL vLLM server) or "colocate" (run vLLM in the same process).'
},
4 changes: 2 additions & 2 deletions trl/experimental/online_dpo/online_dpo_config.py
@@ -101,7 +101,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
Model implementation to use for vLLM. Must be one of `"transformers"` or `"vllm"`. `"transformers"`: Use
the `transformers` backend for model implementation. `"vllm"`: Use the `vllm` library for model
implementation.
vllm_mode (`str`, *optional*, defaults to `"server"`):
vllm_mode (`str`, *optional*, defaults to `"colocate"`):
Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `"server"` or
`"colocate"`.

@@ -303,7 +303,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
},
)
vllm_mode: str = field(
default="server",
default="colocate",
metadata={
"help": "Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `'server'` or "
"`'colocate'`. `'server'`: The trainer will send generation requests to a separate vLLM server. Make sure "