
Commit 5147625

Merge branch 'main' into fix-retokenization-tool-loop
2 parents f3f0f8d + c0eabc4 commit 5147625

File tree

11 files changed: +94 -77 lines


MIGRATION.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Migrating from TRL v0 to v1
+
+This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.
+
+## Changed defaults
+
+| Config | Parameter | v0 default | v1 default | Action needed |
+| --- | --- | --- | --- | --- |
+| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
+| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
+
+## Renamed options
+
+| Config | Parameter | v0 value | v1 value | Action needed |
+| --- | --- | --- | --- | --- |
+| `SFTConfig` | `packing` | `"bfd-requeue"` | `"bfd_split"` | Replace `packing="bfd-requeue"` with `packing="bfd_split"`. The old value will still be accepted for a few versions but will be removed in a future release. |
+
+## Migrating from an earlier version
+
+Depending on which version you're migrating from, refer to the [release notes](https://github.com/huggingface/trl/releases) for v0.29 and earlier for version-specific changes.
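
As a quick illustration of the defaults change above, pinning the mode explicitly preserves the v0 behavior. A minimal sketch, assuming TRL v1 with `GRPOConfig` importable from `trl`:

```python
from trl import GRPOConfig

# In v1, use_vllm=True falls back to vllm_mode="colocate".
# Set vllm_mode="server" explicitly to keep the v0 server behavior.
training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="server",
)
```

The same applies to `RLOOConfig`; only the config class name changes.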

docs/source/grpo_trainer.md

Lines changed: 16 additions & 16 deletions
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocate mode**.
 > [!TIP]
 > By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling)
-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode
 
 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 
@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```
 
 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
 
-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import GRPOConfig
-
-training_args = GRPOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
 >
@@ -349,6 +348,7 @@ def main():
     training_args = GRPOConfig(
         per_device_train_batch_size=4,
         use_vllm=True,
+        vllm_mode="server",
         vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
     )
 
docs/source/rloo_trainer.md

Lines changed: 16 additions & 16 deletions
@@ -161,7 +161,20 @@ pip install trl[vllm]
 
 We support two ways of using vLLM during training: **server mode** and **colocate mode**.
 
-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import RLOOConfig
+
+training_args = RLOOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode
 
 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 
@@ -179,27 +192,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 training_args = RLOOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```
 
 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
 
-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import RLOOConfig
-
-training_args = RLOOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
 >
@@ -278,6 +277,7 @@ def main():
         per_device_train_batch_size=4,
         bf16=True,
         use_vllm=True,
+        vllm_mode="server",
         vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
     )
 
docs/source/speeding_up_training.md

Lines changed: 3 additions & 3 deletions
@@ -27,7 +27,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 ```python
 from trl.experimental.online_dpo import OnlineDPOConfig
 
-training_args = OnlineDPOConfig(..., use_vllm=True)
+training_args = OnlineDPOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 </hfoption>
@@ -44,7 +44,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 ```python
 from trl import GRPOConfig
 
-training_args = GRPOConfig(..., use_vllm=True)
+training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
@@ -78,7 +78,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 ```python
 from trl import RLOOConfig
 
-training_args = RLOOConfig(..., use_vllm=True)
+training_args = RLOOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
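
Server mode assumes a vLLM server is already running before training starts. A sketch of a typical launch, based on the `trl vllm-serve` invocation that appears in the vllm_integration.md diff in this commit (the GPU index and model name are illustrative):

```shell
# Pin the server to a GPU the trainer will not use, then start serving.
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-7B
```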

docs/source/vllm_integration.md

Lines changed: 22 additions & 27 deletions
@@ -52,7 +52,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = GRPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=GRPOConfig(use_vllm=True),
+    args=GRPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -72,7 +72,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = OnlineDPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=OnlineDPOConfig(use_vllm=True),
+    args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -92,7 +92,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = NashMDTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=NashMDConfig(use_vllm=True),
+    args=NashMDConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -112,7 +112,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = XPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=XPOConfig(use_vllm=True),
+    args=XPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -132,7 +132,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = RLOOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=RLOOConfig(use_vllm=True),
+    args=RLOOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -276,12 +276,12 @@ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/
 
 ### Modes of Using vLLM During Training
 
-TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
+TRL supports **two modes** for integrating vLLM during training: **colocate mode** (default) and **server mode**.
 
-#### Server Mode
+#### Colocate Mode
 
-In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
-This setup is ideal if you have GPUs dedicated to inference.
+In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
+This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
 
 Example configuration:
 
@@ -293,8 +293,7 @@ from trl import GRPOConfig
 
 training_args = GRPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -306,8 +305,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
 
 training_args = OnlineDPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -319,8 +317,7 @@ from trl.experimental.nash_md import NashMDConfig
 
 training_args = NashMDConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -332,8 +329,7 @@ from trl.experimental.xpo import XPOConfig
 
 training_args = XPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -345,18 +341,17 @@ from trl import RLOOConfig
 
 training_args = RLOOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
 </hfoption>
 </hfoptions>
 
-#### Colocate Mode
+#### Server Mode
 
-In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
-This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
+In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
+This setup is ideal if you have GPUs dedicated to inference.
 
 Example configuration:
 
@@ -369,7 +364,7 @@ from trl import GRPOConfig
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -382,7 +377,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
 training_args = OnlineDPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -395,7 +390,7 @@ from trl.experimental.nash_md import NashMDConfig
 training_args = NashMDConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -408,7 +403,7 @@ from trl.experimental.xpo import XPOConfig
 training_args = XPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -421,7 +416,7 @@ from trl import RLOOConfig
 training_args = RLOOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 

tests/experimental/test_online_dpo_trainer.py

Lines changed: 3 additions & 2 deletions
@@ -241,7 +241,7 @@ def test_training_with_judge(self, config_name):
     @require_torch_accelerator
     @require_vllm
     @pytest.mark.slow
-    def test_training_with_vllm(self, config_name):
+    def test_training_with_vllm_server(self, config_name):
         def cleanup_vllm_communicator(trainer):
             """Clean up vLLM communicator to avoid conflicts between test runs"""
             try:
@@ -258,6 +258,7 @@ def cleanup_vllm_communicator(trainer):
         training_args = OnlineDPOConfig(
             output_dir=self.tmp_dir,
             use_vllm=True,
+            vllm_mode="server",
            vllm_gpu_memory_utilization=0.2,
             report_to="none",
         )
@@ -351,7 +352,7 @@ def test_vllm_config_validation(self):
 
         # Test default values
         config = OnlineDPOConfig()
-        assert config.vllm_mode == "server"
+        assert config.vllm_mode == "colocate"
         assert config.vllm_server_base_url is None
         assert config.vllm_server_host == "0.0.0.0"
         assert config.vllm_server_port == 8000
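
The default-value check in the test above can be mirrored with a plain dataclass for readers without TRL installed. `VLLMSettings` below is a hypothetical stand-in for the vLLM-related fields of `OnlineDPOConfig` after this commit, not the real class:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in mirroring OnlineDPOConfig's vLLM-related defaults
# after this commit; the real class lives in trl.experimental.online_dpo.
@dataclass
class VLLMSettings:
    vllm_mode: str = "colocate"  # changed from "server" in this commit
    vllm_server_base_url: Optional[str] = None
    vllm_server_host: str = "0.0.0.0"
    vllm_server_port: int = 8000

config = VLLMSettings()
assert config.vllm_mode == "colocate"
assert config.vllm_server_base_url is None
```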

trl/experimental/gold/gold_config.py

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@ class GOLDConfig(SFTConfig):
             Whether to skip EOS token for teacher in ULD loss computation.
         use_vllm (`bool`, *optional*, defaults to `False`):
             Whether to use vLLM for generating completions from the student model. Requires `vllm` to be installed.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode for student vLLM integration. Either `"server"` (connect to a running TRL vLLM server) or `"colocate"`
             (run vLLM in the same process).
         vllm_server_host (`str`, *optional*, defaults to `"0.0.0.0"`):
@@ -274,7 +274,7 @@ class GOLDConfig(SFTConfig):
         metadata={"help": "Whether to use vLLM for generating completions. Requires `vllm` to be installed."},
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": 'Mode for vLLM integration. Either "server" (connect to a running TRL vLLM server) or "colocate" (run vLLM in the same process).'
         },

trl/experimental/online_dpo/online_dpo_config.py

Lines changed: 2 additions & 2 deletions
@@ -101,7 +101,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
             Model implementation to use for vLLM. Must be one of `"transformers"` or `"vllm"`. `"transformers"`: Use
             the `transformers` backend for model implementation. `"vllm"`: Use the `vllm` library for model
             implementation.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `"server"` or
             `"colocate"`.
 
@@ -303,7 +303,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
         },
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": "Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `'server'` or "
             "`'colocate'`. `'server'`: The trainer will send generation requests to a separate vLLM server. Make sure "
