
Commit 8b9bed3

Merge branch 'main' into fix/stale_logprob_docstrings
2 parents e56abe0 + c0eabc4 commit 8b9bed3


69 files changed: +3536 additions, -1051 deletions

AGENTS.md

Lines changed: 96 additions & 0 deletions
# AGENTS.md

## Repository-specific guidance

### Main code vs experimental code

The repository is separated into **main code** and **experimental code**.

* **Main code** should remain stable, consistent, and well-tested.
* **Experimental code** may be less stable and may contain inconsistent patterns or limited testing.

Small non-invasive improvements that make experimental code more consistent with the main codebase are encouraged, but avoid large refactors.

### Paper implementations

If a PR implements a method, algorithm, or training approach from a research paper, it must also add a corresponding subsection to `paper_index.md`.

When reviewing such PRs, ensure that `paper_index.md` was updated.

### Code duplication and consistency

Trainers in this repository are **self-contained by design**. Shared logic (generation, reward computation, metric logging, weight syncing, etc.) is deliberately duplicated across trainers rather than abstracted into a shared base class.

This is intentional: each trainer must be readable, modifiable, and evolvable in isolation. The base class (`_BaseTrainer`) provides only minimal utilities (model card generation). Everything else — vLLM generation paths, `_get_per_token_logps_and_entropies`, `_calculate_rewards`, `_prepare_inputs`, metric logging — is copied in full.

**The tradeoff**: duplication is accepted, but **consistency is mandatory**. When the same logic appears in multiple trainers, the duplicated blocks must stay aligned:

- Same variable names (`self._last_loaded_step`, `self._metrics[mode]`, …)
- Same control flow structure (if/elif/else branches in the same order)
- Same comments (word-for-word when the logic is identical)
- Divergences only where the trainer's semantics require it (e.g., GRPO extracts logprobs from vLLM, RLOO discards them)

**Consistency over correctness**: this is a strong requirement. When duplicating code, reproduce it exactly — even if you believe the original has a bug. Do not silently fix the issue in your copy. Instead, keep your copy consistent with the source and report the problem so it can be fixed across all trainers in a dedicated PR. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep.

**When modifying duplicated code**: if you change a pattern that exists in multiple trainers (e.g., the vLLM generation path in `_generate_single_turn`), apply the same change to all other trainers. A fix in GRPO often implies the same fix in RLOO, and vice versa. Not propagating a change is a bug.

**When reviewing**: if a PR touches duplicated logic, verify that all copies are updated consistently. A common mistake is fixing one trainer and forgetting the others.
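As an illustration, here is a hypothetical pair of trainers (names reuse the real trainer classes, but the logic is invented for this sketch and is not actual TRL code) in which a weight-sync guard is duplicated verbatim rather than factored into a base class:

```python
class GRPOTrainer:
    def __init__(self):
        self._last_loaded_step = -1
        self.synced_steps = []

    def maybe_sync(self, step):
        # Duplicated block -- keep identical (names, flow, comments) to RLOOTrainer.maybe_sync
        if step != self._last_loaded_step:
            self.synced_steps.append(step)
            self._last_loaded_step = step


class RLOOTrainer:
    def __init__(self):
        self._last_loaded_step = -1
        self.synced_steps = []

    def maybe_sync(self, step):
        # Duplicated block -- keep identical (names, flow, comments) to GRPOTrainer.maybe_sync
        if step != self._last_loaded_step:
            self.synced_steps.append(step)
            self._last_loaded_step = step
```

A fix to `maybe_sync` in one class must be applied, word for word, to the other in the same PR.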
### Simplicity

This codebase values **leanness and simplicity above all**. Prefer straightforward, inline code over abstractions, helpers, or utilities — even at the cost of some robustness or generality.

Concretely:

- Do not add layers of indirection (registries, factory patterns, plugin systems). A contributor should be able to read a trainer top to bottom and understand the full flow.
- Prefer a simple implementation that covers 90% of cases over a complex one that covers 100%. A function that handles the common path in 20 lines is better than a catch-all that handles every edge case in 80.
- Do not add defensive code, fallback paths, or configuration options "just in case". Only handle cases that actually exist today.
- When in doubt, prefer less code. Every new function, parameter, or branch is maintenance burden. The best abstraction is often no abstraction.

## Documentation

### Docstrings

Docstrings must follow the repository format below. Do **not** convert docstrings to other styles (Google, NumPy, etc.).

Rules:

* Types appear in backticks inside parentheses: (`str`)
* Optional parameters are marked with `*optional*`
* Defaults are written as: `defaults to <value>`
* When the default is `None`, prefer ```(`str`, *optional*)``` instead of ```(`str` or `None`, *optional*, defaults to `None`)```
* Union types use `or`: `str` or `None`
* References to classes use the format: [`~transformers.PreTrainedModel`]
* Class docstrings may group parameters using headers such as: `> Parameters for X:`

Example:

````python
def method(self, param1: str, param2: int = 1, param3: float | None = None):
    """
    Brief one-line description of what this does.

    Args:
        param1 (`str`):
            Description of required param.
        param2 (`int`, *optional*, defaults to `1`):
            Description of optional param with default.
        param3 (`float`, *optional*):
            Description of optional param without explicit default.

    Returns:
        `dict` with keys:
        - `key1` (`list[int]`):
            Description of this key.

    Examples:

    ```python
    >>> my_func("hello")
    ```
    """
````

### Links to papers

When linking to papers, use `https://huggingface.co/papers/<id>` instead of `https://arxiv.org/abs/<id>` (same ID suffix system).
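The rule amounts to a host swap with the paper ID preserved; a throwaway helper (hypothetical, not part of the repo) makes the mapping concrete:

```python
def to_hf_paper_url(arxiv_url: str) -> str:
    # Replace the arXiv prefix with the Hugging Face papers prefix;
    # the paper ID suffix is identical on both hosts.
    return arxiv_url.replace("https://arxiv.org/abs/", "https://huggingface.co/papers/")

print(to_hf_paper_url("https://arxiv.org/abs/2404.10830"))
# → https://huggingface.co/papers/2404.10830
```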

MIGRATION.md

Lines changed: 20 additions & 0 deletions
# Migrating from TRL v0 to v1

This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.

## Changed defaults

| Config | Parameter | v0 default | v1 default | Action needed |
| --- | --- | --- | --- | --- |
| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
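To keep the v0 behavior after upgrading, pin the mode explicitly (this mirrors the pattern shown in the GRPO trainer docs; a sketch, not a complete training config):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,  # your other training arguments
    use_vllm=True,
    vllm_mode="server",  # pin the v0 default; omit to get the new "colocate" default
)
```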
## Renamed options

| Config | Parameter | v0 value | v1 value | Action needed |
| --- | --- | --- | --- | --- |
| `SFTConfig` | `packing_strategy` | `"bfd-requeue"` | `"bfd_split"` | Replace `packing_strategy="bfd-requeue"` with `packing_strategy="bfd_split"`. The old value will still be accepted for a few versions but will be removed in a future release. |
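A minimal before/after sketch (mirroring the `bfd_split` usage shown in `paper_index.md`):

```python
from trl import SFTConfig

# v0 (deprecated value, still accepted for now):
# training_args = SFTConfig(packing=True, packing_strategy="bfd-requeue")

# v1:
training_args = SFTConfig(
    packing=True,
    packing_strategy="bfd_split",
)
```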
## Migrating from an earlier version

If you are migrating from a version earlier than v0.29, see the [release notes](https://github.com/huggingface/trl/releases) for v0.29 and earlier for version-specific changes.

README.md

Lines changed: 5 additions & 2 deletions
@@ -1,7 +1,10 @@
-# TRL - Transformer Reinforcement Learning
+# TRL - Transformers Reinforcement Learning

 <div style="text-align: center">
-  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png" alt="TRL Banner">
+  <picture>
+    <source media="(prefers-color-scheme: light)" srcset="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/TRL%20banner%20light.png">
+    <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png" alt="TRL Banner">
+  </picture>
 </div>

 <hr> <br>

docs/source/example_overview.md

Lines changed: 2 additions & 0 deletions
@@ -37,6 +37,7 @@ These notebooks are easier to run and are designed for quick experimentation wit
 | [`grpo_ministral3_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_ministral3_vl.ipynb) | GRPO Ministral 3 with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_ministral3_vl.ipynb) |
 | [`openenv_sudoku_grpo.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/openenv_sudoku_grpo.ipynb) | GRPO to play Sudoku on an OpenEnv environment | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/openenv_sudoku_grpo.ipynb) |
 | [`openenv_wordle_grpo.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/openenv_wordle_grpo.ipynb) | GRPO to play Wordle on an OpenEnv environment | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb) |
+| [`sft_nemotron_3.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_nemotron_3.ipynb) | SFT with LoRA on NVIDIA Nemotron 3 models | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_nemotron_3.ipynb) |
 | [`sft_trl_lora_qlora.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_trl_lora_qlora.ipynb) | Supervised Fine-Tuning (SFT) using QLoRA on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb) |
 | [`sft_qwen_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_qwen_vl.ipynb) | Supervised Fine-Tuning (SFT) Qwen3-VL with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_qwen_vl.ipynb) |
 | [`sft_tool_calling.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_tool_calling.ipynb) | Teaching tool calling to a model without native tool-calling support using SFT with QLoRA | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb) |
@@ -80,6 +81,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | [`examples/scripts/rloo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo.py) | This script shows how to use the [`RLOOTrainer`] to fine-tune a model to improve its ability to solve math questions. |
 | [`examples/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a model. |
 | [`examples/scripts/sft_gemma3.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_gemma3.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Gemma 3 model. |
+| [`examples/scripts/sft_nemotron_3.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_nemotron_3.py) | This script shows how to use the [`SFTTrainer`] to fine-tune an NVIDIA Nemotron 3 model. |
 | [`examples/scripts/sft_tiny_aya_tool_calling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_tiny_aya_tool_calling.py) | This script shows how to use the [`SFTTrainer`] to teach tool calling to a model without native tool-calling support using the [bebechien/SimpleToolCalling](https://huggingface.co/datasets/bebechien/SimpleToolCalling) dataset. |
 | [`examples/scripts/sft_video_llm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_video_llm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Video Language Model. |
 | [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Vision Language Model in a chat setting. The script has only been tested with [LLaVA 1.5](https://huggingface.co/llava-hf/llava-1.5-7b-hf), [LLaVA 1.6](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf), and [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) models, so users may see unexpected behaviour in other model architectures. |

docs/source/grpo_trainer.md

Lines changed: 18 additions & 17 deletions
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocat
 > [!TIP]
 > By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling)

-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode

 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and comm
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```

 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.

-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import GRPOConfig
-
-training_args = GRPOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
 >

@@ -349,6 +348,7 @@ def main():
 training_args = GRPOConfig(
     per_device_train_batch_size=4,
     use_vllm=True,
+    vllm_mode="server",
     vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
 )

@@ -695,7 +695,8 @@ trainer.train()
 Tested with:

-- **Qwen3** — e.g., `Qwen/Qwen3-0.6B`
+- [**Qwen3**](https://huggingface.co/collections/Qwen/qwen3) — e.g., `Qwen/Qwen3-0.6B`
+- [**Qwen3.5**](https://huggingface.co/collections/Qwen/qwen35) — e.g., `Qwen/Qwen3.5-2B`

 > [!TIP]
 > Compatibility with all LLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.

docs/source/index.md

Lines changed: 5 additions & 2 deletions
@@ -1,8 +1,11 @@
 <div style="text-align: center">
-  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png">
+  <picture>
+    <source media="(prefers-color-scheme: light)" srcset="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_light.png">
+    <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png">
+  </picture>
 </div>

-# TRL - Transformer Reinforcement Learning
+# TRL - Transformers Reinforcement Learning

 TRL is a full stack library where we provide a set of tools to train transformer language models with methods like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more.
 The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).

docs/source/openenv.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # OpenEnv Integration for Training LLMs with Environments

-[OpenEnv](https://github.com/meta-pytorch/OpenEnv) is an open-source framework from Meta's PyTorch team for defining, deploying, and interacting with environments in reinforcement learning (RL) and agentic workflows. It offers [Gymnasium-style APIs](https://gymnasium.farama.org) (e.g., `reset()` and `step()`) to interface with environments in a standard manner, and supports running these environments as backend servers (for example, via HTTP or containerised execution). You can find a collection of ready-to-use OpenEnv environments on the [Hugging Face Hub](https://huggingface.co/collections/openenv/environment-hub).
+[OpenEnv](https://github.com/meta-pytorch/OpenEnv) is an open-source framework from Meta's PyTorch team for defining, deploying, and interacting with environments in reinforcement learning (RL) and agentic workflows. It offers [Gymnasium-style APIs](https://gymnasium.farama.org) (e.g., `reset()` and `step()`) to interface with environments in a standard manner, and supports running these environments as backend servers (for example, via HTTP or containerised execution). You can find a collection of ready-to-use OpenEnv environments on the [Hugging Face Hub](https://huggingface.co/collections/openenv/openenv-environment-hub).

 In this guide, we’ll focus on **how to integrate OpenEnv with TRL**, but feel free to explore the links above to dive deeper into OpenEnv itself.
docs/source/paper_index.md

Lines changed: 16 additions & 0 deletions
@@ -1140,6 +1140,22 @@ SFTConfig(
 )
 ```

+### Fewer Truncations Improve Language Modeling
+
+**📜 Paper**: https://huggingface.co/papers/2404.10830
+
+The paper shows that the standard concatenate-then-split preprocessing (`packing_strategy="wrapped"`) used for LLM training causes many documents to be arbitrarily truncated, which harms learning. It proposes packing document chunks into context windows using a Best-Fit Decreasing bin-packing algorithm, greatly reducing truncation while keeping high token utilization and improving model performance. TRL implements this as the `"bfd_split"` packing strategy in [`SFTConfig`]. For more details on packing, see the [SFT documentation](sft_trainer#packing).
+
+```python
+from trl import SFTConfig
+
+training_args = SFTConfig(
+    packing=True,
+    packing_strategy="bfd_split",
+    max_length=4096,
+)
+```
+
 ### Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

 **📜 Paper**: https://huggingface.co/papers/1910.10683

0 commit comments
