Vision Language Models (VLMs) are getting stronger, but *aligning* them to human preferences still matters. In TRL, we already showed how to post-train VLMs with [**Supervised Fine-Tuning (SFT)**](https://huggingface.co/docs/trl/main/en/training_vlm_sft) and [**Direct Preference Optimization (DPO)**](https://huggingface.co/learn/cookbook/fine_tuning_vlm_dpo_smolvlm_instruct). This time, we’re going further.

**tl;dr** We have added new multimodal alignment methods to TRL: **Group Relative Policy Optimization (GRPO)**, its variant **Group Sequence Policy Optimization (GSPO)**, and **Mixed Preference Optimization (MPO)**. All of them let you go beyond pairwise DPO, extracting more signal from preference data and scaling better with modern VLMs. We have also added native Supervised Fine-tuning support for vision language models. We release training scripts and demo notebooks so you can easily get started with them!

## Table of Contents

- [Multimodal Group Relative Policy Optimization (GRPO)](#multimodal-group-relative-policy-optimization-grpo)
- [Native Supervised Fine-tuning Support](#native-supervised-fine-tuning-support)
- [vLLM Integration in TRL](#vllm-integration-in-trl)
- [Useful Resources](#useful-resources)

## Native Supervised Fine-tuning Support

Previously, [`SFTTrainer`](https://huggingface.co/docs/trl/en/sft_trainer) only partially supported vision language models, mainly because of the many differences across VLM implementations in the transformers API. With the standardization of the transformers API, we have shipped full support for vision language models: you can simply initialize `SFTTrainer` with a VLM.

```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),  # To avoid truncation that may remove image tokens during training
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),  # vision dataset with an `images` column (see below)
)
trainer.train()
```

To train a VLM, you need to provide a dataset with an additional `images` column containing the images to be processed. You can take a look at [Dataset Formats — Vision Datasets](https://huggingface.co/docs/trl/en/dataset_formats#vision-datasets) for more information on how it should look. A good example is [LLaVA Instruct Mix](https://huggingface.co/datasets/trl-lib/llava-instruct-mix).
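
For illustration, here is a minimal sketch of what a single row in such a dataset can look like, assuming the conversational vision format from the dataset-formats docs (the image and texts below are placeholders, not taken from a real dataset):

```python
from PIL import Image

# Minimal sketch of one row in a conversational vision dataset
# (placeholder image and texts; structure assumed from the TRL dataset-formats docs).
image = Image.new("RGB", (64, 64))  # stand-in for a real image

example = {
    "images": [image],  # PIL images referenced by the conversation
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},  # marks where the image is inserted in the prompt
                {"type": "text", "text": "What is shown in this image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A solid-color placeholder image."}],
        },
    ],
}
```
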
We also have a [`sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) script that works out of the box for transformers vision language models.
## vLLM Integration in TRL
vLLM is integrated into TRL to support online alignment methods, where you need to generate samples during training. The example scripts can run with vLLM enabled; the sketch below shows the relevant configuration.
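
A minimal sketch, assuming the GRPO trainer and the `use_vllm` / `vllm_mode` options in `GRPOConfig` (the output directory is a placeholder):

```python
from trl import GRPOConfig

# Sketch of enabling vLLM-backed generation for online training
# (other GRPO settings omitted; pass this config to GRPOTrainer as usual).
training_args = GRPOConfig(
    output_dir="Qwen2.5-VL-3B-GRPO",  # placeholder output directory
    use_vllm=True,                    # generate rollouts with vLLM during training
    vllm_mode="colocate",             # or "server" to use a separate `trl vllm-serve` process
)
```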