
Commit bd6e225

Add SFT Trainer for VLM (#3024)
--------- Co-authored-by: Sergio Paniego Blanco <[email protected]>
1 parent 6c2867d commit bd6e225

File tree

1 file changed (+22 −1 lines changed)


trl-vlm-alignment.md

Lines changed: 22 additions & 1 deletion
@@ -15,7 +15,7 @@ authors:

Vision Language Models (VLMs) are getting stronger, but *aligning* them to human preferences still matters. In TRL, we already showed how to post-train VLMs with [**Supervised Fine-Tuning (SFT)**](https://huggingface.co/docs/trl/main/en/training_vlm_sft) and [**Direct Preference Optimization (DPO)**](https://huggingface.co/learn/cookbook/fine_tuning_vlm_dpo_smolvlm_instruct). This time, we're going further.

-**tl;dr** We have added two new multimodal alignment methods to TRL: **Group Relative Policy Optimization (GRPO)**, its variant **Group Sequence Policy Optimization (GSPO)**, and **Mixed Preference Optimization (MPO)**. All of them let you go beyond pairwise DPO, extracting more signal from preference data and scaling better with modern VLMs. We release training scripts and demo notebooks to easily get started with them!
+**tl;dr** We have added new multimodal alignment methods to TRL: **Group Relative Policy Optimization (GRPO)**, its variant **Group Sequence Policy Optimization (GSPO)**, and **Mixed Preference Optimization (MPO)**. All of them let you go beyond pairwise DPO, extracting more signal from preference data and scaling better with modern VLMs. We have also added native Supervised Fine-tuning support for vision language models. We release training scripts and demo notebooks so you can get started with them easily!

## Table of Contents


@@ -26,6 +26,7 @@ Vision Language Models (VLMs) are getting stronger, but *aligning* them to human
 - [Multimodal Group Relative Policy Optimization (GRPO)](#multimodal-group-relative-policy-optimization-grpo)
 - [Group Sequence Policy Optimization (GSPO)](#group-sequence-policy-optimization-gspo)
 - [Comparison](#comparison)
+- [Native Supervised Fine-tuning Support](#native-supervised-fine-tuning-support)
 - [vLLM Integration in TRL](#vllm-integration-in-trl)
 - [Useful Resources](#useful-resources)

@@ -184,6 +185,26 @@ Here's a table summarizing model outputs for Qwen2.5VL-3B fine-tuned with the te
</details>

## Native Supervised Fine-tuning Support

Previously, [`SFTTrainer`](https://huggingface.co/docs/trl/en/sft_trainer) offered only partial support for vision language models, primarily because of the many differences across VLM implementations in the transformers API. With the standardization of the transformers API, we have shipped full support for vision language models. You can simply initialize `SFTTrainer` with a VLM.

```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),  # avoid truncation that may remove image tokens during training
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
```

To train a VLM, you need to provide a dataset with an additional `images` column containing the images to be processed. See [Dataset Formats — Vision Datasets](https://huggingface.co/docs/trl/en/dataset_formats#vision-datasets) for more information on what this format looks like. A good example is [LLaVA Instruct Mix](https://huggingface.co/datasets/trl-lib/llava-instruct-mix).
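As an illustrative sketch, a single row in such a dataset pairs a chat-style `messages` list with the `images` column; the structure below is an assumption based on the Vision Datasets documentation, not a verbatim row from LLaVA Instruct Mix, and the placeholder string stands in for a real `PIL.Image` object:

```python
# Sketch of one training example in the conversational vision format
# (an assumption based on TRL's Vision Datasets docs). In a real dataset,
# "images" holds decoded PIL.Image objects, one per {"type": "image"}
# slot referenced in the messages.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},  # marks where the image appears in the prompt
                {"type": "text", "text": "What is shown in this image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A photo of a cat."}],
        },
    ],
    "images": ["<PIL.Image placeholder>"],  # real rows store image objects here
}

# Sanity check: one image entry per image slot in the messages.
image_slots = sum(
    1
    for message in example["messages"]
    for part in message["content"]
    if part["type"] == "image"
)
assert image_slots == len(example["images"])
```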

We also have a [`sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) script that works out of the box with vision language models in transformers.

## vLLM Integration in TRL

vLLM is integrated in TRL to support online alignment methods where you need to generate samples during training. Running example scripts like the following enables vLLM:
