* Update trl-vlm-alignment.md
Fix typos and formatting issues:
- Replace em dashes (—) with double hyphens (--) in command line examples
- Fix subject-verb agreement: "fall short" → "falls short"
- Fix regex pattern: add missing backslash for whitespace (s* → \s*)
- Fix grammar: "introduces supports" → "introduces support"
These changes correct documentation errors that could cause confusion
or execution failures when users copy command examples or code snippets.
* Update trl-vlm-alignment.md
fall short -> falls short
trl-vlm-alignment.md (5 additions, 5 deletions)
@@ -39,7 +39,7 @@ But in the last year, new multimodal alignment methods have gained popularity, G
### Mixed Preference Optimization (MPO)
- Aligning multimodal models with SFT to do reasoning tasks fall short due to distribution shift. Meanwhile, models aligned with DPO fail to generate coherent rationales and might generate repetitive responses. To address this, there’s a new technique called [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) (MPO) specifically made for multimodal models. This method is essentially an extension of DPO with multiple losses: preference loss from DPO (sigmoid), quality loss from Binary Classifier Optimization (BCO), and generation loss from SFT. According to the [paper](https://huggingface.co/papers/2411.10442), simply switching to this combined loss results in 6.2 pts improvement in MathVista!
+ Aligning multimodal models with SFT to do reasoning tasks falls short due to distribution shift. Meanwhile, models aligned with DPO fail to generate coherent rationales and might generate repetitive responses. To address this, there’s a new technique called [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) (MPO) specifically made for multimodal models. This method is essentially an extension of DPO with multiple losses: preference loss from DPO (sigmoid), quality loss from Binary Classifier Optimization (BCO), and generation loss from SFT. According to the [paper](https://huggingface.co/papers/2411.10442), simply switching to this combined loss results in 6.2 pts improvement in MathVista!
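For reference, a minimal sketch of how that combined loss can be set up in TRL; the `loss_type`/`loss_weights` values follow the MPO recipe, but treat the exact argument names as an assumption about the current `DPOConfig` API rather than verbatim blog code:

```python
from trl import DPOConfig

# Sketch of an MPO-style setup: combine DPO's sigmoid preference loss with a BCO
# quality loss and an SFT generation loss (argument names assume recent TRL releases).
training_args = DPOConfig(
    output_dir="qwen2.5-vl-mpo",               # hypothetical output directory
    loss_type=["sigmoid", "bco_pair", "sft"],  # preference + quality + generation losses
    loss_weights=[0.8, 0.2, 1.0],              # relative weights for the three losses
)
```

The config is then passed to `DPOTrainer` the same way as in a plain DPO run.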
matches = [re.match(pattern, content) for content in completions]
rewards_list = [1.0 if match else 0.0 for match in matches]
rewards = [1.0 if match else 0.0 for match in matches]
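These lines come from the post's format reward used with GRPO; for context, a self-contained sketch of such a reward function might look like the following (the function name and regex here are illustrative assumptions, not the blog's exact code):

```python
import re

def format_reward(completions, **kwargs):
    """Illustrative sketch: reward 1.0 when a completion wraps its reasoning in
    <think>...</think> followed by <answer>...</answer>, otherwise 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"  # note the \s* that the commit above fixes
    matches = [re.match(pattern, content, re.DOTALL) for content in completions]
    rewards = [1.0 if match else 0.0 for match in matches]
    return rewards
```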
@@ -143,7 +143,7 @@ Explore the full notebook example [here](https://huggingface.co/learn/cookbook/f
[Group Sequence Policy Optimization](https://huggingface.co/papers/2507.18071) (GSPO) is an RL alignment algorithm recently released by Qwen that overcomes some limitations of GRPO. It achieves more stable training by computing importance sampling weights at the sequence level instead of per token. Its benefits are more [relevant](https://github.com/volcengine/verl/pull/2775#issuecomment-3134375131) in MoE-style models.
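Concretely, where GRPO clips a per-token probability ratio, GSPO clips a length-normalized sequence-level ratio, roughly (notation paraphrased from the paper):

```latex
% GSPO's sequence-level importance ratio (sketch): average the per-token log-ratios
% over a sampled response y_i, then exponentiate.
s_i(\theta)
  = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}
  = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)
```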
- Latest TRL also introduces supports for GSPO and since it’s a variant of GRPO's loss, it comes with multimodal support. To create the trainer, the process is the same as with GRPO, but adding the following extra params (values are extracted from the paper).
+ Latest TRL also introduces support for GSPO and since it’s a variant of GRPO's loss, it comes with multimodal support. To create the trainer, the process is the same as with GRPO, but adding the following extra params (values are extracted from the paper).
```python
from trl import GRPOConfig
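# The diff excerpt cuts the original block off after the import. As a sketch, the
# GSPO-specific arguments it goes on to set look roughly like this (argument names
# assume the current GRPOConfig API; the epsilon values are the ones reported in the paper):
training_args = GRPOConfig(
    # ...usual GRPO arguments (model, output dir, generation settings)...
    importance_sampling_level="sequence",  # sequence-level ratios instead of per-token: this is GSPO
    epsilon=3e-4,                          # lower clipping bound
    epsilon_high=4e-4,                     # upper clipping bound
    beta=0.0,                              # no KL regularization, as in the paper
)
```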
@@ -189,7 +189,7 @@ Here's a table summarizing model outputs for Qwen2.5VL-3B fine-tuned with the te
vLLM is integrated in TRL to support online alignment methods where you need to generate samples during training. Running example scripts like the following enables vLLM:
There are mainly two modes: `colocate` and `server`. [`colocate`](https://huggingface.co/blog/vllm-colocate) runs vLLM in the same process as the training loop, sharing the same GPU between training and generation by creating a vLLM `LLM` instance inside the `GRPOTrainer`. Meanwhile, `server` requires you to serve vLLM separately in a different process that the training job can then hit. You can start this server with the command:
- CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … --log_completions —use_vllm —vlm_mode server
+ CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … --log_completions --use_vllm --vllm_mode server
```
One more tip: we have added support for using vLLM with the transformers backend in TRL. You can enable it when running a script in colocate mode or when serving the model by passing the `--vllm_model_impl transformers` flag.
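For completeness, a rough sketch of what the same switches might look like when set from Python instead of the CLI; the argument names simply mirror the flags above (`--use_vllm`, `--vllm_mode`, `--vllm_model_impl`), so treat them as assumptions about the current `GRPOConfig` API:

```python
from trl import GRPOConfig

# Sketch: colocate mode runs vLLM inside the trainer process, so no separate server is needed.
training_args = GRPOConfig(
    output_dir="grpo-vlm-colocate",  # hypothetical output directory
    use_vllm=True,
    vllm_mode="colocate",            # or "server" to hit a separately launched vLLM server
    vllm_model_impl="transformers",  # optional: use the transformers backend inside vLLM
    log_completions=True,
)
```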