Commit d46ba67

Update trl-vlm-alignment.md (#3018)
* Update trl-vlm-alignment.md

  Fix typos and formatting issues:
  - Replace em dashes (—) with double hyphens (--) in command line examples
  - Fix subject-verb agreement: "fall short" → "falls short"
  - Fix regex pattern: add missing backslash for whitespace (s* → \s*)
  - Fix grammar: "introduces supports" → "introduces support"

  These changes correct documentation errors that could cause confusion or execution failures when users copy command examples or code snippets.

* Update trl-vlm-alignment.md

  fall short -> falls short
1 parent dff418c commit d46ba67

File tree: 1 file changed (+5, -5)

trl-vlm-alignment.md

Lines changed: 5 additions & 5 deletions
@@ -39,7 +39,7 @@ But in the last year, new multimodal alignment methods have gained popularity, G

### Mixed Preference Optimization (MPO)

-Aligning multimodal models with SFT to do reasoning tasks fall short due to distribution shift. Meanwhile, models aligned with DPO fail to generate coherent rationales and might generate repetitive responses. To address this, there’s a new technique called [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) (MPO) specifically made for multimodal models. This method is essentially an extension of DPO with multiple losses: preference loss from DPO (sigmoid), quality loss from Binary Classifier Optimization (BCO), and generation loss from SFT. According to the [paper](https://huggingface.co/papers/2411.10442), simply switching to this combined loss results in 6.2 pts improvement in MathVista!
+Aligning multimodal models with SFT to do reasoning tasks falls short due to distribution shift. Meanwhile, models aligned with DPO fail to generate coherent rationales and might generate repetitive responses. To address this, there’s a new technique called [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) (MPO) specifically made for multimodal models. This method is essentially an extension of DPO with multiple losses: preference loss from DPO (sigmoid), quality loss from Binary Classifier Optimization (BCO), and generation loss from SFT. According to the [paper](https://huggingface.co/papers/2411.10442), simply switching to this combined loss results in 6.2 pts improvement in MathVista!

![MPO](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trl-vlm/image_1.png)

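As an aside from the diff itself: the combined loss described above maps fairly directly onto TRL's DPO tooling. Below is a minimal sketch, assuming a TRL version whose `DPOConfig` accepts a list of loss types together with `loss_weights`; the output path and the weighting of the three terms are illustrative, not values taken from this post.

```python
# Minimal MPO-style sketch (not part of the diff): combine DPO's sigmoid
# preference loss, BCO's quality loss, and an SFT generation loss.
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="qwen2.5-vl-3b-mpo",            # hypothetical output path
    loss_type=["sigmoid", "bco_pair", "sft"],  # DPO + BCO + SFT terms
    loss_weights=[0.8, 0.2, 1.0],              # illustrative weighting of the three losses
)

# Assuming `model`, `processor`, and a preference `dataset` are already prepared:
# trainer = DPOTrainer(model=model, args=training_args,
#                      train_dataset=dataset, processing_class=processor)
# trainer.train()
```
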
@@ -90,7 +90,7 @@ from math_verify import LatexExtractionConfig, parse, verify

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
-    pattern = r"^<think>.*?</think>s*<answer>.*?</answer>$"
+    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    matches = [re.match(pattern, content) for content in completions]
    rewards_list = [1.0 if match else 0.0 for match in matches]
    rewards = [1.0 if match else 0.0 for match in matches]
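
To see why the `\s*` fix above matters, here is a quick standalone check (not part of the diff). It treats completions as plain strings, just like the snippet above, and the sample completion is made up for illustration.

```python
import re

# Old pattern: "s*" only matches literal 's' characters between the tags.
old_pattern = r"^<think>.*?</think>s*<answer>.*?</answer>$"
# Fixed pattern: "\s*" allows whitespace (spaces, newlines) between the tags.
new_pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"

completion = "<think>2 + 2 = 4</think>\n<answer>4</answer>"

print(bool(re.match(old_pattern, completion)))  # False: the newline breaks the match
print(bool(re.match(new_pattern, completion)))  # True: the newline is absorbed by \s*
```
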
@@ -143,7 +143,7 @@ Explore the full notebook example [here](https://huggingface.co/learn/cookbook/f

[Group Sequence Policy Optimization](https://huggingface.co/papers/2507.18071) (GSPO) is a RL alignment algorithm recently released by Qwen that overcomes some limitations of GRPO. It achieves a more stable training computing importance sampling weights at the sequence level instead of per-token. Its benefits are more [relevant](https://github.com/volcengine/verl/pull/2775#issuecomment-3134375131) in MoE style models.

-Latest TRL also introduces supports for GSPO and since it’s a variant of GRPO's loss, it comes with multimodal support. To create the trainer, the process is the same as with GRPO, but adding the following extra params (values are extracted from the paper).
+Latest TRL also introduces support for GSPO and since it’s a variant of GRPO's loss, it comes with multimodal support. To create the trainer, the process is the same as with GRPO, but adding the following extra params (values are extracted from the paper).

```python
from trl import GRPOConfig
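# --- Illustrative sketch, not part of the diff above ---
# The hunk cuts off before the actual config, so the settings below are an
# assumption of how the GSPO switch typically looks in TRL: it relies on
# GRPOConfig exposing `importance_sampling_level`, and the clipping ranges are
# the small sequence-level values reported in the GSPO paper, used here only
# as a starting point.
training_args = GRPOConfig(
    output_dir="qwen2.5-vl-3b-gspo",       # hypothetical output path
    importance_sampling_level="sequence",  # GSPO: sequence-level importance ratios instead of per-token
    epsilon=3e-4,                          # lower clipping range
    epsilon_high=4e-4,                     # upper clipping range
    beta=0.0,                              # no KL regularization term
)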
@@ -189,7 +189,7 @@ Here's a table summarizing model outputs for Qwen2.5VL-3B fine-tuned with the te
vLLM is integrated in TRL to support online alignment methods where you need to generate samples during training. Running the example scripts like the following enables vLLM:

```bash
-CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … --log_completions use_vllm —vlm_mode colocate
+CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … --log_completions --use_vllm --vllm_mode colocate
```

There’s mainly two modes: `colocate` and `server`. [`colocate`](https://huggingface.co/blog/vllm-colocate) runs vLLM in the same process as the training loop, sharing the same GPU between training and generation, creating a vLLM LLM instance inside the `GRPOTrainer`. Meanwhile `server` requires you to serve vLLM separately in a different process where you can hit the server. You can start this server with the command:
@@ -201,7 +201,7 @@ trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct --tensor-parallel-size 1
Then you can run the script as follows.

```bash
-CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … --log_completions use_vllm —vlm_mode server
+CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … --log_completions --use_vllm --vllm_mode server
```

One more tip: we have added support for using vLLM with transformers backend in TRL. You can enable it when running a script with colocate or when serving the model by passing the `--vllm_model_impl transformers` flag.
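
For concreteness, here is a sketch of what that looks like (not part of the diff), reusing the colocate command from above; the `…` stands for the other training flags elided in the post.

```bash
CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py \
  --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct … \
  --log_completions --use_vllm --vllm_mode colocate \
  --vllm_model_impl transformers
```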
