
Commit ee831f5

Updates GRPOTrainer to be compatible with trl 0.17 (#3969)
* shuffle
* generate once
* fix split tensor dict
* fix split
* move vllm args to mixin
* update
* gas wip
* lint and fix
* wip
* rm mini-batch
* recover log metrics
* fix
* fix
* wip
* wip
* server infer
* rm unused
* rollout cli
* fix rollout
* fix rollout
* loss type
* fix
* mode and log
* fix
* fix
* rm mini_batch_size and doc
* update
* doc
* doc
* lint
* rm comment
* rm comment
1 parent 87f7f76 commit ee831f5

37 files changed: +729 −324 lines

README.md

Lines changed: 6 additions & 6 deletions
@@ -75,15 +75,15 @@ You can contact us and communicate with us by adding our group:
 
 ## 🎉 News
 - 🎁 2025.04.15: SWIFT paper has been accepted by AAAI 2025, you can find the paper [here](https://ojs.aaai.org/index.php/AAAI/article/view/35383).
-- 🎁 2025.03.23: SWIFT supports multi round GRPO, this is used to construct multi turn conversations(use cases like agent tool calling), check script [here](examples/train/grpo/train_multi_round.sh).
+- 🎁 2025.03.23: SWIFT supports multi round GRPO, this is used to construct multi turn conversations(use cases like agent tool calling), check script [here](examples/train/grpo/internal/train_multi_round.sh).
 - 🎁 2025.03.16: SWIFT supports training with Megatron's parallel technology. Please refer to the [Megatron-SWIFT Training Documentation](https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html).
 - 🎁 2025.03.15: SWIFT support the fine-tuning of gme(multi-modal) embedding models,please check the [training script](examples/train/embedding/train_gme.sh)
-- 🎁 2025.03.13: We provide a script of GRPO to train a 72B model with only 4 GPUs(4*80G), please check [here](examples/train/grpo/train_72b_4gpu.sh)
-- 🎁 2025.03.05: We support the hybrid mode of GRPO(rollout and actor on the same GPU, rollout sleep when actor training), meanwhile tensor parallel for GRPO, check [training script here](examples/train/grpo/multi_gpu_mp_colocate.sh)
-- 🎁 2025.02.21: We test the speed performance of GRPO,and with some tricks to [speed up to 300%](examples/train/grpo/full_lmdeploy.sh). WanDB charts can be found [here](https://wandb.ai/tastelikefeet/grpo_perf_test?nw=nwuseryuzezyz)
+- 🎁 2025.03.13: We provide a script of GRPO to train a 72B model with only 4 GPUs(4*80G), please check [here](examples/train/grpo/internal/train_72b_4gpu.sh)
+- 🎁 2025.03.05: We support the hybrid mode of GRPO(rollout and actor on the same GPU, rollout sleep when actor training), meanwhile tensor parallel for GRPO, check [training script here](examples/train/grpo/internal/multi_gpu_mp_colocate.sh)
+- 🎁 2025.02.21: We test the speed performance of GRPO,and with some tricks to [speed up to 300%](examples/train/grpo/internal/full_lmdeploy.sh). WanDB charts can be found [here](https://wandb.ai/tastelikefeet/grpo_perf_test?nw=nwuseryuzezyz)
 - 🎁 2025.02.21: Support distill from LLM API,Please check [this example](examples/sampler/distill/distill.sh)
 - 🎁 2025.02.17: Support SwanLab, just add [a few of arguments](docs/source_en/Instruction/Command-line-parameters.md#swanlab) you can use swanlab to analysis your training results
-- 🎁 2025.02.16: Support LMDeploy in GRPO, use `--use_lmdeploy true`. Please check [this script](examples/train/grpo/full_lmdeploy.sh)
+- 🎁 2025.02.16: Support LMDeploy in GRPO, use `--use_lmdeploy true`. Please check [this script](examples/train/grpo/internal/full_lmdeploy.sh)
 - 🔥 2025.02.12: Support for GRPO(Group Relative Policy Optimization) algorithm for llm and mllm, document can be found in [here](docs/source_en/Instruction/GRPO.md)
 - 🎁 2025.02.10: SWIFT support the fine-tuning of embedding models,please check the [training script](examples/train/embedding/train_gte.sh)
 - 🎁 2025.01.23: SWIFT support the `sample` command, this is a very important feature for complex CoT and RFT. Meanwhile, we support an [Reinforced Fine-tuning script](docs/source_en/Instruction/Reinforced_Fine_tuning.md).
@@ -282,7 +282,7 @@ Supported Training Methods:
 | Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) ||||||
 | Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
 | DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
-| GRPO Training | []((https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/grpo_zero2.sh)) |||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node) ||
+| GRPO Training | []((https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/grpo_zero2.sh)) |||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/multi_node) ||
 | Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) |||
 | PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) |||
 | KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |

README_CN.md

Lines changed: 6 additions & 6 deletions
@@ -71,15 +71,15 @@
 
 ## 🎉 News
 - 🎁 2025.04.15: The SWIFT paper has been accepted by AAAI 2025; the paper is available [here](https://ojs.aaai.org/index.php/AAAI/article/view/35383).
-- 🎁 2025.03.23: SWIFT supports multi-round GRPO for training multi-turn conversation scenarios (e.g. agent tool calling); see the [training script](examples/train/grpo/train_multi_round.sh).
+- 🎁 2025.03.23: SWIFT supports multi-round GRPO for training multi-turn conversation scenarios (e.g. agent tool calling); see the [training script](examples/train/grpo/internal/train_multi_round.sh).
 - 🎁 2025.03.16: SWIFT supports training with Megatron's parallelism techniques; see the [Megatron-SWIFT training documentation](https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT训练.html)
 - 🎁 2025.03.15: SWIFT supports fine-tuning of gme (multi-modal) embedding models; see the [training script](examples/train/embedding/train_gme.sh)
-- 🎁 2025.03.13: We provide a script for training a 72B model with only 4 GPUs (4*80G); see [here](examples/train/grpo/train_72b_4gpu.sh)
-- 🎁 2025.03.05: Support the hybrid mode of GRPO (rollout and actor on the same GPU, with rollout offload), as well as vLLM tensor parallelism; see the [training script](examples/train/grpo/multi_gpu_mp_colocate.sh)
-- 🎁 2025.02.21: We benchmarked the performance of GRPO and used some tricks to [speed up training to 300%](examples/train/grpo/full_lmdeploy.sh). The WandB charts can be found [here](https://wandb.ai/tastelikefeet/grpo_perf_test?nw=nwuseryuzezyz)
+- 🎁 2025.03.13: We provide a script for training a 72B model with only 4 GPUs (4*80G); see [here](examples/train/grpo/internal/train_72b_4gpu.sh)
+- 🎁 2025.03.05: Support the hybrid mode of GRPO (rollout and actor on the same GPU, with rollout offload), as well as vLLM tensor parallelism; see the [training script](examples/train/grpo/internal/multi_gpu_mp_colocate.sh)
+- 🎁 2025.02.21: We benchmarked the performance of GRPO and used some tricks to [speed up training to 300%](examples/train/grpo/internal/full_lmdeploy.sh). The WandB charts can be found [here](https://wandb.ai/tastelikefeet/grpo_perf_test?nw=nwuseryuzezyz)
 - 🎁 2025.02.21: Support distillation sampling from LLM APIs; see [this example](examples/sampler/distill/distill.sh)
 - 🎁 2025.02.17: Support SwanLab; just add [a few new arguments](docs/source/Instruction/命令行参数.md#swanlab) to review your training results in SwanLab
-- 🎁 2025.02.16: Support LMDeploy in GRPO via `--use_lmdeploy true`; see [this script](examples/train/grpo/full_lmdeploy.sh)
+- 🎁 2025.02.16: Support LMDeploy in GRPO via `--use_lmdeploy true`; see [this script](examples/train/grpo/internal/full_lmdeploy.sh)
 - 🔥 2025.02.12: Support the GRPO (Group Relative Policy Optimization) training algorithm; the training script can be found [here](docs/source/Instruction/GRPO.md)
 - 🎁 2025.02.10: SWIFT supports fine-tuning of embedding models; see the [training script](examples/train/embedding/train_gte.sh)
 - 🎁 2025.01.23: SWIFT supports the `sample` command, which is very important for complex CoT and RFT. We also provide a [reinforced fine-tuning script](docs/source/Instruction/强化微调.md)
@@ -270,7 +270,7 @@ print(f'response: {resp_list[0].choices[0].message.content}')
 | Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) ||||||
 | Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
 | DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
-| GRPO Training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/grpo_zero2.sh) |||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node) ||
+| GRPO Training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/grpo_zero2.sh) |||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/multi_node) ||
 | Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) |||
 | PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) |||
 | KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |

docs/source/Instruction/GRPO.md

Lines changed: 14 additions & 3 deletions
@@ -6,7 +6,7 @@
 
 Environment setup
 ```bash
-pip install math_verify # reward function
+pip install math_verify==0.5.2 # reward function
 pip install -U trl
 ```
 
@@ -47,6 +47,14 @@ The GRPO training framework supports integrating a high-performance inference engine (such as vLLM) to accelerate sampling
    --vllm_server_port <server_port> \
    --vllm_server_timeout <timeout> \
 ```
+Use the `swift rollout` command to deploy the vLLM server; currently only the vLLM backend is supported.
+```bash
+CUDA_VISIBLE_DEVICES=2 \
+swift rollout \
+  --model Qwen/Qwen2.5-VL-7B-Instruct \
+  --tensor_parallel_size 2 \
+```
+A complete script can be found [here](../../../examples/train/grpo/multi_node/Qwen2_5_32B_full.sh)
 
 
 ## Reward Functions
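
A hedged illustration (not taken from this commit) of how a training job could consume the rollout server deployed in the hunk above: it reuses the `--vllm_server_host/port/timeout` arguments shown there, while the `swift rlhf --rlhf_type grpo` entry point, the dataset placeholder, and the host/port values are assumptions that may differ by version.

```bash
# Sketch only: connects GRPO training to an external rollout server.
# The entry point, dataset, and server address are assumed values.
CUDA_VISIBLE_DEVICES=0,1 \
swift rlhf \
  --rlhf_type grpo \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset <your_dataset> \
  --reward_funcs accuracy \
  --use_vllm true \
  --vllm_server_host 127.0.0.1 \
  --vllm_server_port 8000 \
  --vllm_server_timeout 120
```
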
@@ -137,12 +145,16 @@ A conversation between User and Assistant. The user asks a question, and the Ass
 
 ## Arguments and Running Scripts
 Arguments
-- num_generations: the number of samples per prompt, the G value in the paper; needs to be divisible by per_device_batch_size * nproc_per_node
+- per_device_train_batch_size: the training batch size per device; in GRPO this refers to the completion batch size.
+- per_device_eval_batch_size: the evaluation batch size per device; in GRPO this refers to the completion batch size.
+- num_generations: the number of samples per prompt, the G value in the paper; needs to be divisible by per_device_batch_size * gradient_accumulation_steps * nproc_per_node; defaults to 8
 - max_completion_length: the maximum length of sampled generations; defaults to 512
 - ds3_gather_for_generation: applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. Disabling it allows training models that exceed a single GPU's VRAM, although generation becomes slower; disabling it is incompatible with vLLM generation. Defaults to True
 - reward_funcs: reward functions that score the model's generations; four rule-based functions are built in: accuracy, format, cosine, and repetition; see swift/plugin/orm.py for details
 - reward_weights: the weight of each reward function. Must match the number of reward functions. If None, all rewards are weighted equally at `1.0`
 - Note: if `--reward_model` is included in GRPO training, it is appended after the reward functions
+- dataset_shuffle: whether to shuffle the dataset; defaults to True
+- loss_type: the type of loss normalization; options are ['grpo', 'bnpo', 'dr_grpo'], defaults to 'grpo'; see this [PR](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348) for details
 - log_completions: whether to log the model's generations during training; used together with `--report_to wandb`. Defaults to False
 - Note: if `--report_to wandb` is not set, a `completions.jsonl` will be created in the checkpoint directory to store the generations
 - use_vllm: whether to use vLLM as the generation backend for sampling; defaults to False; recommended to speed up training
@@ -168,7 +180,6 @@ A conversation between User and Assistant. The user asks a question, and the Ass
 - Note: if this argument is set to True and grad_norm stays at 0 during training, please install `vllm==0.7.3`
 - gc_collect_after_offload: whether to run garbage collection (both Python and GPU GC) after offloading; defaults to False
 - multi_turn_func: multi-turn GRPO argument; pass the corresponding plugin name and add the matching implementation in plugin/multi_turn.py
-- mini_batch_size: used to further split the per-device batch size (per_device_batch) into smaller sub-batches. For the split to be valid, per_device_batch must be divisible by mini_batch_size
 - dynamic_sample: filter out data whose within-group reward standard deviation is 0 and sample additional new data; defaults to False.
 - max_resample_times: limits the number of resampling rounds when dynamic_sample is enabled; defaults to 3.
 - overlong_filter: skip samples that were truncated for being over-length so they do not contribute to the loss; defaults to False.
