Skip to content

Commit 3356b67

Browse files
authored
[docs] fix grpo docs (#4777)
* fix docs * fix doc * update readme multi turn link
1 parent 348b11f commit 3356b67

File tree

6 files changed

+10
-10
lines changed

6 files changed

+10
-10
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,7 @@ You can contact us and communicate with us by adding our group:
8181
- 🎁 2025.05.29: Support sequence parallel in pt, sft, dpo and grpo, check script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text).
8282
- 🎁 2025.05.11: GRPO now supports custom processing logic for reward models. See the GenRM example [here](./docs/source_en/Instruction/GRPO/DeveloperGuide/reward_model.md).
8383
- 🎁 2025.04.15: The ms-swift paper has been accepted by AAAI 2025. You can find the paper at [this link](https://ojs.aaai.org/index.php/AAAI/article/view/35383).
84-
- 🎁 2025.03.23: Multi-round GRPO is now supported for training multi-turn dialogue scenarios (e.g., agent tool calling). Please refer to the [training script](examples/train/grpo/internal/vllm_multi_turn.sh).
84+
- 🎁 2025.03.23: Multi-round GRPO is now supported for training multi-turn dialogue scenarios (e.g., agent tool calling). Please refer to the [doc](./docs/source_en/Instruction/GRPO/DeveloperGuide/multi_turn.md).
8585
- 🎁 2025.03.16: Support for Megatron's parallel training techniques is now available. Please see the [Megatron-SWIFT training documentation](https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html).
8686
- 🎁 2025.03.15: Fine-tuning of embedding models for both pure text and multimodal models is supported. Please check the [training script](examples/train/embedding).
8787
- 🎁 2025.03.05: The hybrid mode for GRPO is supported, with a script for training a 72B model on 4 GPUs (4*80G) available [here](examples/train/grpo/internal/vllm_72b_4gpu.sh). Tensor parallelism with vllm is also supported, with the training script available [here](examples/train/grpo/internal).

README_CN.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -77,7 +77,7 @@
7777
- 🎁 2025.05.29: 支持pt、sft、dpo、grpo的序列并行,具体请查看[脚本](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text)
7878
- 🎁 2025.05.11: GRPO中的奖励模型支持自定义处理逻辑,GenRM的例子参考[这里](./docs/source/Instruction/GRPO/DeveloperGuide/奖励模型.md)
7979
- 🎁 2025.04.15: ms-swift论文已经被AAAI 2025接收,论文地址在[这里](https://ojs.aaai.org/index.php/AAAI/article/view/35383)
80-
- 🎁 2025.03.23: 支持了多轮GRPO,用于构建多轮对话场景的训练(例如agent tool calling),请查看[训练脚本](examples/train/grpo/internal/vllm_multi_turn.sh)
80+
- 🎁 2025.03.23: 支持了多轮GRPO,用于构建多轮对话场景的训练(例如agent tool calling),请查看[文档](docs/source/Instruction/GRPO/DeveloperGuide/多轮训练.md)
8181
- 🎁 2025.03.16: 支持了Megatron的并行技术进行训练,请查看[Megatron-SWIFT训练文档](https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT训练.html)
8282
- 🎁 2025.03.15: 支持纯文本和多模态模型的embedding模型的微调,请查看[训练脚本](examples/train/embedding)
8383
- 🎁 2025.03.05: 支持GRPO的hybrid模式,4GPU(4*80G)训练72B模型的脚本参考[这里](examples/train/grpo/internal/vllm_72b_4gpu.sh)。同时支持vllm的tensor并行,训练脚本参考[这里](examples/train/grpo/internal)

docs/source/Instruction/GRPO/DeveloperGuide/多轮训练.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -115,7 +115,7 @@ swift rollout \
115115
--model xxx \
116116
--use_async_engine true \
117117
--multi_turn_scheduler xxx \
118-
--multi_turns xxx
118+
--max_turns xxx
119119
```
120120

121121
通过参数`external_plugins`, 我们可以将本地的多轮规划器注册进 ms-swift 中,具体实现参考[代码](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/plugin/plugin.py)

docs/source/Instruction/命令行参数.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -493,8 +493,8 @@ reward模型参数将在PPO、GRPO中使用。
493493
- overlong_filter:跳过超长截断的样本,不参与loss计算,默认为False。
494494

495495
cosine 奖励参数
496-
- cosine_min_len_value_wrong:cosine 奖励函数参数,生成错误答案时,最小长度对应的奖励值。默认值为0.0
497-
- cosine_max_len_value_wrong:生成错误答案时,最大长度对应的奖励值。默认值为-0.5
496+
- cosine_min_len_value_wrong:cosine 奖励函数参数,生成错误答案时,最小长度对应的奖励值。默认值为-0.5
497+
- cosine_max_len_value_wrong:生成错误答案时,最大长度对应的奖励值。默认值为0.0
498498
- cosine_min_len_value_correct:生成正确答案时,最小长度对应的奖励值。默认值为1.0。
499499
- cosine_max_len_value_correct:生成正确答案时,最大长度对应的奖励值。默认值为0.5。
500500
- cosine_max_len:生成文本的最大长度限制。默认等于 max_completion_length。

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -511,10 +511,10 @@ The meanings of the following parameters can be referenced [here](https://huggin
511511
The hyperparameters for the reward function can be found in the [Built-in Reward Functions section](#built-in-reward-functions).
512512

513513
cosine reward function arguments
514-
- cosine_min_len_value_wrong (default: 0.0): Reward value corresponding to the minimum length when the answer is incorrect. Default is 0.0
515-
- cosine_max_len_value_wrong (default: -0.5): Reward value corresponding to the maximum length when the answer is incorrect. Default is -0.5
516-
- cosine_min_len_value_correct (default: 1.0): Reward value corresponding to the minimum length when the answer is correct. Default is 1.0
517-
- cosine_max_len_value_correct (default: 0.5): Reward value corresponding to the maximum length when the answer is correct. Default is 0.5
514+
- cosine_min_len_value_wrong (default: -0.5): Reward value corresponding to the minimum length when the answer is incorrect.
515+
- cosine_max_len_value_wrong (default: 0.0): Reward value corresponding to the maximum length when the answer is incorrect.
516+
- cosine_min_len_value_correct (default: 1.0): Reward value corresponding to the minimum length when the answer is correct.
517+
- cosine_max_len_value_correct (default: 0.5): Reward value corresponding to the maximum length when the answer is correct.
518518
- cosine_max_len (default value equal to the model's maximum generation capacity): Maximum length limit for generated text. Default value equal to max_completion_length
519519

520520
repetition penalty function arguments

docs/source_en/Instruction/GRPO/DeveloperGuide/multi_turn.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -109,7 +109,7 @@ swift rollout \
109109
--model xxx \
110110
--use_async_engine true \
111111
--multi_turn_scheduler xxx \
112-
--multi_turns xxx
112+
--max_turns xxx
113113
```
114114

115115
Through the `external_plugins` parameter, we can register local multi-round planners into ms-swift. For specific implementation, refer to the [code](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/plugin/plugin.py).

0 commit comments

Comments
 (0)