
Commit f770478

[grpo] update doc (#4853)
1 parent b41c78d · commit f770478

File tree

4 files changed (+4, -15 lines)


docs/source/Instruction/GRPO/DeveloperGuide/多轮训练.md

Lines changed: 1 addition & 7 deletions

```diff
@@ -1,12 +1,6 @@
 # Multi-Turn Training
 
-Note: This feature requires the ms-swift source code (3.6.dev)
-```bash
-git clone https://github.com/modelscope/ms-swift.git
-cd ms-swift
-pip install -e .
-```
-
+Note: This feature requires ms-swift>=3.6
 
 In reinforcement learning training scenarios, model sampling may require multiple rounds of interaction with the environment (e.g., tool calls, external API access, etc.). This interactive training requires the model to perform continuous reasoning based on environmental feedback. This document details how to customize multi-turn training workflows in GRPO training.
```

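The change above drops the build-from-source instructions in favor of a released version. A minimal install command matching the new note, assuming the usual PyPI package name `ms-swift`, would be:

```bash
# Install a release satisfying the doc's new requirement (ms-swift>=3.6),
# replacing the previous git-clone + `pip install -e .` workflow.
pip install "ms-swift>=3.6"
```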
docs/source/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -274,7 +274,7 @@ $
 
 The algorithm becomes off-policy (near-on-policy) under the following parameter settings:
 1. num_iterations > 1
-2. steps_per_generation > gradient_accumulation_steps
+2. gradient_accumulation_steps % steps_per_generation != 0
 
 Refer to [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851)
```

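The corrected condition replaces a simple comparison with a divisibility test: a rollout batch stays on-policy only if gradient accumulation consumes it in whole generation rounds. A minimal sketch of the check, using hypothetical parameter values (the variable names mirror the GRPO arguments in the diff, not any real config file):

```bash
# Hypothetical values, for illustration only.
num_iterations=1
gradient_accumulation_steps=6
steps_per_generation=4

# Rule from the updated doc: training is off-policy (near-on-policy) when
# num_iterations > 1, or when gradient_accumulation_steps is not an exact
# multiple of steps_per_generation.
if [ "$num_iterations" -gt 1 ] || \
   [ $(( gradient_accumulation_steps % steps_per_generation )) -ne 0 ]; then
    echo "off-policy (near-on-policy): some micro-batches train on stale rollouts"
else
    echo "on-policy: each generation round is consumed by whole optimizer steps"
fi
```

With the values above, 6 % 4 = 2, so the run would be near-on-policy even though num_iterations is 1.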
docs/source_en/Instruction/GRPO/DeveloperGuide/multi_turn.md

Lines changed: 1 addition & 6 deletions

```diff
@@ -1,11 +1,6 @@
 # Multi-Turn Rollout
 
-Note: This feature requires the ms-swift source code (3.6.dev)
-```bash
-git clone https://github.com/modelscope/ms-swift.git
-cd ms-swift
-pip install -e .
-```
+Note: This feature requires ms-swift>=3.6
 
 In reinforcement learning training scenarios, model sampling may require multiple rounds of interaction with the environment (e.g., tool calls, external API access, etc.). This interactive training requires the model to perform continuous reasoning based on environmental feedback. This document details how to customize multi-round training workflows in GRPO training.
```

docs/source_en/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -273,7 +273,7 @@ Thus, the importance sampling ratio is always 1, and the clip operation does not
 
 The algorithm becomes off-policy (near-on-policy) under the following parameter settings:
 1. num_iterations > 1
-2. steps_per_generation > gradient_accumulation_steps
+2. gradient_accumulation_steps % steps_per_generation != 0
 
 Refer to [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
```
