
Commit f770478

[grpo] update doc (#4853)
1 parent b41c78d · commit f770478

File tree

4 files changed (+4, -15 lines)


docs/source/Instruction/GRPO/DeveloperGuide/多轮训练.md

Lines changed: 1 addition & 7 deletions

```diff
@@ -1,12 +1,6 @@
 # Multi-Turn Training
 
-Note: This feature requires the ms-swift source code (3.6.dev)
-```bash
-git clone https://github.com/modelscope/ms-swift.git
-cd ms-swift
-pip install -e .
-```
-
+Note: This feature requires ms-swift>=3.6
 
 In reinforcement learning training scenarios, model sampling may require multiple rounds of interaction with the environment (e.g., tool calls, external API access, etc.). This interactive training requires the model to perform continuous reasoning based on environmental feedback. This document details how to customize multi-turn training workflows in GRPO training.
```

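The change above drops the build-from-source instructions in favor of a released version. A minimal install command matching the new note, assuming the usual PyPI package name `ms-swift`, would be:

```bash
# Install a release satisfying the doc's new requirement (ms-swift>=3.6),
# replacing the previous git-clone + `pip install -e .` workflow.
pip install "ms-swift>=3.6"
```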
docs/source/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -274,7 +274,7 @@ $
 
 The algorithm becomes off-policy (near-on-policy) under the following parameter settings:
 1. num_iterations > 1
-2. steps_per_generation > gradient_accumulation_steps
+2. gradient_accumulation_steps % steps_per_generation != 0
 
 Refer to [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851)
```

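The corrected condition replaces a simple comparison with a divisibility test: a rollout batch stays on-policy only if gradient accumulation consumes it in whole generation rounds. A minimal sketch of the check, using hypothetical parameter values (the variable names mirror the GRPO arguments in the diff, not any real config file):

```bash
# Hypothetical values, for illustration only.
num_iterations=1
gradient_accumulation_steps=6
steps_per_generation=4

# Rule from the updated doc: training is off-policy (near-on-policy) when
# num_iterations > 1, or when gradient_accumulation_steps is not an exact
# multiple of steps_per_generation.
if [ "$num_iterations" -gt 1 ] || \
   [ $(( gradient_accumulation_steps % steps_per_generation )) -ne 0 ]; then
    echo "off-policy (near-on-policy): some micro-batches train on stale rollouts"
else
    echo "on-policy: each generation round is consumed by whole optimizer steps"
fi
```

With the values above, 6 % 4 = 2, so the run would be near-on-policy even though num_iterations is 1.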
docs/source_en/Instruction/GRPO/DeveloperGuide/multi_turn.md

Lines changed: 1 addition & 6 deletions

```diff
@@ -1,11 +1,6 @@
 # Multi-Turn Rollout
 
-Note: This feature requires the ms-swift source code (3.6.dev)
-```bash
-git clone https://github.com/modelscope/ms-swift.git
-cd ms-swift
-pip install -e .
-```
+Note: This feature requires ms-swift>=3.6
 
 In reinforcement learning training scenarios, model sampling may require multiple rounds of interaction with the environment (e.g., tool calls, external API access, etc.). This interactive training requires the model to perform continuous reasoning based on environmental feedback. This document details how to customize multi-round training workflows in GRPO training.
```

docs/source_en/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -273,7 +273,7 @@ Thus, the importance sampling ratio is always 1, and the clip operation does not
 
 The algorithm becomes off-policy (near-on-policy) under the following parameter settings:
 1. num_iterations > 1
-2. steps_per_generation > gradient_accumulation_steps
+2. gradient_accumulation_steps % steps_per_generation != 0
 
 Refer to [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
```
