(OPMD)=
# Off-Policy RFT
Let's continue with the [previous GSM8k example](./example_reasoning_basic.md), but switch from on-policy to off-policy RFT.
In this example, we consider OPMD (Online Policy Mirror Descent), an off-policy RL algorithm implemented in Trinity-RFT.
The algorithm design and analysis can be found in Section 2.2 of [our paper](https://arxiv.org/abs/2509.24203).
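For intuition, a generic online policy mirror descent update maximizes the expected reward while penalizing divergence from the most recently synchronized policy. The formulation below is a hedged sketch in our own notation ($\mathcal{D}$ for the prompt distribution, $\pi_k$ for the policy that generated the current rollouts, $r(x, y)$ for the reward, $\tau > 0$ for a regularization temperature), not necessarily the exact OPMD objective from the paper:

$$
\pi_{k+1} \;=\; \arg\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \tau\, \mathrm{KL}\big(\pi \,\|\, \pi_k\big),
$$

whose maximizer takes the closed form $\pi_{k+1}(y \mid x) \propto \pi_k(y \mid x)\, \exp\big(r(x, y)/\tau\big)$. The KL anchor to the stale policy $\pi_k$ is what lets the trainer keep learning from rollouts collected by an explorer whose weights are synchronized only occasionally.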
The config file is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml).

To try out the OPMD algorithm:
```bash
trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
```
Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
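For orientation, the relevant settings look roughly like the excerpt below. This is an illustrative sketch rather than a copy of the real config: the exact field names and nesting are assumptions, so treat the linked [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml) as authoritative.

```yaml
# Illustrative sketch only -- field names and nesting are assumptions;
# see examples/opmd_gsm8k/opmd_gsm8k.yaml for the authoritative layout.
algorithm:
  algorithm_type: opmd    # use OPMD rather than an on-policy algorithm such as GRPO
synchronizer:
  sync_interval: 10       # sync explorer/trainer weights once every 10 training steps
                          # (a small value such as 2 gives a mostly on-policy run)
```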
In the plot below, the red curve shows the score achieved by the explorer during OPMD training.
Since the explorer's model weights remain unchanged for the first 10 steps, its score remains flat.
Then, after the model weights of explorer and trainer are synchronized at the end of step 10, we see an abrupt increase in score at step 11, which indicates effective off-policy learning in the first 10 steps.
A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_interval=2`).

If you're interested in more findings about off-policy RL algorithms, please refer to [our paper](https://arxiv.org/abs/2509.24203).