
Commit e755d08

Update off-policy RFT documentation (agentscope-ai#335)
1 parent e3ba954 commit e755d08

File tree

3 files changed (+15, -25 lines)

Lines changed: 7 additions & 16 deletions
@@ -1,17 +1,10 @@
-# Off-Policy RFT
-
-
-Let's continue with the [previous GSM8k example](./example_reasoning_basic.md) and show some advanced features provided by Trinity-RFT, namely, off-policy or asynchronous RFT mode.
-
-
-
-
 (OPMD)=
-## OPMD: a native off-policy RL algorithm
+# Off-Policy RFT


-As an experimental feature of Trinity-RFT, we develop an embarrasingly simple off-policy RL algorithm, termed as OPMD (Online Policy Mirror Descent, inspired by [Kimi k1.5](https://arxiv.org/abs/2501.12599)).
-The algorithm design and analysis can be found in Appendix A of [the technique report of Trinity-RFT](https://arxiv.org/abs/2505.17826).
+Let's continue with the [previous GSM8k example](./example_reasoning_basic.md), but switch from on-policy to off-policy RFT.
+In this example, we consider an off-policy RL algorithm termed as OPMD (Online Policy Mirror Descent) in Trinity-RFT.
+The algorithm design and analysis can be found in Section 2.2 of [our paper](https://arxiv.org/abs/2509.24203).
 The config file is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml).

 To try out the OPMD algorithm:
@@ -22,14 +15,12 @@ trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
 Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).


-
-
-
-The red curve below shows an example of OPMD's learning curves.
+In the plot below, the red curve shows the score achieved by the explorer during OPMD training.
 Since the explorer's model weights remain unchanged for the first 10 steps, its score remains flat.
 Then, after the model weights of explorer and trainer are synchronized at the end of step 10, we see an abrupt increase in score at step 11, which indicates effective off-policy learning in the first 10 steps.
 A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_interval=2`).


-
 ![opmd](../../assets/opmd-curve.png)
+
+If you're interested in more findings about off-policy RL algorithms, please refer to [our paper](https://arxiv.org/abs/2509.24203).
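
Since the updated text above refers to OPMD only by name, a textbook-style sketch of the kind of objective that policy mirror descent methods optimize may help orient readers. The notation below (reward r, temperature τ, behavior policy π_old) is generic; the exact OPMD loss used by Trinity-RFT is the one derived in Section 2.2 of the linked paper.

```latex
% Generic KL-regularized (mirror-descent-style) objective -- illustrative only;
% the exact OPMD formulation is given in Section 2.2 of https://arxiv.org/abs/2509.24203.
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\big[\, r(x, y) \,\big]
  \;-\;
  \tau\,\mathbb{E}_{x \sim \mathcal{D}}\big[\,
    \mathrm{KL}\big( \pi_{\theta}(\cdot \mid x) \,\Vert\, \pi_{\mathrm{old}}(\cdot \mid x) \big)
  \,\big],
\qquad
\pi^{\star}(y \mid x) \;\propto\; \pi_{\mathrm{old}}(y \mid x)\,\exp\!\big( r(x, y)/\tau \big).
```

Here π_old plays the role of the behavior policy that generated the data (in this example, the explorer's possibly stale weights), and τ controls how far a single update may move away from it.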
Lines changed: 7 additions & 8 deletions
@@ -1,13 +1,10 @@
+(OPMD)=
 # Off-Policy RFT


-Let's continue with the [previous GSM8k example](./example_reasoning_basic.md) and show some advanced features provided by Trinity-RFT, namely off-policy or asynchronous RFT mode.
-
-(OPMD)=
-## OPMD: a native off-policy RL algorithm
-
-As an experimental feature of Trinity-RFT, we develop an embarrassingly simple off-policy RL algorithm, termed OPMD (Online Policy Mirror Descent, inspired by [Kimi k1.5](https://arxiv.org/abs/2501.12599)).
-The algorithm design and analysis can be found in Appendix A of the [Trinity-RFT technical report](https://arxiv.org/abs/2505.17826).
+Let's continue with the [previous GSM8k example](./example_reasoning_basic.md), but switch from on-policy to off-policy mode.
+In this example, we consider an off-policy RL algorithm called OPMD.
+The algorithm design and analysis can be found in Section 2.2 of [our paper](https://arxiv.org/abs/2509.24203).
 The config file for this example is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml)

 To try out the OPMD algorithm, run:
@@ -17,9 +14,11 @@ trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml

 Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of the explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).

-The red curve in the figure below shows an example of the OPMD learning process.
+The red curve in the figure below shows the score achieved by the explorer during OPMD training.
 Since the explorer's model weights remain unchanged for the first 10 steps, its score remains flat.
 Then, after the explorer and trainer synchronize their model weights at the end of step 10, we observe an abrupt increase in score at step 11, which indicates that the off-policy learning in the first 10 steps was effective.
 A similar performance boost appears again at step 21, and the final converged score matches what GRPO achieves in a mostly on-policy setting (`sync_interval=2`).

 ![opmd](../../assets/opmd-curve.png)
+
+Readers interested in off-policy RL algorithms may refer to [our paper](https://arxiv.org/abs/2509.24203).

examples/opmd_gsm8k/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Example: OPMD on GSM8K dataset

-This example shows the usage of OPMD on the GSM8K dataset.
+This example shows the usage of OPMD (an off-policy RL algorithm) on the GSM8K dataset.

 For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md).

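The `sync_interval` behaviour described in both language versions of the tutorial can be pictured with a small toy loop. The sketch below is purely illustrative: `ToyModel`, `ToyTrainer`, and `ToyExplorer` are invented stand-ins rather than Trinity-RFT classes, and the loop only tracks which weight version each side currently holds.

```python
"""Illustrative sketch of how `sync_interval` creates an off-policy gap between
the explorer and the trainer. All classes below are toy stand-ins invented for
this example; they are NOT Trinity-RFT APIs."""

import copy


class ToyModel:
    """Stand-in for a policy; tracks only a 'weights version' counter."""

    def __init__(self) -> None:
        self.version = 0


class ToyTrainer:
    """Holds its own copy of the model and updates it every training step."""

    def __init__(self, model: ToyModel) -> None:
        self.model = copy.deepcopy(model)

    def train_step(self, rollouts: dict) -> None:
        # Pretend one (off-policy) OPMD-style gradient update happened.
        self.model.version += 1


class ToyExplorer:
    """Generates rollouts with possibly stale weights."""

    def __init__(self, model: ToyModel) -> None:
        self.model = model

    def generate_rollouts(self) -> dict:
        # The rollouts are off-policy whenever this version lags the trainer's.
        return {"policy_version": self.model.version}

    def sync_from(self, trainer: ToyTrainer) -> None:
        self.model = copy.deepcopy(trainer.model)


SYNC_INTERVAL = 10  # the `sync_interval` value of 10 described in the tutorial

base = ToyModel()
explorer, trainer = ToyExplorer(base), ToyTrainer(base)

for step in range(1, 31):
    rollouts = explorer.generate_rollouts()
    trainer.train_step(rollouts)
    if step % SYNC_INTERVAL == 0:
        # Only here does the explorer pick up the trainer's latest weights,
        # which is why its score stays flat for 10 steps and then jumps at step 11.
        explorer.sync_from(trainer)
    print(step, "explorer:", explorer.model.version, "trainer:", trainer.model.version)
```

Running it, the explorer's version lags the trainer's by up to 9 steps and catches up at steps 10, 20, and 30, mirroring the flat-then-jump shape of the learning curve discussed above.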