(OPMD)=
# Off-Policy RFT
Let's continue with the [previous GSM8k example](./example_reasoning_basic.md), but switch from on-policy to off-policy RFT.
In this example, we consider OPMD (Online Policy Mirror Descent), an off-policy RL algorithm implemented in Trinity-RFT.
The algorithm design and analysis can be found in Section 2.2 of [our paper](https://arxiv.org/abs/2509.24203).
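For intuition, a generic online policy mirror descent update maximizes the expected reward while penalizing divergence from the most recently synchronized policy. The formulation below is a hedged sketch in our own notation ($\mathcal{D}$ for the prompt distribution, $\pi_k$ for the policy that generated the current rollouts, $r(x, y)$ for the reward, $\tau > 0$ for a regularization temperature), not necessarily the exact OPMD objective from the paper:

$$
\pi_{k+1} \;=\; \arg\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \tau\, \mathrm{KL}\big(\pi \,\|\, \pi_k\big),
$$

whose maximizer takes the closed form $\pi_{k+1}(y \mid x) \propto \pi_k(y \mid x)\, \exp\big(r(x, y)/\tau\big)$. The KL anchor to the stale policy $\pi_k$ is what lets the trainer keep learning from rollouts collected by an explorer whose weights are synchronized only occasionally.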
The config file is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml).

To try out the OPMD algorithm:
```bash
trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
```
Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
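For orientation, the relevant settings look roughly like the excerpt below. This is an illustrative sketch rather than a copy of the real config: the exact field names and nesting are assumptions, so treat the linked [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml) as authoritative.

```yaml
# Illustrative sketch only -- field names and nesting are assumptions;
# see examples/opmd_gsm8k/opmd_gsm8k.yaml for the authoritative layout.
algorithm:
  algorithm_type: opmd    # use OPMD rather than an on-policy algorithm such as GRPO
synchronizer:
  sync_interval: 10       # sync explorer/trainer weights once every 10 training steps
                          # (a small value such as 2 gives a mostly on-policy run)
```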
In the plot below, the red curve shows the score achieved by the explorer during OPMD training.
Since the explorer's model weights remain unchanged for the first 10 steps, its score remains flat.
Then, after the model weights of explorer and trainer are synchronized at the end of step 10, we see an abrupt increase in score at step 11, which indicates effective off-policy learning in the first 10 steps.
A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_interval=2`).

If you're interested in more findings about off-policy RL algorithms, please refer to [our paper](https://arxiv.org/abs/2509.24203).