`docs/sphinx_doc/source/tutorial/example_dpo.md` (1 addition, 2 deletions)

````diff
@@ -40,7 +40,7 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa
We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
-We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`. The value of `sync_iteration_interval` can be set as same of the value of `save_interval`.
+We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`.
````
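For orientation, here is a minimal sketch of the train-mode setup described by the new sentence. It is assembled from the keys named in the prose (`mode`, `synchronizer.sync_method`), not copied from the actual `dpo.yaml`, so treat it as illustrative:

```yaml
# Train-only mode: no Explorer runs, so the trainer picks up weights from saved checkpoints.
mode: train
synchronizer:
  sync_method: 'checkpoint'
```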
`docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md` (2 additions, 2 deletions)

````diff
@@ -20,7 +20,7 @@ To try out the OPMD algorithm:
trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
```
-Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
+Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
@@ -30,7 +30,7 @@ Other configurations of particular interest are explained at the beginning of [`
The red curve below shows an example of OPMD's learning curves.
Since the explorer's model weights remain unchanged for the first 10 steps, its score remains flat.
Then, after the model weights of explorer and trainer are synchronized at the end of step 10, we see an abrupt increase in score at step 11, which indicates effective off-policy learning in the first 10 steps.
-A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_iteration_interval=2`).
+A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_interval=2`).
````
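The `sync_interval: 10` setting discussed in this file corresponds to a synchronizer block along the lines of the sketch below; the `nccl` method is assumed here for illustration, and the authoritative values live in `examples/opmd_gsm8k/opmd_gsm8k.yaml`:

```yaml
synchronizer:
  sync_method: 'nccl'
  # Explorer and trainer weights are synchronized only once every 10 training steps,
  # so the steps in between are trained off-policy.
  sync_interval: 10
```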
`docs/sphinx_doc/source/tutorial/example_reasoning_basic.md` (4 additions, 5 deletions)

````diff
@@ -37,13 +37,13 @@ More details on dataset downloading are referred to [ModelScope](https://modelsc
### Synchronous Mode of Trinity-RFT
-We run the experiment in a synchronous mode where the Explorer and Trainer operate in turn. To enable this mode, we config `mode` to `both` (default) and set `sync_iteration_interval` properly. A smaller value of `sync_iteration_interval` makes the training closer to an on-policy setup.
+We run the experiment in a synchronous mode where the Explorer and Trainer operate in turn. To enable this mode, we config `mode` to `both` (default) and set `sync_interval` properly. A smaller value of `sync_interval` makes the training closer to an on-policy setup.
```yaml
mode: both
synchronizer:
  sync_method: 'nccl'
-  sync_iteration_interval: 2
+  sync_interval: 2
```
### Use GRPO or PPO Algorithm
@@ -76,21 +76,20 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml
## Optional: RFT with SFT Warmup
-Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and prepare the SFT data to `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.
+Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_steps > 0` and prepare the SFT data to `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.
```yaml
# Properly set the following configs in gsm8k.yaml
````
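As a rough sketch of the SFT-warmup settings named in the new sentence: the YAML nesting and the value `10` below are assumptions made for illustration, not the actual contents of `gsm8k.yaml`:

```yaml
trainer:
  sft_warmup_steps: 10   # any value > 0 enables SFT warmup before RFT; 10 is illustrative
buffer:
  train_dataset:
    path: $DATASET_PATH/{sft_data}   # the prepared SFT data
```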
`docs/sphinx_doc/source/tutorial/trinity_configs.md` (6 additions, 6 deletions)

````diff
@@ -49,7 +49,7 @@ data:
- `data.max_retry_times`: The maximum number of retries when loading the dataset from database.
- `data.max_retry_interval`: The maximum interval between retries when loading the dataset from database.
- `data.total_epochs`: The total number of epochs to explore the dataset. Default is `1`. It should be set manually.
-- `data.batch_size`: The number of `Task` in one training batch. The real batch size used in training is `data.batch_size` * `actor_rollout_ref.rollout.n` Default is `1`. It should be set manually.
+- `data.batch_size`: The number of `Task` in one training batch. The real batch size used in training is `data.batch_size` * `explorer.repeat_times`. It should be set manually.
- `data.default_workflow_type`: The default workflow type used for training.
- `data.default_reward_fn_type`: The default reward function type used for training.
@@ -150,14 +150,14 @@ explorer:
```yaml
synchronizer:
  sync_method: 'nccl'
-  sync_iteration_interval: 10
+  sync_interval: 10
  sync_timeout: 1200
```
- `synchronizer.sync_method`: The synchronization method between `trainer` and `explorer`.
Support `nccl` and `checkpoint`, `nccl` represents that model weights in `explorer` will be synchronized from `trainer` through `nccl`,
`checkpoint` represents that `explorer` will load the newest checkpoints saved by `trainer` then update its model weights. Default is `nccl`.
-- `synchronizer.sync_iteration_interval`: The interval between two synchronizations. Default is `10`. It should be set manually.
+- `synchronizer.sync_interval`: The interval between two synchronizations. Default is `10`. It should be set manually.
- `synchronizer.sync_timeout`: The timeout of the synchronization. Default is `1200`.
````
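To make the updated `data.batch_size` description concrete, here is a hedged numeric example; the values are made up purely for illustration:

```yaml
data:
  batch_size: 32      # number of `Task`s per training batch (illustrative value)
explorer:
  repeat_times: 8     # rollouts generated per Task (illustrative value)
# Effective training batch size = data.batch_size * explorer.repeat_times = 32 * 8 = 256 experiences per step.
```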