docs/sphinx_doc/source/tutorial/example_async_mode.md (83 additions, 11 deletions)

@@ -1,32 +1,104 @@
- # A quick example for asynchronous mode
+ # Asynchronous RFT

- This example shows how to run RFT in asynchronous mode with the GRPO algorithm, Qwen-2.5-1.5B-Instruct model and GSM8K dataset.
+ This example shows how to run RFT in a fully asynchronous mode with the GRPO algorithm, the Qwen-2.5-1.5B-Instruct model, and the GSM8K dataset.

Trinity-RFT supports an asynchronous mode by running the trainer and explorer in separate processes.

- For this purpose, we prepare two main config files: `trainer.yaml` and `explorer.yaml`.
- The main difference between them is that in `trainer.yaml` we set `mode=train`, while in `explorer.yaml` we set `mode=explore`.
- In addition, we need to configure the following parameters in both files.
+ For this purpose, we prepare two main config files: [`explorer.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/async_gsm8k/explorer.yaml) and [`trainer.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/async_gsm8k/trainer.yaml).
+ The main difference between them is that in `explorer.yaml` we set `mode` to `explore`, while in `trainer.yaml` we set `mode` to `train`.
The model weights of the explorer and trainer are synchronized once every `sync_interval * batch_size` tasks.

- ```yaml
- project: tutorial
- name: async_mode_example
- checkpoint_root_dir: /PATH/TO/CHECKPOINT
+ Suppose we have a node with 8 GPUs; we use 4 GPUs for the trainer and 4 GPUs for the explorer.
+ Some important setups of `explorer.yaml` are listed below:
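To make the split concrete, here is a minimal sketch of how the two files can differ. It only reuses keys that appear elsewhere in these docs (`project`, `name`, `checkpoint_root_dir`, `mode`, `synchronizer.sync_interval`); the values are placeholders, and the complete settings, including how the 4 GPUs are assigned to each role, are in the linked `explorer.yaml` and `trainer.yaml`.

```yaml
# explorer.yaml (sketch, not the full file; values are placeholders)
project: tutorial
name: async_example
checkpoint_root_dir: /PATH/TO/CHECKPOINT
mode: explore            # this process only generates rollouts
synchronizer:
  sync_interval: 10      # weights sync every sync_interval * batch_size tasks

# trainer.yaml (sketch) differs mainly in:
# mode: train            # this process only consumes experiences and updates weights
```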
docs/sphinx_doc/source/tutorial/example_dpo.md (24 additions, 13 deletions)

@@ -1,4 +1,4 @@
- # Example: Run DPO on Human-Like-DPO-Dataset
+ # Offline DPO

This example describes DPO based on the Qwen-2.5-1.5B-Instruct model and the [Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset).

@@ -40,25 +40,36 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa

We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed below:

- We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`.
+ We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and pass the data path to the trainer.
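A rough sketch of the corresponding part of `dpo.yaml` follows. The nesting under `buffer.trainer_input` and the `experience_buffer` key are assumptions made for illustration; the linked `dpo.yaml` has the authoritative keys.

```yaml
# Sketch only; key names under buffer are assumed, see dpo.yaml for the real ones
mode: train
buffer:
  trainer_input:
    experience_buffer:        # hypothetical key holding the offline DPO data
      path: /PATH/TO/Human-Like-DPO-Dataset
```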
docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md (2 additions, 3 deletions)

@@ -1,4 +1,4 @@
- # Example: off-policy RFT mode
+ # Off-Policy RFT

Let's continue with the [previous GSM8K example](./example_reasoning_basic.md) and show some advanced features provided by Trinity-RFT, namely, the off-policy or asynchronous RFT mode.

@@ -12,8 +12,7 @@ Let's continue with the [previous GSM8k example](./example_reasoning_basic.md) a

As an experimental feature of Trinity-RFT, we develop an embarrassingly simple off-policy RL algorithm, termed OPMD (Online Policy Mirror Descent, inspired by [Kimi k1.5](https://arxiv.org/abs/2501.12599)).
The algorithm design and analysis can be found in this [technical report](../../assets/opmd.pdf).
-
-
+ The config files are [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml) and [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
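To try OPMD in place of GRPO, the change should amount to selecting a different algorithm in the config. The `algorithm_type` field name below is an assumption for illustration; check the linked `opmd_gsm8k.yaml` for the exact key.

```yaml
# Sketch: switching the algorithm from GRPO to OPMD
# (algorithm_type is an assumed field name; see opmd_gsm8k.yaml for the real key)
algorithm:
  algorithm_type: opmd
```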
docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path_of_data_and_checkpoints>:/data trinity-rft:latest
+ ```

- This example shows how to run RFT with the Qwen-2.5-1.5B-Instruct model and GSM8K dataset.

## Step 1: Model and Data Preparation
@@ -37,31 +90,71 @@ More details on dataset downloading are referred to [ModelScope](https://modelsc

### Synchronous Mode of Trinity-RFT

- We run the experiment in a synchronous mode where the Explorer and Trainer operate in turn. To enable this mode, we config `mode` to `both` (default) and set `sync_interval` properly. A smaller value of `sync_interval` makes the training closer to an on-policy setup.
+ We run the experiment in a synchronous mode where the Explorer and Trainer operate in turn. To enable this mode, we set `mode` to `both` (the default) and choose `sync_interval` properly. A smaller value of `sync_interval` makes the training closer to an on-policy setup. For example, setting `sync_interval` to 1 simulates an on-policy setup.

- ```yaml
- mode: both
- synchronizer:
-   sync_method: 'nccl'
-   sync_interval: 2
- ```
+ ### Use GRPO Algorithm

- ### Use GRPO or PPO Algorithm
-
- We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important setups are listed in the following:
+ We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed below:
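The removed block above already shows the shape of the synchronous-mode settings; combined with the new wording about `sync_interval`, a sketch of this part of `gsm8k.yaml` looks like:

```yaml
# Synchronous mode: explorer and trainer take turns within one job
mode: both
synchronizer:
  sync_method: 'nccl'
  sync_interval: 1    # sync weights every batch, i.e. close to on-policy
```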
@@ -76,7 +169,7 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml

Before RFT, we may use SFT as a warmup step. We need to set `buffer.trainer_input.sft_warmup_steps > 0` and prepare the SFT data to `buffer.trainer_input.sft_warmup_dataset.path=$DATASET_PATH/{sft_data}`.

```yaml
- # Properly set the following configs in gsm8k.yaml
+ # Properly add the following configs in gsm8k.yaml
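# The keys below are a sketch based on the dotted paths named above;
# the step count is illustrative, not the value used in the actual example
buffer:
  trainer_input:
    sft_warmup_steps: 10
    sft_warmup_dataset:
      path: $DATASET_PATH/{sft_data}
```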