
Commit a182b1f

Fix in Config (#30)
1 parent 6c56401 commit a182b1f


40 files changed: +528 -352 lines changed


docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 1 addition & 2 deletions
@@ -40,7 +40,7 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa
 
 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
 
-We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`. The value of `sync_iteration_interval` can be set as same of the value of `save_interval`.
+We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`.
 
 ```yaml
 # In dpo.yaml
@@ -50,7 +50,6 @@ synchronizer:
 buffer:
   train_dataset:
     storage_type: file
-    algorithm_type: dpo
     path: <$DATASET_PATH/human_like_dpo_dataset>
     kwargs:
       prompt_type: <prompt_type> # messages/plaintext
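
Putting the two hunks above together, the relevant top-level settings in `dpo.yaml` look roughly like the sketch below; the `sync_interval` and `sync_timeout` values simply mirror `examples/dpo_humanlike/dpo.yaml` as updated in this commit and are illustrative rather than required:

```yaml
# Train-only mode: there is no Explorer, so model weights are picked up from saved checkpoints
mode: train
synchronizer:
  sync_method: 'checkpoint'
  sync_interval: 30
  sync_timeout: 1200
```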

docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ To try out the OPMD algorithm:
 trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
 ```
 
-Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
+Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
 Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
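
For reference, a sketch of the synchronizer block in `opmd_gsm8k.yaml` that produces this off-policy setting; `sync_method: 'nccl'` and `sync_timeout: 1200` are the values used by the other GSM8K examples in this commit and are assumed here rather than quoted from the OPMD config:

```yaml
synchronizer:
  sync_method: 'nccl'   # trainer pushes weights to the explorer over NCCL
  sync_interval: 10     # sync only once every 10 training steps -> off-policy exploration
  sync_timeout: 1200
```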

@@ -30,7 +30,7 @@ Other configurations of particular interest are explained at the beginning of [`
 The red curve below shows an example of OPMD's learning curves.
 Since the explorer's model weights remain unchanged for the first 10 steps, its score remains flat.
 Then, after the model weights of explorer and trainer are synchronized at the end of step 10, we see an abrupt increase in score at step 11, which indicates effective off-policy learning in the first 10 steps.
-A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_iteration_interval=2`).
+A similar performance boost is shown at step 21, which leads to a converged score matching what is achieved by GRPO in a mostly on-policy case (with `sync_interval=2`).
docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

Lines changed: 4 additions & 5 deletions
@@ -37,13 +37,13 @@ More details on dataset downloading are referred to [ModelScope](https://modelsc
 
 ### Synchronous Mode of Trinity-RFT
 
-We run the experiment in a synchronous mode where the Explorer and Trainer operate in turn. To enable this mode, we config `mode` to `both` (default) and set `sync_iteration_interval` properly. A smaller value of `sync_iteration_interval` makes the training closer to an on-policy setup.
+We run the experiment in a synchronous mode where the Explorer and Trainer operate in turn. To enable this mode, we config `mode` to `both` (default) and set `sync_interval` properly. A smaller value of `sync_interval` makes the training closer to an on-policy setup.
 
 ```yaml
 mode: both
 synchronizer:
   sync_method: 'nccl'
-  sync_iteration_interval: 2
+  sync_interval: 2
 ```
 
 ### Use GRPO or PPO Algorithm
@@ -76,21 +76,20 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 
 ## Optional: RFT with SFT Warmup
 
-Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and prepare the SFT data to `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.
+Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_steps > 0` and prepare the SFT data to `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.
 
 ```yaml
 # Properly set the following configs in gsm8k.yaml
 buffer:
   sft_warmup_dataset:
     storage_type: file
-    algorithm_type: sft
     path: <$DATASET_PATH/{sft_data}>
     kwargs:
       prompt_type: <prompt_type> # messages/plaintext/chatpair
       prompt_key: <prompt_key>
       response_key: <response_key>
 trainer:
-  sft_warmup_iteration: 10
+  sft_warmup_steps: 10
 ```
 
 The following command runs SFT and RFT in sequence:

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 6 additions & 6 deletions
@@ -49,7 +49,7 @@ data:
 - `data.max_retry_times`: The maximum number of retries when loading the dataset from database.
 - `data.max_retry_interval`: The maximum interval between retries when loading the dataset from database.
 - `data.total_epochs`: The total number of epochs to explore the dataset. Default is `1`. It should be set manually.
-- `data.batch_size`: The number of `Task` in one training batch. The real batch size used in training is `data.batch_size` * `actor_rollout_ref.rollout.n` Default is `1`. It should be set manually.
+- `data.batch_size`: The number of `Task` in one training batch. The real batch size used in training is `data.batch_size` * `explorer.repeat_times`. It should be set manually.
 - `data.default_workflow_type`: The default workflow type used for training.
 - `data.default_reward_fn_type`: The default reward function type used for training.
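
As a quick worked example of the updated batch-size formula (the numbers below are purely illustrative):

```yaml
data:
  batch_size: 32     # 32 Tasks per training batch
explorer:
  repeat_times: 8    # 8 rollouts generated per Task
# real batch size used in training: 32 * 8 = 256 experiences
```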

@@ -150,14 +150,14 @@ explorer:
 ```yaml
 synchronizer:
   sync_method: 'nccl'
-  sync_iteration_interval: 10
+  sync_interval: 10
   sync_timeout: 1200
 ```
 
 - `synchronizer.sync_method`: The synchronization method between `trainer` and `explorer`.
 Support `nccl` and `checkpoint`, `nccl` represents that model weights in `explorer` will be synchronized from `trainer` through `nccl`,
 `checkpoint` represents that `explorer` will load the newest checkpoints saved by `trainer` then update its model weights. Default is `nccl`.
-- `synchronizer.sync_iteration_interval`: The interval between two synchronizations. Default is `10`. It should be set manually.
+- `synchronizer.sync_interval`: The interval between two synchronizations. Default is `10`. It should be set manually.
 - `synchronizer.sync_timeout`: The timeout of the synchronization. Default is `1200`.
 
 ## Trainer
@@ -167,15 +167,15 @@ trainer:
   trainer_type: 'verl'
   algorithm_type: ppo
   trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
-  sft_warmup_iteration: 0
+  sft_warmup_steps: 0
   eval_interval: 1000
   save_interval: 100
 ```
 
 - `trainer.trainer_type`: The backend of the trainer, Only `verl` is supported.
 - `trainer.algorithm_type`: The type of the algorithm, Support `ppo`, `grpo`, `opmd` and `dpo`.
 - `trainer.trainer_config_path`: The path to the trainer configuration file. It must be set manually.
-- `trainer.sft_warmup_iteration`: The number of iterations to warm up the model. Default is `0`.
+- `trainer.sft_warmup_steps`: The number of steps to warm up the model. Default is `0`.
 - `trainer.eval_interval`: The interval between two evaluations. Default is `1000`.
 - `trainer.save_interval`: The interval between two checkpoints. Default is `100`.
@@ -418,7 +418,7 @@ trainer:
 - `trainer.balance_batch`: Whether to balance batch size between GPUs during training.
 - `trainer.resume_mode`: Resume mode for training. Support `disable`, `auto` and `resume_path`.
 - `trainer.resume_from_path`: Path to resume from.
-- `trainer.critic_warmup`: The number of iteration to train the critic model before actual policy learning.
+- `trainer.critic_warmup`: The number of steps to train the critic model before actual policy learning.
 - `trainer.default_hdfs_dir`: Default HDFS directory for saving checkpoints.
 - `trainer.remove_previous_ckpt_in_save`: Whether to remove previous checkpoints in save.
 - `trainer.del_local_ckpt_after_load`: Whether to delete local checkpoints after loading.
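
A minimal sketch of how the resume-related options above might be combined under `trainer`; the nesting and values are assumptions for illustration, not taken from a shipped config:

```yaml
trainer:
  balance_batch: true
  resume_mode: resume_path                 # one of 'disable', 'auto', 'resume_path'
  resume_from_path: <path/to/checkpoint>   # path to resume from
  critic_warmup: 10                        # train the critic for 10 steps before policy updates
```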

examples/async_gsm8k/explorer.yaml

Lines changed: 2 additions & 2 deletions
@@ -48,9 +48,9 @@ synchronizer:
   sync_iteration_interval: 10
 trainer:
   trainer_type: 'verl'
-  algorithm_type: ppo
+  algorithm_type: grpo
   trainer_config_path: examples/async_gsm8k/verl_config.yaml
-  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
+  sft_warmup_steps: 0 # Set to integer to enable sft warmup
   eval_interval: 10
 monitor:
   cache_root_dir: ""

examples/async_gsm8k/trainer.yaml

Lines changed: 2 additions & 2 deletions
@@ -48,9 +48,9 @@ synchronizer:
   sync_iteration_interval: 10
 trainer:
   trainer_type: 'verl'
-  algorithm_type: ppo
+  algorithm_type: grpo
   trainer_config_path: examples/async_gsm8k/verl_config.yaml
-  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
+  sft_warmup_steps: 0 # Set to integer to enable sft warmup
   eval_interval: 10
 monitor:
   cache_root_dir: ""

examples/dpo_humanlike/dpo.yaml

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ explorer:
   max_waiting_steps: 4
 synchronizer:
   sync_method: 'checkpoint'
-  sync_iteration_interval: 30
+  sync_interval: 30
   sync_timeout: 1200
 trainer:
   trainer_type: 'verl'

examples/grpo_alfworld/alfworld.yaml

Lines changed: 3 additions & 3 deletions
@@ -39,14 +39,14 @@ explorer:
   max_pending_requests: 32
   max_waiting_steps: 4
   gpu_memory_utilization: 0.7
-  enable_chunked_prefil: true
+  enable_chunked_prefill: true
 synchronizer:
   sync_method: 'nccl'
-  sync_iteration_interval: 8
+  sync_interval: 8
   sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
-  algorithm_type: ppo
+  algorithm_type: grpo
   trainer_config_path: 'examples/grpo_alfworld/train_alfworld.yaml'
   save_interval: 10
 monitor:

examples/grpo_gsm8k/gsm8k.yaml

Lines changed: 3 additions & 3 deletions
@@ -60,13 +60,13 @@ explorer:
   max_waiting_steps: 4
 synchronizer:
   sync_method: 'nccl'
-  sync_iteration_interval: 2
+  sync_interval: 2
   sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
-  algorithm_type: ppo
+  algorithm_type: grpo
   trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
-  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
+  sft_warmup_steps: 0 # Set to integer to enable sft warmup
   eval_interval: 50
   save_interval: 100
   # get_exp_strategy: 'LFU'

examples/grpo_math/math.yaml

Lines changed: 3 additions & 3 deletions
@@ -46,13 +46,13 @@ explorer:
   max_waiting_steps: 4
 synchronizer:
   sync_method: 'nccl'
-  sync_iteration_interval: 2
+  sync_interval: 2
   sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
-  algorithm_type: ppo
+  algorithm_type: grpo
   trainer_config_path: 'examples/grpo_math/train_math.yaml'
-  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
+  sft_warmup_steps: 0 # Set to integer to enable sft warmup
   eval_interval: 10
   save_interval: 100
 monitor:
