Commit 50aba82

Config Refactor (#27)
1 parent 3651ad8 commit 50aba82

42 files changed: +521, -325 lines

docs/sphinx_doc/source/conf.py

Lines changed: 0 additions & 1 deletion

@@ -40,7 +40,6 @@
 
 templates_path = ["_templates"]
 exclude_patterns = ["build"]
-autodoc_mock_imports = ["ray"]
 
 autodoc_default_options = {
     "members": True,

docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 2 additions & 3 deletions

@@ -40,13 +40,13 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa
 
 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
 
-We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `offline`. The value of `sync_iteration_interval` can be set as same of the value of `save_freq`.
+We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`. The value of `sync_iteration_interval` can be set as same of the value of `save_interval`.
 
 ```yaml
 # In dpo.yaml
 mode: train
 synchronizer:
-  sync_method: 'offline'
+  sync_method: 'checkpoint'
 buffer:
   train_dataset:
     storage_type: file

@@ -63,7 +63,6 @@ trainer:
 # In train_dpo.yaml
 actor_rollout_ref:
   actor:
-    alg_type: dpo
     use_kl_loss: True
     kl_loss_coef: 0.1 # value of beta in DPO
 ```
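
Read together with the `dpo.yaml` changes later in this commit, the renamed train-mode settings form a single fragment. The following is a sketch for orientation only, with values copied from `examples/dpo_humanlike/dpo.yaml`:

```yaml
# Sketch: train mode with checkpoint-based synchronization
# (values taken from examples/dpo_humanlike/dpo.yaml in this commit).
mode: train
synchronizer:
  sync_method: 'checkpoint'   # explorer reloads the newest checkpoint saved by the trainer
  sync_iteration_interval: 30 # kept equal to trainer.save_interval
trainer:
  trainer_type: 'verl'
  algorithm_type: dpo
  save_interval: 30
```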

docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

Lines changed: 1 addition & 1 deletion

@@ -42,7 +42,7 @@ We run the experiment in a synchronous mode where the Explorer and Trainer opera
 ```yaml
 mode: both
 synchronizer:
-  sync_method: 'online'
+  sync_method: 'nccl'
   sync_iteration_interval: 2
 ```

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 8 additions & 20 deletions

@@ -15,17 +15,6 @@ monitor:
 - `monitor.name`: The name of the experiment. It must be set manually.
 
 
-## Monitor
-
-```yaml
-monitor:
-  project: "Trinity-RFT-countdown"
-  name: "qwen2.5-1.5B-countdown"
-```
-
-- `monitor.project`: The project name. It must be set manually.
-- `monitor.name`: The name of the experiment. It must be set manually.
-
 ## Data
 
 <!-- The `data` configuration specifies the data used for training. It includes the total number of epochs, the batch size, the path to the dataset, the default workflow type, the default reward function type, and the format configuration. -->

@@ -131,8 +120,6 @@ explorer:
   enforce_eager: true
   dtype: bfloat16
   temperature: 1.0
-  top_p: 1.0
-  top_k: -1
   seed: 42
   logprobs: 0
   repeat_times: 5

@@ -150,8 +137,6 @@ explorer:
 - `explorer.enforce_eager`: Whether to enforce eager mode. Default is `True`.
 - `explorer.dtype`: The data type used in vLLM. Default is `bfloat16`.
 - `explorer.temperature`: The temperature used in vLLM. Default is `1.0`.
-- `explorer.top_p`: The top-p used in vLLM. Default is `1.0`.
-- `explorer.top_k`: The top-k used in vLLM. Default is `-1`.
 - `explorer.seed`: The seed used in vLLM. Default is `42`.
 - `explorer.logprobs`: The logprobs used in vLLM. Default is `0`.
 - `explorer.repeat_times`: The number of times to repeat each task, used for GRPO-like algorithms. Default is `5`.

@@ -164,12 +149,16 @@ explorer:
 
 ```yaml
 synchronizer:
-  sync_method: 'online'
+  sync_method: 'nccl'
   sync_iteration_interval: 10
+  sync_timeout: 1200
 ```
 
-- `synchronizer.sync_method`: The synchronization method, Support `online` and `offline`. Default is `online`.
+- `synchronizer.sync_method`: The synchronization method between `trainer` and `explorer`.
+Support `nccl` and `checkpoint`, `nccl` represents that model weights in `explorer` will be synchronized from `trainer` through `nccl`,
+`checkpoint` represents that `explorer` will load the newest checkpoints saved by `trainer` then update its model weights. Default is `nccl`.
 - `synchronizer.sync_iteration_interval`: The interval between two synchronizations. Default is `10`. It should be set manually.
+- `synchronizer.sync_timeout`: The timeout of the synchronization. Default is `1200`.
 
 ## Trainer
 

@@ -180,13 +169,15 @@ trainer:
   trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
   sft_warmup_iteration: 0
   eval_interval: 1000
+  save_interval: 100
 ```
 
 - `trainer.trainer_type`: The backend of the trainer, Only `verl` is supported.
 - `trainer.algorithm_type`: The type of the algorithm, Support `ppo`, `grpo`, `opmd` and `dpo`.
 - `trainer.trainer_config_path`: The path to the trainer configuration file. It must be set manually.
 - `trainer.sft_warmup_iteration`: The number of iterations to warm up the model. Default is `0`.
 - `trainer.eval_interval`: The interval between two evaluations. Default is `1000`.
+- `trainer.save_interval`: The interval between two checkpoints. Default is `100`.
 
 ### veRL Trainer Configuration
 

@@ -249,7 +240,6 @@ actor_rollout_ref:
     optimizer_offload: False
     fsdp_size: -1
     # --- below: opmd ---
-    alg_type: ppo # ppo / opmd / pairwise_opmd
     tau: 0.000 # strength of regularization w.r.t. old / ref policy
     opmd_baseline: mean # mean / logavgexp, applicable to opmd
     use_uid: False # True / False, applicable to pairwise_opmd

@@ -403,7 +393,6 @@ trainer:
 - `actor_rollout_ref.actor.kl_loss_coef`: The coefficient of kl loss.
 - `actor_rollout_ref.actor.kl_loss_type`: How to compute kl loss, optional value is `kl`, `abs`, `mse` or `low_var_kl`.
 - `actor_rollout_ref.actor.ulysses_sequence_parallel_size`: Ulysses sequence parallel size.
-- `actor_rollout_ref.actor.alg_type`: Used for OPMD, optional value is `ppo`, `opmd` or `pairwise_opmd`.
 - `actor_rollout_ref.actor.tau`: strength of regularization w.r.t. old / ref policy.
 - `actor_rollout_ref.actor.opmd_baseline`: mean / logavgexp, applicable to opmd.
 - `actor_rollout_ref.actor.use_uid`: True / False, applicable to pairwise_opmd.

@@ -427,7 +416,6 @@ trainer:
 - `algorithm`: Training algorithm settings.
 
 - `trainer.balance_batch`: Whether to balance batch size between GPUs during training.
-- `trainer.save_freq`: Frequency of saving checkpoints.
 - `trainer.resume_mode`: Resume mode for training. Support `disable`, `auto` and `resume_path`.
 - `trainer.resume_from_path`: Path to resume from.
 - `trainer.critic_warmup`: The number of iteration to train the critic model before actual policy learning.
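
Putting the documented pieces together, the refactored `synchronizer` and `trainer` sections read roughly as follows. This is a sketch assembled from the defaults shown in the diff above, not a complete configuration; the comments note the pre-refactor names this commit replaces:

```yaml
# Sketch assembled from the documented defaults above; not a complete config.
synchronizer:
  sync_method: 'nccl'       # was 'online'; the former 'offline' is now 'checkpoint'
  sync_iteration_interval: 10
  sync_timeout: 1200        # new option introduced by this commit
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
  trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
  sft_warmup_iteration: 0
  eval_interval: 1000
  save_interval: 100        # new here; trainer.save_freq is dropped from the verl trainer yaml
```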

examples/dpo_humanlike/dpo.yaml

Lines changed: 3 additions & 3 deletions

@@ -37,8 +37,6 @@ explorer:
   enforce_eager: true
   dtype: bfloat16
   temperature: 1.0
-  top_p: 1.0
-  top_k: -1
   seed: 42
   logprobs: 0
   repeat_times: 1 # NOTE

@@ -47,12 +45,14 @@ explorer:
   max_pending_requests: 32
   max_waiting_steps: 4
 synchronizer:
-  sync_method: 'offline'
+  sync_method: 'checkpoint'
   sync_iteration_interval: 30
+  sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
   algorithm_type: dpo
   trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
+  save_interval: 30
 monitor:
   cache_root_dir: ""
   project: "dpo_example"

examples/dpo_humanlike/train_dpo.yaml

Lines changed: 0 additions & 2 deletions

@@ -23,7 +23,6 @@ actor_rollout_ref:
     enable_gradient_checkpointing: True
     use_remove_padding: False
   actor:
-    alg_type: dpo
     strategy: fsdp # This is for backward-compatibility
     ppo_mini_batch_size: 32
     # ppo_micro_batch_size: 8 # will be deprecated, use ppo_micro_batch_size_per_gpu

@@ -170,7 +169,6 @@ trainer:
   val_generations_to_log_to_wandb: 0
   nnodes: 1
   n_gpus_per_node: 2
-  save_freq: 30
   # auto: find the last ckpt to resume. If can't find, start from scratch
   resume_mode: auto # or auto or resume_path if
   test_freq: 5

examples/grpo_alfworld/alfworld.yaml

Lines changed: 3 additions & 3 deletions

@@ -31,8 +31,6 @@ explorer:
   enforce_eager: true
   dtype: bfloat16
   temperature: 1.0
-  top_p: 1.0
-  top_k: -1
   seed: 42
   logprobs: 0
   repeat_times: 8

@@ -43,12 +41,14 @@ explorer:
   gpu_memory_utilization: 0.7
   enable_chunked_prefil: true
 synchronizer:
-  sync_method: 'online'
+  sync_method: 'nccl'
   sync_iteration_interval: 8
+  sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
   algorithm_type: ppo
   trainer_config_path: 'examples/grpo_alfworld/train_alfworld.yaml'
+  save_interval: 10
 monitor:
   cache_root_dir: ""
   project: "ALFWORLD"

examples/grpo_alfworld/train_alfworld.yaml

Lines changed: 0 additions & 1 deletion

@@ -169,7 +169,6 @@ trainer:
   val_generations_to_log_to_wandb: 0
   nnodes: 1
   n_gpus_per_node: 2
-  save_freq: 1
   # auto: find the last ckpt to resume. If can't find, start from scratch
   resume_mode: auto # or auto or resume_path if
   test_freq: 100

examples/grpo_gsm8k/gsm8k.yaml

Lines changed: 3 additions & 3 deletions

@@ -51,8 +51,6 @@ explorer:
   enforce_eager: true
   dtype: bfloat16
   temperature: 1.0
-  top_p: 1.0
-  top_k: -1
   seed: 42
   logprobs: 0
   repeat_times: 8

@@ -61,14 +59,16 @@ explorer:
   max_pending_requests: 32
   max_waiting_steps: 4
 synchronizer:
-  sync_method: 'online'
+  sync_method: 'nccl'
   sync_iteration_interval: 2
+  sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
   algorithm_type: ppo
   trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
   sft_warmup_iteration: 0 # Set to integer to enable sft warmup
   eval_interval: 50
+  save_interval: 100
   # get_exp_strategy: 'LFU'
 monitor:
   cache_root_dir: ""
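
The example configs touched by this commit all follow the same pattern; below is a sketch of the shared fragment, using the `gsm8k.yaml` values shown above. The DPO example, which runs in train mode, uses `checkpoint` instead of `nccl`:

```yaml
# Sketch of the recurring pattern across the refactored example configs
# (values from examples/grpo_gsm8k/gsm8k.yaml above).
synchronizer:
  sync_method: 'nccl'        # 'checkpoint' in the train-mode DPO example
  sync_iteration_interval: 2
  sync_timeout: 1200
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
  trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
  save_interval: 100         # appears to replace save_freq, which is removed from the train_*.yaml files
```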

examples/grpo_gsm8k/train_gsm8k.yaml

Lines changed: 0 additions & 2 deletions

@@ -52,7 +52,6 @@ actor_rollout_ref:
     optimizer_offload: False
     fsdp_size: -1
     # --- below: opmd ---
-    alg_type: ppo # ppo / opmd / pairwise_opmd
     tau: 0.000 # strength of regularization w.r.t. old / ref policy
     opmd_baseline: mean # mean / logavgexp, applicable to opmd
     use_uid: False # True / False, applicable to pairwise_opmd

@@ -174,7 +173,6 @@ trainer:
   val_generations_to_log_to_wandb: 0
   nnodes: 1
   n_gpus_per_node: 2
-  save_freq: 100
   # auto: find the last ckpt to resume. If can't find, start from scratch
   resume_mode: auto # or auto or resume_path if
   test_freq: 5
