
Commit f24db44

Update config manager (#86)
1 parent 5cb9ebe commit f24db44

27 files changed: 544 additions, 581 deletions


docs/sphinx_doc/source/tutorial/example_mix_algo.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -46,7 +46,7 @@ class MIXAlgorithm(AlgorithmType):
     schema: type = ExperienceModel

     @classmethod
-    def get_default_config(cls) -> Dict:
+    def default_config(cls) -> Dict:
         return {
             "repeat_times": 8,
             "policy_loss_fn": "mix",
```

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 0 additions & 31 deletions

```diff
@@ -376,11 +376,6 @@ actor_rollout_ref:
     use_dynamic_bsz: True
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: False # True for GRPO
-    kl_loss_coef: 0.001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -399,10 +394,6 @@ actor_rollout_ref:
       param_offload: False
       optimizer_offload: False
       fsdp_size: -1
-    # --- below: opmd ---
-    tau: 0.000 # strength of regularization w.r.t. old / ref policy
-    opmd_baseline: mean # mean / logavgexp, applicable to opmd
-    use_uid: False # True / False, applicable to pairwise_opmd
   ref:
     fsdp_config:
       param_offload: False
@@ -447,22 +438,6 @@ critic:
   grad_clip: 1.0
   cliprange_value: 0.5

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  norm_adv_by_std_in_grpo: True
-  use_kl_in_reward: False
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-    horizon: 10000
-    target_kl: 0.1
-
 trainer:
   balance_batch: True
   # total_training_steps: null
@@ -483,11 +458,7 @@ trainer:
 - `actor_rollout_ref.model.use_remove_padding`: Whether to remove pad tokens, which will reduce training time.
 - `actor_rollout_ref.actor.use_dynamic_bsz`: Whether to reorganize the batch data, specifically to splice the shorter data to reduce the batch size in the actual training process.
 - `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: Batch size for one GPU in one forward pass.
-- `actor_rollout_ref.actor.kl_loss_type`: How to compute kl loss, optional value is `kl`, `abs`, `mse` or `low_var_kl`.
 - `actor_rollout_ref.actor.ulysses_sequence_parallel_size`: Ulysses sequence parallel size.
-- `actor_rollout_ref.actor.tau`: strength of regularization w.r.t. old / ref policy.
-- `actor_rollout_ref.actor.opmd_baseline`: mean / logavgexp, applicable to opmd.
-- `actor_rollout_ref.actor.use_uid`: True / False, applicable to pairwise_opmd.
 - `actor_rollout_ref.actor.optim.lr`: Learning rate for actor model.
 - `actor_rollout_ref.actor.optim.lr_warmup_steps_ratio`: Ratio of warmup steps for learning rate.
 - `actor_rollout_ref.actor.optim.warmup_style`: Warmup style for learning rate.
@@ -505,8 +476,6 @@ trainer:
 - `critic.grad_clip`: Gradient clip for critic model training.
 - `critic.cliprange_value`: Used for compute value loss.

-- `algorithm`: Training algorithm settings.
-
 - `trainer.balance_batch`: Whether to balance batch size between GPUs during training.
 - `trainer.resume_mode`: Resume mode for training. Support `disable`, `auto` and `resume_path`.
 - `trainer.resume_from_path`: Path to resume from.
```
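The deleted keys (`clip_ratio`, `entropy_coeff`, the KL-loss settings, the OPMD parameters, and the whole `custom_reward_function` and `algorithm` blocks) are no longer part of the documented user-facing YAML. A sketch of the lookup-with-fallback pattern this implies, assuming PyYAML and a hypothetical `get_with_default` helper; the `DEFAULTS` table is illustrative only and is not the project's real default source.

```python
# Sketch only: look up a dotted config key in a trimmed YAML config,
# falling back to an algorithm-level default when the key is absent.
from typing import Any, Dict

import yaml  # PyYAML

TRIMMED_YAML = """
actor_rollout_ref:
  actor:
    use_dynamic_bsz: true
    grad_clip: 1.0
"""

# Values that used to sit in the YAML; here only for illustration.
DEFAULTS = {
    "actor_rollout_ref.actor.clip_ratio": 0.2,
    "actor_rollout_ref.actor.entropy_coeff": 0.001,
}


def get_with_default(cfg: Dict[str, Any], dotted_key: str) -> Any:
    """Walk a dotted key through nested dicts, falling back to DEFAULTS."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return DEFAULTS.get(dotted_key)
        node = node[part]
    return node


cfg = yaml.safe_load(TRIMMED_YAML)
print(get_with_default(cfg, "actor_rollout_ref.actor.grad_clip"))   # 1.0 (from YAML)
print(get_with_default(cfg, "actor_rollout_ref.actor.clip_ratio"))  # 0.2 (fallback)
```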

docs/sphinx_doc/source/tutorial/trinity_programming_guide.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -443,13 +443,13 @@ The `AlgorithmType` class includes the following attributes and methods:
 - `use_advantage`: Whether to calculate Advantage; if False, the `AdvantageFn` call will be skipped
 - `can_balance_batch`: Whether the algorithm allows automatic balancing when splitting a batch into microbatches (which permute the order of samples)
 - `schema`: The format of experience data corresponding to the algorithm
-- `get_default_config`: Gets the default configuration of the algorithm, which will override attributes with the same name in `ALGORITHM_TYPE`
+- `default_config`: Gets the default configuration of the algorithm, which will override attributes with the same name in `ALGORITHM_TYPE`

 Similarly, after implementation, you need to register this module through `ALGORITHM_TYPE`.

 Below is the implementation for the OPMD algorithm.
 Since the OPMD algorithm doesn't need to use the Critic model, `use_critic` is set to `False`.
-The dictionary returned by the `get_default_config` method indicates that OPMD will use the `opmd` type `AdvantageFn` and `PolicyLossFn` implemented in Step 1, will not apply KL Penalty on rewards, but will add a `k2` type KL loss when calculating the final loss.
+The dictionary returned by the `default_config` method indicates that OPMD will use the `opmd` type `AdvantageFn` and `PolicyLossFn` implemented in Step 1, will not apply KL Penalty on rewards, but will add a `k2` type KL loss when calculating the final loss.

 ```python
 @ALGORITHM_TYPE.register_module("opmd")
@@ -463,7 +463,7 @@ class OPMDAlgorithm(AlgorithmType):
     schema: type = ExperienceModel

     @classmethod
-    def get_default_config(cls) -> Dict:
+    def default_config(cls) -> Dict:
         return {
             "repeat_times": 2,
             "sample_strategy": "warmup",
```

examples/async_gsm8k/verl_config.yaml

Lines changed: 0 additions & 21 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True # True for GRPO
-    kl_loss_coef: 0.001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
       param_offload: False
       optimizer_offload: False
       fsdp_size: -1
-    # --- below: opmd ---
-    tau: 0.000 # strength of regularization w.r.t. old / ref policy
-    opmd_baseline: mean # mean / logavgexp, applicable to opmd
-    use_uid: False # True / False, applicable to pairwise_opmd
   ref:
     fsdp_config:
       param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-
 trainer:
   balance_batch: True
   # total_training_steps: null
```

examples/dpo_humanlike/train_dpo.yaml

Lines changed: 0 additions & 17 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True
-    kl_loss_coef: 0.1 # NOTE: beta for DPO
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -46,18 +41,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-
 trainer:
   balance_batch: False
   total_training_steps: 783 #
```

examples/grpo_alfworld/train_alfworld.yaml

Lines changed: 0 additions & 17 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True # True for GRPO
-    kl_loss_coef: 0.001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -44,18 +39,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-
 trainer:
   balance_batch: True
   # total_training_steps: null
```

examples/grpo_gsm8k/train_gsm8k.yaml

Lines changed: 0 additions & 21 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True # True for GRPO
-    kl_loss_coef: 0.001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
      param_offload: False
      optimizer_offload: False
      fsdp_size: -1
-    # --- below: opmd ---
-    tau: 0.000 # strength of regularization w.r.t. old / ref policy
-    opmd_baseline: mean # mean / logavgexp, applicable to opmd
-    use_uid: False # True / False, applicable to pairwise_opmd
   ref:
     fsdp_config:
       param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-
 trainer:
   balance_batch: True
   # total_training_steps: null
```

examples/grpo_math/train_math.yaml

Lines changed: 0 additions & 21 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True # True for GRPO
-    kl_loss_coef: 0.0001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
      param_offload: False
      optimizer_offload: False
      fsdp_size: -1
-    # --- below: opmd ---
-    tau: 0.000 # strength of regularization w.r.t. old / ref policy
-    opmd_baseline: mean # mean / logavgexp, applicable to opmd
-    use_uid: False # True / False, applicable to pairwise_opmd
   ref:
     fsdp_config:
       param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.0001
-
 trainer:
   balance_batch: True
   # auto: find the last ckpt to resume. If can't find, start from scratch
```

examples/grpo_sciworld/train_sciworld.yaml

Lines changed: 0 additions & 17 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True # True for GRPO
-    kl_loss_coef: 0.001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -44,18 +39,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-
 trainer:
   balance_batch: True
   # total_training_steps: null
```

examples/grpo_webshop/train_webshop.yaml

Lines changed: 0 additions & 17 deletions

```diff
@@ -12,11 +12,6 @@ actor_rollout_ref:
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
-    clip_ratio: 0.2
-    entropy_coeff: 0.001
-    use_kl_loss: True # True for GRPO
-    kl_loss_coef: 0.001 # for grpo
-    kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
@@ -44,18 +39,6 @@ actor_rollout_ref:
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-  path: null
-  name: compute_score
-
-algorithm:
-  gamma: 1.0
-  lam: 1.0
-  kl_penalty: kl # how to estimate kl divergence
-  kl_ctrl:
-    type: fixed
-    kl_coef: 0.001
-
 trainer:
   balance_batch: True
   # total_training_steps: null
```
