docs/sphinx_doc/source/tutorial/example_mix_algo.md (1 addition, 1 deletion)

@@ -46,7 +46,7 @@ class MIXAlgorithm(AlgorithmType):
schema: type = ExperienceModel

@classmethod
-def get_default_config(cls) -> Dict:
+def default_config(cls) -> Dict:
return {
"repeat_times": 8,
"policy_loss_fn": "mix",
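Call sites change with the rename: anything that previously invoked `get_default_config()` should now invoke `default_config()`. Below is a minimal, self-contained sketch of a hypothetical caller; the stand-in class only mirrors the values visible in the diff above, and the merge-with-user-overrides step is an assumed usage pattern rather than something this PR introduces.

```python
from typing import Dict


class MIXAlgorithm:
    """Stand-in for the class edited above; real attributes are omitted."""

    @classmethod
    def default_config(cls) -> Dict:  # renamed from get_default_config
        return {"repeat_times": 8, "policy_loss_fn": "mix"}


def resolve_algorithm_config(user_config: Dict) -> Dict:
    # Assumed pattern: start from the algorithm's defaults and let
    # user-supplied values take precedence.
    return {**MIXAlgorithm.default_config(), **user_config}


print(resolve_algorithm_config({"repeat_times": 4}))
# {'repeat_times': 4, 'policy_loss_fn': 'mix'}
```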

docs/sphinx_doc/source/tutorial/trinity_configs.md (31 deletions)

@@ -376,11 +376,6 @@ actor_rollout_ref:
use_dynamic_bsz: True
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: False # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -399,10 +394,6 @@ actor_rollout_ref:
param_offload: False
optimizer_offload: False
fsdp_size: -1
-# --- below: opmd ---
-tau: 0.000 # strength of regularization w.r.t. old / ref policy
-opmd_baseline: mean # mean / logavgexp, applicable to opmd
-use_uid: False # True / False, applicable to pairwise_opmd
ref:
fsdp_config:
param_offload: False
@@ -447,22 +438,6 @@ critic:
grad_clip: 1.0
cliprange_value: 0.5

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-norm_adv_by_std_in_grpo: True
-use_kl_in_reward: False
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-horizon: 10000
-target_kl: 0.1
-
trainer:
balance_batch: True
# total_training_steps: null
@@ -483,11 +458,7 @@ trainer:
- `actor_rollout_ref.model.use_remove_padding`: Whether to remove pad tokens, which will reduce training time.
- `actor_rollout_ref.actor.use_dynamic_bsz`: Whether to reorganize the batch data, specifically to splice the shorter data to reduce the batch size in the actual training process.
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: Batch size for one GPU in one forward pass.
-- `actor_rollout_ref.actor.kl_loss_type`: How to compute kl loss, optional value is `kl`, `abs`, `mse` or `low_var_kl`.
- `actor_rollout_ref.actor.ulysses_sequence_parallel_size`: Ulysses sequence parallel size.
-- `actor_rollout_ref.actor.tau`: strength of regularization w.r.t. old / ref policy.
-- `actor_rollout_ref.actor.opmd_baseline`: mean / logavgexp, applicable to opmd.
-- `actor_rollout_ref.actor.use_uid`: True / False, applicable to pairwise_opmd.
- `actor_rollout_ref.actor.optim.lr`: Learning rate for actor model.
- `actor_rollout_ref.actor.optim.lr_warmup_steps_ratio`: Ratio of warmup steps for learning rate.
- `actor_rollout_ref.actor.optim.warmup_style`: Warmup style for learning rate.
@@ -505,8 +476,6 @@ trainer:
- `critic.grad_clip`: Gradient clip for critic model training.
- `critic.cliprange_value`: Used for compute value loss.

-- `algorithm`: Training algorithm settings.
-
- `trainer.balance_batch`: Whether to balance batch size between GPUs during training.
- `trainer.resume_mode`: Resume mode for training. Support `disable`, `auto` and `resume_path`.
- `trainer.resume_from_path`: Path to resume from.

docs/sphinx_doc/source/tutorial/trinity_programming_guide.md (3 additions, 3 deletions)

@@ -443,13 +443,13 @@ The `AlgorithmType` class includes the following attributes and methods:
- `use_advantage`: Whether to calculate Advantage; if False, the `AdvantageFn` call will be skipped
- `can_balance_batch`: Whether the algorithm allows automatic balancing when splitting a batch into microbatches (which permute the order of samples)
- `schema`: The format of experience data corresponding to the algorithm
-- `get_default_config`: Gets the default configuration of the algorithm, which will override attributes with the same name in `ALGORITHM_TYPE`
+- `default_config`: Gets the default configuration of the algorithm, which will override attributes with the same name in `ALGORITHM_TYPE`

Similarly, after implementation, you need to register this module through `ALGORITHM_TYPE`.

Below is the implementation for the OPMD algorithm.
Since the OPMD algorithm doesn't need to use the Critic model, `use_critic` is set to `False`.
-The dictionary returned by the `get_default_config` method indicates that OPMD will use the `opmd` type `AdvantageFn` and `PolicyLossFn` implemented in Step 1, will not apply KL Penalty on rewards, but will add a `k2` type KL loss when calculating the final loss.
+The dictionary returned by the `default_config` method indicates that OPMD will use the `opmd` type `AdvantageFn` and `PolicyLossFn` implemented in Step 1, will not apply KL Penalty on rewards, but will add a `k2` type KL loss when calculating the final loss.

```python
@ALGORITHM_TYPE.register_module("opmd")
@@ -463,7 +463,7 @@ class OPMDAlgorithm(AlgorithmType):
schema: type = ExperienceModel

@classmethod
-def get_default_config(cls) -> Dict:
+def default_config(cls) -> Dict:
return {
"repeat_times": 2,
"sample_strategy": "warmup",
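The programming-guide excerpt above describes the general pattern for adding an algorithm: set the class attributes, implement `default_config`, and register the class through `ALGORITHM_TYPE`. The sketch below reproduces that flow end to end so it can run standalone; the `_Registry`, `AlgorithmType`, and `ExperienceModel` definitions are hypothetical stand-ins for Trinity-RFT's real classes, and `MyAlgorithm`, the `"my_algo"` key, and the returned values are placeholders patterned on the OPMD and MIX examples.

```python
from typing import Callable, Dict, Type


class _Registry:
    """Hypothetical stand-in for Trinity-RFT's ALGORITHM_TYPE registry."""

    def __init__(self) -> None:
        self._modules: Dict[str, Type] = {}

    def register_module(self, name: str) -> Callable[[Type], Type]:
        def decorator(cls: Type) -> Type:
            self._modules[name] = cls
            return cls
        return decorator

    def get(self, name: str) -> Type:
        return self._modules[name]


ALGORITHM_TYPE = _Registry()


class AlgorithmType:
    """Hypothetical stand-in for the base class described in the guide."""


class ExperienceModel:
    """Hypothetical stand-in for the experience schema class."""


@ALGORITHM_TYPE.register_module("my_algo")  # made-up registry key
class MyAlgorithm(AlgorithmType):
    use_critic: bool = False        # skip the Critic model, as OPMD does
    use_advantage: bool = True      # keep the AdvantageFn call
    can_balance_batch: bool = True  # allow microbatch balancing
    schema: type = ExperienceModel  # experience data format

    @classmethod
    def default_config(cls) -> Dict:
        # Placeholder values patterned on the OPMD/MIX examples above.
        return {
            "repeat_times": 2,
            "sample_strategy": "warmup",
            "policy_loss_fn": "opmd",
        }


print(ALGORITHM_TYPE.get("my_algo").default_config())
```

Only the decorator-based `register_module` call and the `default_config` classmethod mirror usage shown in the guide; everything else is an assumption about how the registry might behave.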

examples/async_gsm8k/verl_config.yaml (21 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
param_offload: False
optimizer_offload: False
fsdp_size: -1
-# --- below: opmd ---
-tau: 0.000 # strength of regularization w.r.t. old / ref policy
-opmd_baseline: mean # mean / logavgexp, applicable to opmd
-use_uid: False # True / False, applicable to pairwise_opmd
ref:
fsdp_config:
param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-
trainer:
balance_batch: True
# total_training_steps: null

examples/dpo_humanlike/train_dpo.yaml (17 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True
-kl_loss_coef: 0.1 # NOTE: beta for DPO
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -46,18 +41,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-
trainer:
balance_batch: False
total_training_steps: 783 #

examples/grpo_alfworld/train_alfworld.yaml (17 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -44,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-
trainer:
balance_batch: True
# total_training_steps: null

examples/grpo_gsm8k/train_gsm8k.yaml (21 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
param_offload: False
optimizer_offload: False
fsdp_size: -1
-# --- below: opmd ---
-tau: 0.000 # strength of regularization w.r.t. old / ref policy
-opmd_baseline: mean # mean / logavgexp, applicable to opmd
-use_uid: False # True / False, applicable to pairwise_opmd
ref:
fsdp_config:
param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-
trainer:
balance_batch: True
# total_training_steps: null

examples/grpo_math/train_math.yaml (21 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.0001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
param_offload: False
optimizer_offload: False
fsdp_size: -1
-# --- below: opmd ---
-tau: 0.000 # strength of regularization w.r.t. old / ref policy
-opmd_baseline: mean # mean / logavgexp, applicable to opmd
-use_uid: False # True / False, applicable to pairwise_opmd
ref:
fsdp_config:
param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.0001
-
trainer:
balance_batch: True
# auto: find the last ckpt to resume. If can't find, start from scratch

examples/grpo_sciworld/train_sciworld.yaml (17 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -44,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-
trainer:
balance_batch: True
# total_training_steps: null

examples/grpo_webshop/train_webshop.yaml (17 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -44,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-
trainer:
balance_batch: True
# total_training_steps: null

examples/mix_math/train_mix_math.yaml (21 deletions)

@@ -12,11 +12,6 @@ actor_rollout_ref:
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 25600 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: True # True for GRPO
-kl_loss_coef: 0.0001 # for grpo
-kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
@@ -33,10 +28,6 @@ actor_rollout_ref:
param_offload: False
optimizer_offload: False
fsdp_size: -1
-# --- below: opmd ---
-tau: 0.000 # strength of regularization w.r.t. old / ref policy
-opmd_baseline: mean # mean / logavgexp, applicable to opmd
-use_uid: False # True / False, applicable to pairwise_opmd
ref:
fsdp_config:
param_offload: False
@@ -48,18 +39,6 @@ actor_rollout_ref:
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size

-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.0001
-
trainer:
balance_batch: True
# auto: find the last ckpt to resume. If can't find, start from scratch