docs/sphinx_doc/source/tutorial/trinity_configs.md
Lines changed: 0 additions & 31 deletions
@@ -376,11 +376,6 @@ actor_rollout_ref:
 use_dynamic_bsz: True
 ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
 grad_clip: 1.0
-clip_ratio: 0.2
-entropy_coeff: 0.001
-use_kl_loss: False # True for GRPO
-kl_loss_coef: 0.001 # for grpo
-kl_loss_type: low_var_kl # for grpo
 ppo_epochs: 1
 shuffle: False
 ulysses_sequence_parallel_size: 1 # sp size
@@ -399,10 +394,6 @@ actor_rollout_ref:
 param_offload: False
 optimizer_offload: False
 fsdp_size: -1
-# --- below: opmd ---
-tau: 0.000 # strength of regularization w.r.t. old / ref policy
-opmd_baseline: mean # mean / logavgexp, applicable to opmd
-use_uid: False # True / False, applicable to pairwise_opmd
 ref:
 fsdp_config:
 param_offload: False
@@ -447,22 +438,6 @@ critic:
 grad_clip: 1.0
 cliprange_value: 0.5
 
-custom_reward_function:
-path: null
-name: compute_score
-
-algorithm:
-gamma: 1.0
-lam: 1.0
-norm_adv_by_std_in_grpo: True
-use_kl_in_reward: False
-kl_penalty: kl # how to estimate kl divergence
-kl_ctrl:
-type: fixed
-kl_coef: 0.001
-horizon: 10000
-target_kl: 0.1
-
 trainer:
 balance_batch: True
 # total_training_steps: null
@@ -483,11 +458,7 @@ trainer:
 - `actor_rollout_ref.model.use_remove_padding`: Whether to remove pad tokens, which will reduce training time.
 - `actor_rollout_ref.actor.use_dynamic_bsz`: Whether to reorganize the batch data, specifically to splice the shorter data to reduce the batch size in the actual training process.
 - `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: Batch size for one GPU in one forward pass.
-- `actor_rollout_ref.actor.kl_loss_type`: How to compute kl loss, optional value is `kl`, `abs`, `mse` or `low_var_kl`.
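
The dotted option names in the bullets above refer to nested keys in the verl-style YAML config shown in the earlier hunks. Below is a minimal, hedged sketch of how such dotted names map onto the nesting; OmegaConf is used purely for illustration, and the values are placeholders rather than Trinity-RFT's actual defaults or loading code.

```python
# Minimal sketch: dotted names like `actor_rollout_ref.actor.use_dynamic_bsz`
# refer to nested YAML keys. OmegaConf is used only for illustration here;
# the values below are placeholders, not Trinity-RFT's defaults.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
actor_rollout_ref:
  model:
    use_remove_padding: True        # drop pad tokens to shorten training time
  actor:
    use_dynamic_bsz: True           # splice shorter data to reduce the actual batch size
    ppo_micro_batch_size_per_gpu: 4 # batch size per GPU for one forward pass
""")

print(cfg.actor_rollout_ref.actor.use_dynamic_bsz)               # True
print(cfg.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu)  # 4
```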
docs/sphinx_doc/source/tutorial/trinity_programming_guide.md
Lines changed: 3 additions & 3 deletions
@@ -443,13 +443,13 @@ The `AlgorithmType` class includes the following attributes and methods:
 - `use_advantage`: Whether to calculate Advantage; if False, the `AdvantageFn` call will be skipped
 - `can_balance_batch`: Whether the algorithm allows automatic balancing when splitting a batch into microbatches (which permute the order of samples)
 - `schema`: The format of experience data corresponding to the algorithm
-- `get_default_config`: Gets the default configuration of the algorithm, which will override attributes with the same name in `ALGORITHM_TYPE`
+- `default_config`: Gets the default configuration of the algorithm, which will override attributes with the same name in `ALGORITHM_TYPE`
 
 Similarly, after implementation, you need to register this module through `ALGORITHM_TYPE`.
 
 Below is the implementation for the OPMD algorithm.
 Since the OPMD algorithm doesn't need to use the Critic model, `use_critic` is set to `False`.
-The dictionary returned by the `get_default_config` method indicates that OPMD will use the `opmd` type `AdvantageFn` and `PolicyLossFn` implemented in Step 1, will not apply KL Penalty on rewards, but will add a `k2` type KL loss when calculating the final loss.
+The dictionary returned by the `default_config` method indicates that OPMD will use the `opmd` type `AdvantageFn` and `PolicyLossFn` implemented in Step 1, will not apply KL Penalty on rewards, but will add a `k2` type KL loss when calculating the final loss.
 
 ```python
 @ALGORITHM_TYPE.register_module("opmd")
@@ -463,7 +463,7 @@ class OPMDAlgorithm(AlgorithmType):
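
The body of the hunk above is truncated in this view. As a companion to the prose, here is a hedged sketch of what a registered algorithm with the renamed `default_config` method could look like; the import path, the classmethod form, and the dictionary keys (`advantage_fn`, `policy_loss_fn`, `kl_penalty_fn`, `kl_loss_fn`) are illustrative assumptions, not the verbatim Trinity-RFT source.

```python
# Hedged sketch reconstructed from the prose above -- not the verbatim source.
# Assumptions: the import path, `default_config` being a classmethod, and the
# dict keys ("advantage_fn", "policy_loss_fn", "kl_penalty_fn", "kl_loss_fn").
from trinity.algorithm import ALGORITHM_TYPE, AlgorithmType  # assumed import path


@ALGORITHM_TYPE.register_module("opmd")
class OPMDAlgorithm(AlgorithmType):
    """OPMD is critic-free, so `use_critic` stays False."""

    use_critic: bool = False    # OPMD does not train a Critic model
    use_advantage: bool = True  # keep the AdvantageFn call (the `opmd` one from Step 1)

    @classmethod
    def default_config(cls) -> dict:
        # Use the `opmd` AdvantageFn / PolicyLossFn registered in Step 1,
        # skip the KL penalty on rewards, and add a `k2`-type KL loss term.
        return {
            "advantage_fn": "opmd",
            "policy_loss_fn": "opmd",
            "kl_penalty_fn": None,  # no KL penalty applied to rewards
            "kl_loss_fn": "k2",     # KL loss added when computing the final loss
        }
```

Registering the class under the name `"opmd"` is what lets a config select this algorithm, and whatever `default_config` returns overrides the same-named attributes in `ALGORITHM_TYPE`, as described in the bullet list above.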