docs/sphinx_doc/source/tutorial/trinity_configs.md
Lines changed: 13 additions & 3 deletions
@@ -436,10 +436,12 @@ Specifies the backend and behavior of the trainer.
 ```yaml
 trainer:
   name: trainer
-  trainer_type: 'verl'
-  save_interval: 100
+  trainer_type: "verl"
+  trainer_strategy: "fsdp"
   total_steps: 1000
+  save_interval: 100
   save_strategy: "unrestricted"
+  save_hf_checkpoint: "last"
   grad_clip: 1.0
   use_dynamic_bsz: true
   max_token_len_per_gpu: 16384
@@ -449,13 +451,21 @@ trainer:
 
 - `name`: Name of the trainer. This name will be used as the Ray actor's name, so it must be unique.
 - `trainer_type`: Trainer backend implementation. Currently only supports `verl`.
-- `save_interval`: Frequency (in steps) at which to save model checkpoints.
+- `trainer_strategy`: Strategy for VeRL trainer. Default is `fsdp`. Options include:
+  - `fsdp`: Use PyTorch FSDP.
+  - `fsdp2`: Use PyTorch FSDP2.
+  - `megatron`: Use Megatron-LM.
 - `total_steps`: Total number of training steps.
+- `save_interval`: Frequency (in steps) at which to save model checkpoints.
 - `save_strategy`: The parallel strategy used when saving the model. Defaults to `unrestricted`. The available options are as follows:
   - `single_thread`: Only one thread across the entire system is allowed to save the model; saving tasks from different threads are executed sequentially.
   - `single_process`: Only one process across the entire system is allowed to perform saving; multiple threads within that process can handle saving tasks in parallel, while saving operations across different processes are executed sequentially.
   - `single_node`: Only one compute node across the entire system is allowed to perform saving; processes and threads within that node can work in parallel, while saving operations across different nodes are executed sequentially.
   - `unrestricted`: No restrictions on saving operations; multiple nodes, processes, or threads are allowed to save the model simultaneously.
+- `save_hf_checkpoint`: Whether to save the model in HuggingFace format. Default is `last`. Note that saving in HuggingFace format consumes additional time, storage space, and GPU memory, which may impact training performance or lead to out-of-memory errors. Options include:
+  - `last`: Save only the last checkpoint in HuggingFace format.
+  - `always`: Save all checkpoints in HuggingFace format.
+  - `never`: Do not save in HuggingFace format.
 - `grad_clip`: Gradient clipping for updates.
 - `use_dynamic_bsz`: Whether to use dynamic batch size.
 - `max_token_len_per_gpu`: The maximum number of tokens to be processed in forward and backward when updating the policy. Effective when `use_dynamic_bsz=true`.
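
The option lists documented in this change can be cross-checked with a small script before launching a run. The following is a minimal sketch, not the project's own validation code: the helper name `check_trainer_config` and the `ALLOWED` / `DEFAULTS` tables are hypothetical, and only the option values and defaults stated in the documentation above are assumed.

```python
# Hypothetical helper, not part of the project: checks a `trainer` mapping
# against the option values listed in the documentation above.
ALLOWED = {
    "trainer_type": {"verl"},
    "trainer_strategy": {"fsdp", "fsdp2", "megatron"},
    "save_strategy": {"single_thread", "single_process", "single_node", "unrestricted"},
    "save_hf_checkpoint": {"last", "always", "never"},
}

# Defaults as stated in the documentation; keys without a documented default are omitted.
DEFAULTS = {
    "trainer_strategy": "fsdp",
    "save_strategy": "unrestricted",
    "save_hf_checkpoint": "last",
}


def check_trainer_config(trainer: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    merged = {**DEFAULTS, **trainer}  # user-supplied values override defaults
    problems = []
    for key, allowed in ALLOWED.items():
        if key in merged and merged[key] not in allowed:
            problems.append(f"{key}={merged[key]!r} is not one of {sorted(allowed)}")
    return problems


# The sample configuration from the diff above, written as a Python dict.
example = {
    "name": "trainer",
    "trainer_type": "verl",
    "trainer_strategy": "fsdp",
    "total_steps": 1000,
    "save_interval": 100,
    "save_strategy": "unrestricted",
    "save_hf_checkpoint": "last",
}

print(check_trainer_config(example))
print(check_trainer_config({"trainer_type": "verl", "trainer_strategy": "deepspeed"}))
```

The sample configuration passes, while an undocumented value such as `deepspeed` for `trainer_strategy` is reported. Keeping the allowed values in one table makes it easy to update the check when a new strategy is documented.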