4 changes: 2 additions & 2 deletions docs/source/Megatron-SWIFT/命令行参数.md

@@ -95,8 +95,8 @@
 - exit_on_missing_checkpoint: If `--load` is set but **no checkpoint is found, exit directly** instead of initializing. Defaults to True.
 - 🔥async_save: Use asynchronous checkpoint saving. Currently only applicable to the `torch_dist` distributed checkpoint format. Defaults to False.
 - use_persistent_ckpt_worker: Use a persistent checkpoint worker process for asynchronous saving, i.e., create a dedicated background process to handle async saves. Defaults to False.
-- ckpt_fully_parallel_load: Apply fully parallel load of distributed checkpoints across DP to speed up weight loading. Defaults to False.
-- ckpt_assume_constant_structure: If the model and optimizer state dict structure stays constant within a single training run, allows Megatron to apply additional checkpoint performance optimizations. Defaults to False.
+- ckpt_fully_parallel_load: Apply fully parallel load of distributed checkpoints across DP to speed up weight loading. Defaults to True.
+- ckpt_assume_constant_structure: If the model and optimizer state dict structure stays constant within a single training run, allows Megatron to apply additional checkpoint performance optimizations. Defaults to True.

 **Distributed Parameters**:
 For the choice of parallelism techniques, refer to the [training tips documentation](快速开始.md#训练技巧).
4 changes: 2 additions & 2 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md

@@ -99,8 +99,8 @@
 - exit_on_missing_checkpoint: If `--load` is set but **no checkpoint is found, exit directly** instead of initializing. Defaults to True.
 - 🔥async_save: Use asynchronous checkpoint saving. Currently only applicable to the `torch_dist` distributed checkpoint format. Defaults to False.
 - use_persistent_ckpt_worker: Use a persistent checkpoint worker process for async saving, i.e., create a dedicated background process to handle asynchronous saving. Defaults to False.
-- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to accelerate weight loading. Defaults to False.
-- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant throughout a single training job, allows Megatron to perform additional checkpoint performance optimizations. Defaults to False.
+- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to accelerate weight loading. Defaults to True.
+- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant throughout a single training job, allows Megatron to perform additional checkpoint performance optimizations. Defaults to True.

 **Distributed Parameters**:
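Because both flags now default to True, users who relied on the previous behavior must opt out explicitly at launch time. The sketch below is a hypothetical, standalone model of the flag parsing: ms-swift's real parser is generated from the dataclass and is not shown in this diff, and `str2bool` is an illustrative helper, not a project API. It only demonstrates how the flipped defaults interact with explicit overrides.

```python
# Hypothetical stand-in for the CLI flag parsing; not the project's parser.
import argparse


def str2bool(v: str) -> bool:
    """Parse 'true'/'false'-style CLI values into booleans."""
    if v.lower() in ('true', '1', 'yes'):
        return True
    if v.lower() in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError(f'expected a boolean, got {v!r}')


parser = argparse.ArgumentParser()
# New defaults after this commit: both optimizations are on unless overridden.
parser.add_argument('--ckpt_fully_parallel_load', type=str2bool, default=True)
parser.add_argument('--ckpt_assume_constant_structure', type=str2bool, default=True)

# No flags passed: the new defaults apply.
print(parser.parse_args([]))
# Opting back into the pre-commit behavior:
print(parser.parse_args(['--ckpt_fully_parallel_load', 'false',
                         '--ckpt_assume_constant_structure', 'false']))
```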
4 changes: 2 additions & 2 deletions swift/megatron/argument/megatron_args.py

@@ -214,8 +214,8 @@ class MegatronArguments(ExtraMegatronArguments):
     exit_on_missing_checkpoint: bool = True
     async_save: bool = False
     use_persistent_ckpt_worker: bool = False
-    ckpt_fully_parallel_load: bool = False
-    ckpt_assume_constant_structure: bool = False
+    ckpt_fully_parallel_load: bool = True
+    ckpt_assume_constant_structure: bool = True

     # dist
     distributed_backend: Literal['nccl', 'gloo'] = 'nccl'
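Since these fields live on a dataclass, the new defaults take effect anywhere the arguments object is constructed without explicit values, while per-field overrides still win. A minimal self-contained sketch, where `CheckpointArgs` is a hypothetical stand-in (the real `MegatronArguments` inherits many more fields from `ExtraMegatronArguments`):

```python
# Simplified stand-in for MegatronArguments, illustrating the default flip.
from dataclasses import asdict, dataclass
from typing import Literal


@dataclass
class CheckpointArgs:
    exit_on_missing_checkpoint: bool = True
    async_save: bool = False
    use_persistent_ckpt_worker: bool = False
    # Defaults flipped from False to True by this commit.
    ckpt_fully_parallel_load: bool = True
    ckpt_assume_constant_structure: bool = True
    distributed_backend: Literal['nccl', 'gloo'] = 'nccl'


# Instantiating without arguments picks up the new True defaults...
print(asdict(CheckpointArgs()))
# ...and explicit construction still restores the old behavior per field.
print(CheckpointArgs(ckpt_fully_parallel_load=False).ckpt_fully_parallel_load)
```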