4 changes: 2 additions & 2 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -95,8 +95,8 @@
- exit_on_missing_checkpoint: If `--load` is set but **no checkpoint is found, exit directly** instead of initializing. Defaults to True.
- 🔥async_save: Use asynchronous checkpoint saving. Currently only applicable to the `torch_dist` distributed checkpoint format. Defaults to False.
- use_persistent_ckpt_worker: Use a persistent checkpoint worker process for async saving, i.e., create a dedicated background process to handle asynchronous saving. Defaults to False.
-- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to speed up weight loading. Defaults to False.
-- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant within a single training run, this allows Megatron to perform additional checkpoint performance optimizations. Defaults to False.
+- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to speed up weight loading. Defaults to True.
+- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant within a single training run, this allows Megatron to perform additional checkpoint performance optimizations. Defaults to True.

**Distributed Parameters**:
For the choice of parallelism techniques, please refer to the [training tips documentation](快速开始.md#训练技巧).
4 changes: 2 additions & 2 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -99,8 +99,8 @@
- exit_on_missing_checkpoint: If `--load` is set but **no checkpoint is found, exit directly** instead of initializing. Defaults to True.
- 🔥async_save: Use asynchronous checkpoint saving. Currently only applicable to the `torch_dist` distributed checkpoint format. Defaults to False.
- use_persistent_ckpt_worker: Use a persistent checkpoint worker process for async saving, i.e., create a dedicated background process to handle asynchronous saving. Defaults to False.
-- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to speed up weight loading. Defaults to False.
-- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant throughout a single training job, this allows Megatron to perform additional checkpoint performance optimizations. Defaults to False.
+- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to speed up weight loading. Defaults to True.
+- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant throughout a single training job, this allows Megatron to perform additional checkpoint performance optimizations. Defaults to True.


**Distributed Parameters**:
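Note that with both defaults flipped to True, disabling either optimization now requires passing an explicit false value on the command line. The sketch below is a minimal, self-contained illustration of how such boolean flags can be parsed and overridden; it reuses only the parameter names documented above, and the `str2bool` helper is a hypothetical stand-in, not ms-swift's actual parsing code.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical helper: accept common spellings of booleans on the CLI."""
    if value.lower() in {'true', '1', 'yes'}:
        return True
    if value.lower() in {'false', '0', 'no'}:
        return False
    raise argparse.ArgumentTypeError(f'invalid boolean: {value!r}')


parser = argparse.ArgumentParser()
# Defaults follow the updated documentation above: both flags are now on.
parser.add_argument('--ckpt_fully_parallel_load', type=str2bool, default=True)
parser.add_argument('--ckpt_assume_constant_structure', type=str2bool, default=True)

# Opting out of fully parallel load while keeping the constant-structure default.
args = parser.parse_args(['--ckpt_fully_parallel_load', 'false'])
assert args.ckpt_fully_parallel_load is False
assert args.ckpt_assume_constant_structure is True
```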
4 changes: 2 additions & 2 deletions swift/megatron/argument/megatron_args.py
@@ -214,8 +214,8 @@ class MegatronArguments(ExtraMegatronArguments):
    exit_on_missing_checkpoint: bool = True
    async_save: bool = False
    use_persistent_ckpt_worker: bool = False
-    ckpt_fully_parallel_load: bool = False
-    ckpt_assume_constant_structure: bool = False
+    ckpt_fully_parallel_load: bool = True
+    ckpt_assume_constant_structure: bool = True

    # dist
    distributed_backend: Literal['nccl', 'gloo'] = 'nccl'
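For programmatic use, the effect of this change is that a freshly constructed arguments object already has both checkpoint optimizations enabled. A minimal sketch, assuming a simplified stand-in dataclass for the checkpoint-related fields (the real `MegatronArguments` defines many more):

```python
from dataclasses import dataclass


@dataclass
class CheckpointArgs:
    """Simplified stand-in for the checkpoint-related fields above."""
    exit_on_missing_checkpoint: bool = True
    async_save: bool = False
    use_persistent_ckpt_worker: bool = False
    ckpt_fully_parallel_load: bool = True  # was False before this change
    ckpt_assume_constant_structure: bool = True  # was False before this change


# A default instance now has both checkpoint optimizations enabled.
args = CheckpointArgs()
assert args.ckpt_fully_parallel_load and args.ckpt_assume_constant_structure

# Opt out explicitly if the state dict structure changes during training.
conservative = CheckpointArgs(ckpt_assume_constant_structure=False)
assert conservative.ckpt_fully_parallel_load is True
```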