diff --git "a/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index b835cb959c..7c7ca5bcca 100644 --- "a/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -95,8 +95,8 @@ - exit_on_missing_checkpoint: 如果设置了`–-load`,但**找不到检查点,则直接退出**,而不是初始化。默认为True。 - 🔥async_save: 使用异步检查点保存。目前仅适用于`torch_dist`分布式检查点格式。默认为False。 - use_persistent_ckpt_worker: 使用持久化检查点工作进程用于异步保存,即创建专门后台进程来处理异步保存。默认为False。 -- ckpt_fully_parallel_load: 跨 DP 对分布式检查点使用完全加载并行化,加速权重加载速度。默认为False。 -- ckpt_assume_constant_structure: 如果在单个训练中,模型和优化器状态字典结构保持不变,允许Megatron进行额外检查点性能优化。默认为False。 +- ckpt_fully_parallel_load: 跨 DP 对分布式检查点使用完全加载并行化,加速权重加载速度。默认为True。 +- ckpt_assume_constant_structure: 如果在单个训练中,模型和优化器状态字典结构保持不变,允许Megatron进行额外检查点性能优化。默认为True。 **分布式参数**: 并行技术的选择请参考[训练技巧文档](快速开始.md#训练技巧)。 diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md index 37983e8e1b..5bc776fc43 100644 --- a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md +++ b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md @@ -99,8 +99,8 @@ - exit_on_missing_checkpoint: If `--load` is set but **no checkpoint is found, exit directly** instead of initializing. Default is True. - 🔥async_save: Use asynchronous checkpoint saving. Currently only applicable to the `torch_dist` distributed checkpoint format. Defaults to False. - use_persistent_ckpt_worker: Use a persistent checkpoint worker process for async saving, i.e., create a dedicated background process to handle asynchronous saving. Defaults to False. -- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to accelerate weight loading speed. Defaults to False. -- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant throughout a single training job, allows Megatron to perform additional checkpoint performance optimizations. Defaults to False. +- ckpt_fully_parallel_load: Apply full load parallelization across DP for distributed checkpoints to accelerate weight loading speed. Defaults to True. +- ckpt_assume_constant_structure: If the model and optimizer state dict structure remains constant throughout a single training job, allows Megatron to perform additional checkpoint performance optimizations. Defaults to True. **Distributed Parameters**: diff --git a/swift/megatron/argument/megatron_args.py b/swift/megatron/argument/megatron_args.py index 4cd2fc7a18..8f46599eb4 100644 --- a/swift/megatron/argument/megatron_args.py +++ b/swift/megatron/argument/megatron_args.py @@ -214,8 +214,8 @@ class MegatronArguments(ExtraMegatronArguments): exit_on_missing_checkpoint: bool = True async_save: bool = False use_persistent_ckpt_worker: bool = False - ckpt_fully_parallel_load: bool = False - ckpt_assume_constant_structure: bool = False + ckpt_fully_parallel_load: bool = True + ckpt_assume_constant_structure: bool = True # dist distributed_backend: Literal['nccl', 'gloo'] = 'nccl'