Commit 1dda899

[megatron] update benchmark docs (#4991)

1 parent: eb7e7d8

File tree

3 files changed: +20 -12 lines changed


docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 7 additions & 4 deletions
````diff
@@ -9,9 +9,12 @@ SWIFT introduces Megatron's parallelism techniques to accelerate the training of large models, including data
 ```shell
 # Recommended torch version: 2.5 / 2.6
 pip install pybind11
+
 # transformer_engine
 # If an installation error occurs, refer to this issue for a solution: https://github.com/modelscope/ms-swift/issues/3793
 pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.3
+# If the above command fails, you can also install it as follows:
+# pip install transformer_engine[pytorch]
 
 # apex
 git clone https://github.com/NVIDIA/apex
````
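
One practical note on the fallback added above: in shells that expand square brackets as glob patterns (zsh, for example), the extra should be quoted. A minimal sketch, not part of the commit:

```shell
# Quoting prevents zsh from treating [pytorch] as a glob pattern
pip install "transformer_engine[pytorch]"
```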
```diff
@@ -134,8 +137,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 
 | | Megatron-LM | Deepspeed-ZeRO2 | Deepspeed-ZeRO3 |
 | -------- | ----------- | ---------- | ---------- |
-| Training speed | 2.93s/it | 6.02s/it | 24.30s/it |
-| GPU memory usage | 8\*66GB | 8\*72GB | 8\*50GB |
+| Training speed | 2.95s/it | 6.02s/it | 24.30s/it |
+| GPU memory usage | 8\*57GB | 8\*72GB | 8\*50GB |
 
 
 ## Command Line Arguments
```
```diff
@@ -232,8 +235,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - 🔥sequence_parallel: Enable the sequence-parallel optimization. Default is False.
 - 🔥context_parallel_size: CP size, default is 1.
 - tp_comm_overlap: Overlap tensor-parallel communication with GEMM (general matrix multiplication) kernels (to reduce communication time). Default is False.
-- overlap_grad_reduce: Overlap grad reduce operations in DDP (to reduce DP communication time). Default is False.
-- overlap_param_gather: Overlap the parameter all-gather in the distributed optimizer (to reduce DP communication time). Default is False.
+- 🔥overlap_grad_reduce: Overlap grad reduce operations in DDP (to reduce DP communication time). Default is False.
+- 🔥overlap_param_gather: Overlap the parameter all-gather in the distributed optimizer (to reduce DP communication time). Default is False.
 - distributed_timeout_minutes: Timeout for torch.distributed (in minutes). This parameter is ineffective; it is controlled by ddp_timeout in the [Base Arguments](./命令行参数.md#基本参数), with a default of 300000 minutes.
 
 **Logging Parameters**:
```

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 7 additions & 4 deletions
````diff
@@ -10,9 +10,12 @@ To use Megatron-SWIFT, in addition to installing the `swift` dependencies, you a
 ```shell
 # Recommended PyTorch version: 2.5 / 2.6
 pip install pybind11
+
 # transformer_engine
 # If an installation error occurs, you can refer to this issue for resolution: https://github.com/modelscope/ms-swift/issues/3793
 pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.3
+# If the above command fails, you can also install it using the following command:
+# pip install transformer_engine[pytorch]
 
 # apex
 git clone https://github.com/NVIDIA/apex
````
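
Whichever install path succeeds, a quick import check confirms that transformer_engine and apex were built against the current PyTorch before launching training. An optional sketch, not taken from the commit:

```shell
# Both extensions must import cleanly for Megatron-SWIFT to use them
python -c "import transformer_engine.pytorch; print('transformer_engine OK')"
python -c "import apex; print('apex OK')"
```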
```diff
@@ -138,8 +141,8 @@ The speed comparison of full-parameter training for Dense/MoE models using `mega
 
 | | Megatron-LM | Deepspeed-ZeRO2 | Deepspeed-ZeRO3 |
 | ---------------- | ----------- | --------------- | --------------- |
-| Training Speed | 2.93s/it | 6.02s/it | 24.30s/it |
-| GPU Memory Usage | 8\*66GB | 8\*72GB | 8\*50GB |
+| Training Speed | 2.95s/it | 6.02s/it | 24.30s/it |
+| GPU Memory Usage | 8\*57GB | 8\*72GB | 8\*50GB |
 
 ## Command Line Arguments
 
```
```diff
@@ -239,8 +242,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
 - 🔥sequence_parallel: Enable sequence parallel optimization. Default is False.
 - 🔥context_parallel_size: CP (Context Parallelism) size, default is 1.
 - tp_comm_overlap: Overlap tensor parallel communication with GEMM (General Matrix Multiplication) kernels (to reduce communication time). Default is False.
-- overlap_grad_reduce: Overlap grad reduction operations in DDP (to reduce DP communication time). Default is False.
-- overlap_param_gather: Overlap all-gather of parameters in the distributed optimizer (to reduce DP communication time). Default is False.
+- 🔥overlap_grad_reduce: Overlap grad reduction operations in DDP (to reduce DP communication time). Default is False.
+- 🔥overlap_param_gather: Overlap all-gather of parameters in the distributed optimizer (to reduce DP communication time). Default is False.
 - distributed_timeout_minutes: The timeout duration for torch.distributed (in minutes). This parameter is deprecated and is now controlled by the `ddp_timeout` in the [Base Arguments](./Command-line-parameters.md#base-arguments), with a default value of 300000 minutes.
 
 **Logging Parameters**:
```
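
For context, the two newly highlighted overlap flags are passed like any other `megatron sft` argument. A minimal sketch with placeholder checkpoint and dataset values (assumptions, not a command from this commit):

```shell
# Sketch only: enable DP-communication overlap on top of a basic fine-tuning run
NPROC_PER_NODE=8 \
megatron sft \
    --load <megatron_checkpoint_dir> \
    --dataset <dataset_id_or_path> \
    --sequence_parallel true \
    --overlap_grad_reduce true \
    --overlap_param_gather true
```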

examples/train/megatron/moe/moe.sh

Lines changed: 6 additions & 4 deletions
```diff
@@ -1,5 +1,4 @@
-# pp2ep4: 7 * 73GiB, 2.5s/it
-# tp2ep4: 8 * 65GiB, 3s/it
+# 8 * 57GiB, 2.95s/it
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=8 \
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
@@ -8,6 +7,7 @@ megatron sft \
     --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
     --split_dataset_ratio 0.01 \
     --pipeline_model_parallel_size 2 \
+    --decoder_last_pipeline_num_layers 11 \
     --expert_model_parallel_size 4 \
     --moe_grouped_gemm true \
     --moe_shared_expert_overlap true \
@@ -17,7 +17,9 @@ megatron sft \
     --packing true \
     --moe_permute_fusion true \
     --moe_router_dtype fp32 \
-    --recompute_granularity selective \
+    --recompute_granularity full \
+    --recompute_method uniform \
+    --recompute_num_layers 1 \
     --max_epochs 1 \
     --finetune true \
     --cross_entropy_loss_fusion true \
@@ -33,4 +35,4 @@ megatron sft \
     --no_save_optim true \
     --no_save_rng true \
     --sequence_parallel true \
-    --use_flash_attn true
+    --attention_backend flash
```
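
The script now trades compute for memory: full, uniform activation recomputation with one layer per checkpoint group, and `--attention_backend flash` replaces the removed `--use_flash_attn` flag. A minimal sketch isolating just these changed knobs, with placeholder checkpoint and dataset values (assumptions, not part of the commit):

```shell
# Sketch only: the recompute and attention flags changed by this commit
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
megatron sft \
    --load <megatron_checkpoint_dir> \
    --dataset <dataset_id_or_path> \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --attention_backend flash
```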
