
Commit 35b37cb

[megatron] optimize dpo main_grad (GPU memory) (#6027)

1 parent: 45df767

File tree: 7 files changed (+37, -10 lines)

docs/source/Megatron-SWIFT/快速开始.md

Lines changed: 14 additions & 1 deletion
@@ -27,13 +27,18 @@ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation -
 pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0

 # If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path
-# This will ensure that the dataset cache is shared, thereby speeding up preprocessing
+# This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
+# Note: this step is critical; otherwise multi-node training may hang because randomness in preprocessing leads to inconsistent data across nodes.
 export MODELSCOPE_CACHE='/xxx/shared'

 # Megatron-LM
 # The training module in the dependency Megatron-LM will be cloned and installed by swift via git clone. You can also point the environment variable `MEGATRON_LM_PATH` to an already downloaded repo path (for offline environments, use the [core_r0.13.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.13.0)).
 git clone --branch core_r0.13.0 https://github.com/NVIDIA/Megatron-LM.git
 export MEGATRON_LM_PATH='/xxx/Megatron-LM'
+
+# flash_attn
+# Choose an appropriate version to install: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
+# Note: do not install a version higher than the maximum supported by transformer_engine: https://github.com/NVIDIA/TransformerEngine/blob/release_v2.6/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L109
 ```

 Alternatively, you can also use the image:

@@ -145,6 +150,14 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, so it likewise supports techniques such as packing, loss_scale, and agent training. For custom dataset formats, see the [custom dataset documentation](../Customization/自定义数据集.md)
 - **More examples**: including packing, multi-node training, 32K context, DPO, MoE models, and pre-training; see [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron)

+Training tips:
+- Ways to increase training throughput: use packing, increase DP, reduce recomputation, and increase computation-communication overlap.
+- Choosing parallelism:
+  - Megatron-SWIFT combines ZeRO-1 (use_distributed_optimizer is enabled by default) with the various parallelism techniques.
+  - DP is the fastest but uses the most GPU memory; use other parallelism techniques to reduce memory usage.
+  - TP/EP involve heavy communication; keep them within a node (the NVLink domain) where possible, and prefer PP/DP across nodes. For expert layers, prefer EP over ETP; ETP saves more GPU memory but is slower.
+  - MoE parallel folding: the MoE parallel groups are separated from the Dense groups. Attention uses the tp-cp-dp-pp group, while MoE uses the etp-ep-dp-pp group.
+- Choosing parallelism for weight conversion: Megatron-SWIFT uses the torch_dist storage format on the mcore side, so parallelism can be adjusted at training time and does not need to be specified during weight conversion.

 ## Benchmark
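The flash_attn note in this file warns against installing a release newer than what transformer_engine accepts. A minimal pre-flight check along those lines, as a sketch: it assumes `flash_attn` is importable and exposes `__version__`, and it uses `2.7.4.post1` (the release linked above) as an illustrative ceiling; the authoritative bound is the one defined in the transformer_engine utils file linked above.

```python
# Sketch of a pre-flight version check; the ceiling below is an illustrative
# assumption mirroring the linked release, not the value transformer_engine enforces.
from packaging.version import Version

import flash_attn  # assumes flash_attn is installed and exposes __version__

ASSUMED_CEILING = Version("2.7.4.post1")

installed = Version(flash_attn.__version__)
if installed > ASSUMED_CEILING:
    raise SystemExit(
        f"flash-attn {installed} is newer than the assumed ceiling {ASSUMED_CEILING}; "
        "check transformer_engine's supported range before training."
    )
print(f"flash-attn {installed} is within the assumed supported range.")
```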

docs/source_en/Megatron-SWIFT/Quick-start.md

Lines changed: 14 additions & 0 deletions
@@ -28,12 +28,17 @@ pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0

 # If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
 # This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
+# Note: This step is crucial; otherwise multi-node training may hang due to data inconsistencies caused by randomness in data preprocessing.
 export MODELSCOPE_CACHE='/xxx/shared'

 # Megatron-LM
 # The training module in the dependent library Megatron-LM will be cloned and installed by swift via `git clone`. Alternatively, you can use the environment variable `MEGATRON_LM_PATH` to point to the path of an already downloaded repository (in offline environments, use the [core_r0.13.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.13.0)).
 git clone --branch core_r0.13.0 https://github.com/NVIDIA/Megatron-LM.git
 export MEGATRON_LM_PATH='/xxx/Megatron-LM'
+
+# flash_attn
+# Choose an appropriate version to install: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
+# Note: Do not install a version higher than the maximum supported by transformer_engine: https://github.com/NVIDIA/TransformerEngine/blob/release_v2.6/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L109
 ```

 Alternatively, you can also use the image:

@@ -149,6 +154,15 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, thus supporting techniques such as packing, loss scale, and agent training. For custom dataset formats, please refer to the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
 - **More Examples**: Including packing, multi-node training, 32K context length, DPO, MoE models, and pre-training, can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron).

+Training tips:
+- Ways to increase training throughput: use packing, increase DP (data parallelism), reduce recomputation, and increase computation-communication overlap.
+- Parallelism choices:
+  - Megatron-SWIFT uses ZeRO-1 (use_distributed_optimizer enabled by default) combined with various parallelism techniques.
+  - DP is the fastest but consumes the most memory; use other parallel techniques to reduce memory usage.
+  - TP/EP involve heavy communication, so keep them within the NVLink domain when possible; for cross-node setups prefer PP/DP. For expert layers, prefer EP over ETP: ETP saves memory but is slower.
+  - MoE parallel folding: MoE parallel groups are separated from the Dense groups. Attention uses tp-cp-dp-pp groups, while MoE uses etp-ep-dp-pp groups.
+- Choosing parallelism for weight conversion: Megatron-SWIFT uses the torch_dist storage format on the MCore side; you can adjust parallelism at training time and do not need to specify it during weight conversion.
+

 ## Benchmark
 The training speed comparison for full-parameter dense models with 8K context length, using `megatron sft` and `swift sft`, under a single-node, eight-GPU A800 environment is as follows:
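The parallelism tips above note that, with MoE parallel folding, attention layers are sharded over a tp-cp-dp-pp decomposition while the MoE layers reuse the same GPUs under an etp-ep-dp-pp decomposition. A small arithmetic sketch of how one world size factors into both groupings; the GPU counts and parallel sizes below are illustrative assumptions, not values taken from this commit.

```python
# Sketch: factor a fixed world size into the dense (attention) and MoE parallel
# groups described in the training tips. All sizes here are illustrative.
world_size = 16                         # e.g. 2 nodes x 8 GPUs (assumption)

# Dense/attention decomposition: tp * cp * pp * dp == world_size
tp, cp, pp = 4, 1, 2                    # keep TP inside the NVLink domain, PP across nodes
dp = world_size // (tp * cp * pp)       # -> 2
assert tp * cp * pp * dp == world_size

# MoE decomposition over the same GPUs: etp * ep * pp * moe_dp == world_size
etp, ep = 1, 4                          # prefer EP over ETP for the expert layers
moe_dp = world_size // (etp * ep * pp)  # -> 2
assert etp * ep * pp * moe_dp == world_size

print(f"attention groups: tp={tp} cp={cp} pp={pp} dp={dp}")
print(f"moe groups:       etp={etp} ep={ep} pp={pp} dp={moe_dp}")
```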

examples/megatron/multimodal/dense/dpo.sh

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# 4 * 60GiB 14s/it
+# 4 * 50GiB 14s/it
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 MAX_PIXELS=1003520 \
@@ -33,7 +33,7 @@ megatron rlhf \
     --num_workers 4 \
     --no_save_optim true \
     --no_save_rng true \
-    --dataset_num_proc 16 \
+    --dataset_num_proc 8 \
     --attention_backend flash \
     --beta 0.1 \
     --loss_type sigmoid
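The script above sets `--beta 0.1 --loss_type sigmoid`. For reference, a minimal sketch of the standard sigmoid DPO objective those flags correspond to; the function and tensor names below are illustrative, not the trainer's actual implementation.

```python
import torch
import torch.nn.functional as F


def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard sigmoid DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()


# Toy usage with per-sample sequence log-probabilities (illustrative values):
loss = dpo_sigmoid_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                        torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```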

examples/megatron/rlhf/dpo/dense.sh

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 4 * 55GiB
+# 4 * 45GiB
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \

examples/megatron/rlhf/dpo/moe.sh

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 8 * 65GiB; 13s/it
+# 8 * 46GiB; 13s/it
 # Note: "ms-swift<3.8" does not support DPO packing; please remove --packing true.
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=8 \

examples/megatron/rlhf/dpo/packing.sh

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 4 * 33GiB; 3.4s/it
+# 4 * 28GiB; 3.4s/it
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \

swift/megatron/trainers/dpo_trainer.py

Lines changed: 4 additions & 4 deletions
@@ -42,13 +42,13 @@ def __init__(self, args):
     def setup_model_and_optimizer(self, model_provider_func, model_type, *_args, **kwargs):
         args = get_args()
         if args.train_type == 'full':
-            ref_model = get_model(model_provider_func, model_type)
+            ref_model = get_model(model_provider_func, model_type, wrap_with_ddp=False)
             if args.ref_load is None:
                 args.ref_load = args.load
             args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
                 ref_model, None, None, load_arg='ref_load')
-            self.ref_model = ref_model[0]
-            self.ref_model.eval()
+            self.ref_model = unwrap_model(ref_model[0])
+            self.ref_model.requires_grad_(False).eval()
         else:
             self.ref_model = None
         return super().setup_model_and_optimizer(model_provider_func, model_type, *_args, **kwargs)
@@ -156,7 +156,7 @@ def null_ref_context(self):
         args = get_args()
         if args.train_type == 'full':
             context = nullcontext()
-            ref_model = unwrap_model(self.ref_model)
+            ref_model = self.ref_model
         else:
             if args.ref_adapter_load is None:
                 context = self.peft_model.disable_adapter()