
Commit 35b37cb

[megatron] optimize dpo main_grad (GPU memory) (#6027)

1 parent: 45df767

File tree: 7 files changed (+37, -10 lines)

docs/source/Megatron-SWIFT/快速开始.md

Lines changed: 14 additions & 1 deletion
@@ -27,13 +27,18 @@ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation -
 pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0

 # If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path
-# This will ensure that the dataset cache is shared, thereby speeding up preprocessing
+# This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
+# Note: this step is critical; otherwise multi-node training may hang because randomness in preprocessing leads to inconsistent data across nodes.
 export MODELSCOPE_CACHE='/xxx/shared'

 # Megatron-LM
 # The training module in the dependency Megatron-LM will be cloned and installed by swift via git clone. You can also point the environment variable `MEGATRON_LM_PATH` to an already downloaded repo path (for offline environments, use the [core_r0.13.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.13.0)).
 git clone --branch core_r0.13.0 https://github.com/NVIDIA/Megatron-LM.git
 export MEGATRON_LM_PATH='/xxx/Megatron-LM'
+
+# flash_attn
+# Choose an appropriate version to install: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
+# Note: do not install a version higher than the maximum supported by transformer_engine: https://github.com/NVIDIA/TransformerEngine/blob/release_v2.6/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L109
 ```

 Alternatively, you can also use the image:

@@ -145,6 +150,14 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, so it likewise supports techniques such as packing, loss_scale, and agent training. For custom dataset formats, see the [custom dataset documentation](../Customization/自定义数据集.md)
 - **More examples**: including packing, multi-node training, 32K context, DPO, MoE models, and pre-training; see [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron)

+Training tips:
+- Ways to increase training throughput: use packing, increase DP, reduce recomputation, and increase computation-communication overlap.
+- Choosing parallelism:
+  - Megatron-SWIFT combines ZeRO-1 (use_distributed_optimizer is enabled by default) with the various parallelism techniques.
+  - DP is the fastest but uses the most GPU memory; use other parallelism techniques to reduce memory usage.
+  - TP/EP involve heavy communication; keep them within a node (the NVLink domain) where possible, and prefer PP/DP across nodes. For expert layers, prefer EP over ETP; ETP saves more GPU memory but is slower.
+  - MoE parallel folding: the MoE parallel groups are separated from the Dense groups. Attention uses the tp-cp-dp-pp group, while MoE uses the etp-ep-dp-pp group.
+- Choosing parallelism for weight conversion: Megatron-SWIFT uses the torch_dist storage format on the mcore side, so parallelism can be adjusted at training time and does not need to be specified during weight conversion.

 ## Benchmark
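The flash_attn note in this file warns against installing a release newer than what transformer_engine accepts. A minimal pre-flight check along those lines, as a sketch: it assumes `flash_attn` is importable and exposes `__version__`, and it uses `2.7.4.post1` (the release linked above) as an illustrative ceiling; the authoritative bound is the one defined in the transformer_engine utils file linked above.

```python
# Sketch of a pre-flight version check; the ceiling below is an illustrative
# assumption mirroring the linked release, not the value transformer_engine enforces.
from packaging.version import Version

import flash_attn  # assumes flash_attn is installed and exposes __version__

ASSUMED_CEILING = Version("2.7.4.post1")

installed = Version(flash_attn.__version__)
if installed > ASSUMED_CEILING:
    raise SystemExit(
        f"flash-attn {installed} is newer than the assumed ceiling {ASSUMED_CEILING}; "
        "check transformer_engine's supported range before training."
    )
print(f"flash-attn {installed} is within the assumed supported range.")
```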

docs/source_en/Megatron-SWIFT/Quick-start.md

Lines changed: 14 additions & 0 deletions
@@ -28,12 +28,17 @@ pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0

 # If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
 # This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
+# Note: This step is crucial; otherwise multi-node training may hang due to data inconsistencies caused by randomness in data preprocessing.
 export MODELSCOPE_CACHE='/xxx/shared'

 # Megatron-LM
 # The training module in the dependent library Megatron-LM will be cloned and installed by swift via `git clone`. Alternatively, you can use the environment variable `MEGATRON_LM_PATH` to point to the path of an already downloaded repository (in offline environments, use the [core_r0.13.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.13.0)).
 git clone --branch core_r0.13.0 https://github.com/NVIDIA/Megatron-LM.git
 export MEGATRON_LM_PATH='/xxx/Megatron-LM'
+
+# flash_attn
+# Choose an appropriate version to install: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
+# Note: Do not install a version higher than the maximum supported by transformer_engine: https://github.com/NVIDIA/TransformerEngine/blob/release_v2.6/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L109
 ```

 Alternatively, you can also use the image:

@@ -149,6 +154,15 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, thus supporting techniques such as packing, loss scale, and agent training. For custom dataset formats, please refer to the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
 - **More Examples**: Including packing, multi-node training, 32K context length, DPO, MoE models, and pre-training, can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron).

+Training tips:
+- Ways to increase training throughput: use packing, increase DP (data parallelism), reduce recomputation, and increase computation-communication overlap.
+- Parallelism choices:
+  - Megatron-SWIFT uses ZeRO-1 (use_distributed_optimizer enabled by default) combined with various parallelism techniques.
+  - DP is the fastest but consumes the most memory; use other parallel techniques to reduce memory usage.
+  - TP/EP involve heavy communication, so keep them within the NVLink domain when possible; for cross-node setups prefer PP/DP. For expert layers, prefer EP over ETP: ETP saves memory but is slower.
+  - MoE parallel folding: MoE parallel groups are separated from the Dense groups. Attention uses tp-cp-dp-pp groups, while MoE uses etp-ep-dp-pp groups.
+- Choosing parallelism for weight conversion: Megatron-SWIFT uses the torch_dist storage format on the MCore side; you can adjust parallelism at training time and do not need to specify it during weight conversion.
+

 ## Benchmark
 The training speed comparison for full-parameter dense models with 8K context length, using `megatron sft` and `swift sft`, under a single-node, eight-GPU A800 environment is as follows:
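The parallelism tips above note that, with MoE parallel folding, attention layers are sharded over a tp-cp-dp-pp decomposition while the MoE layers reuse the same GPUs under an etp-ep-dp-pp decomposition. A small arithmetic sketch of how one world size factors into both groupings; the GPU counts and parallel sizes below are illustrative assumptions, not values taken from this commit.

```python
# Sketch: factor a fixed world size into the dense (attention) and MoE parallel
# groups described in the training tips. All sizes here are illustrative.
world_size = 16                         # e.g. 2 nodes x 8 GPUs (assumption)

# Dense/attention decomposition: tp * cp * pp * dp == world_size
tp, cp, pp = 4, 1, 2                    # keep TP inside the NVLink domain, PP across nodes
dp = world_size // (tp * cp * pp)       # -> 2
assert tp * cp * pp * dp == world_size

# MoE decomposition over the same GPUs: etp * ep * pp * moe_dp == world_size
etp, ep = 1, 4                          # prefer EP over ETP for the expert layers
moe_dp = world_size // (etp * ep * pp)  # -> 2
assert etp * ep * pp * moe_dp == world_size

print(f"attention groups: tp={tp} cp={cp} pp={pp} dp={dp}")
print(f"moe groups:       etp={etp} ep={ep} pp={pp} dp={moe_dp}")
```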

examples/megatron/multimodal/dense/dpo.sh

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# 4 * 60GiB 14s/it
+# 4 * 50GiB 14s/it
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 MAX_PIXELS=1003520 \
@@ -33,7 +33,7 @@ megatron rlhf \
     --num_workers 4 \
     --no_save_optim true \
     --no_save_rng true \
-    --dataset_num_proc 16 \
+    --dataset_num_proc 8 \
     --attention_backend flash \
     --beta 0.1 \
     --loss_type sigmoid
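The script above sets `--beta 0.1 --loss_type sigmoid`. For reference, a minimal sketch of the standard sigmoid DPO objective those flags correspond to; the function and tensor names below are illustrative, not the trainer's actual implementation.

```python
import torch
import torch.nn.functional as F


def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard sigmoid DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()


# Toy usage with per-sample sequence log-probabilities (illustrative values):
loss = dpo_sigmoid_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                        torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```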

examples/megatron/rlhf/dpo/dense.sh

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 4 * 55GiB
+# 4 * 45GiB
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \

examples/megatron/rlhf/dpo/moe.sh

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 8 * 65GiB; 13s/it
+# 8 * 46GiB; 13s/it
 # Note: "ms-swift<3.8" does not support DPO packing; please remove --packing true.
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=8 \

examples/megatron/rlhf/dpo/packing.sh

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 4 * 33GiB; 3.4s/it
+# 4 * 28GiB; 3.4s/it
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \

swift/megatron/trainers/dpo_trainer.py

Lines changed: 4 additions & 4 deletions
@@ -42,13 +42,13 @@ def __init__(self, args):
     def setup_model_and_optimizer(self, model_provider_func, model_type, *_args, **kwargs):
         args = get_args()
         if args.train_type == 'full':
-            ref_model = get_model(model_provider_func, model_type)
+            ref_model = get_model(model_provider_func, model_type, wrap_with_ddp=False)
             if args.ref_load is None:
                 args.ref_load = args.load
             args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
                 ref_model, None, None, load_arg='ref_load')
-            self.ref_model = ref_model[0]
-            self.ref_model.eval()
+            self.ref_model = unwrap_model(ref_model[0])
+            self.ref_model.requires_grad_(False).eval()
         else:
             self.ref_model = None
         return super().setup_model_and_optimizer(model_provider_func, model_type, *_args, **kwargs)
@@ -156,7 +156,7 @@ def null_ref_context(self):
         args = get_args()
         if args.train_type == 'full':
             context = nullcontext()
-            ref_model = unwrap_model(self.ref_model)
+            ref_model = self.ref_model
         else:
             if args.ref_adapter_load is None:
                 context = self.peft_model.disable_adapter()