
Commit 67458df

[megatron] Support dpo lora (#4913)
1 parent 68a6f80 commit 67458df

File tree

9 files changed (+101 / -43 lines)

docs/source/BestPractices/Qwen3最佳实践.md

Lines changed: 2 additions & 2 deletions
@@ -339,7 +339,7 @@ megatron sft \
     --load Qwen3-30B-A3B-Base-mcore \
     --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
     --split_dataset_ratio 0.01 \
-    --tensor_model_parallel_size 2 \
+    --pipeline_model_parallel_size 2 \
     --expert_model_parallel_size 8 \
     --moe_grouped_gemm true \
     --moe_shared_expert_overlap true \
@@ -366,7 +366,7 @@ megatron sft \
     --no_save_optim true \
     --no_save_rng true \
     --sequence_parallel true \
-    --use_flash_attn true
+    --attention_backend flash
 ```

 Training loss curve (partial):

docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 7 additions & 6 deletions
@@ -44,14 +44,14 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2

 First, we need to convert the weights from HF format to Megatron format:
 - If you encounter OOM, simply remove `CUDA_VISIBLE_DEVICES=0`.
-- For "ms-swift>=3.6", it is recommended to add the `--test_convert_precision true` parameter to test conversion precision.
 ```shell
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
     --model Qwen/Qwen2.5-7B-Instruct \
     --to_mcore true \
     --torch_dtype bfloat16 \
-    --output_dir Qwen2.5-7B-Instruct-mcore
+    --output_dir Qwen2.5-7B-Instruct-mcore \
+    --test_convert_precision true
 ```

 Next, use the following script to start training. The required GPU memory is 2*80GiB:
@@ -93,14 +93,14 @@ megatron sft \
 Finally, convert the Megatron-format weights back to HF format:
 - Note: point `--mcore_model` to the parent directory of `iter_xxx`. By default, the checkpoint recorded in `latest_checkpointed_iteration.txt` is used.
 - If you encounter OOM, simply remove `CUDA_VISIBLE_DEVICES=0`.
-- For "ms-swift>=3.6", it is recommended to add the `--test_convert_precision true` parameter to test conversion precision.
 ```shell
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
     --mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
     --to_hf true \
     --torch_dtype bfloat16 \
-    --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf
+    --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
+    --test_convert_precision true
 ```

 We then perform inference on the generated HF-format weights:
@@ -172,10 +172,10 @@ MCore-to-HF conversion script:
 ```bash
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
-    --mcore_adapters /mnt/nas2/huangjintao.hjt/work/llmscope/megatron_output/Qwen3-30B-A3B/v5-20250710-204630 \
+    --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
     --to_hf true \
     --torch_dtype bfloat16 \
-    --output_dir /mnt/nas2/huangjintao.hjt/work/llmscope/megatron_output/Qwen3-30B-A3B/v5-20250710-204630-hf \
+    --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
     --test_convert_precision true
 ```
 - Note: the `mcore_adapters` folder contains an `args.json` file. During conversion, the `mcore_model` and LoRA-related parameters are read from this file, `mcore_model` and `mcore_adapters` are merged (merge-lora) into full weights, and the result is converted to HF-format weights.
@@ -402,6 +402,7 @@ LoRA training:
 - adapter_load: Path of the adapter weights to load. Default is `None`.
 - 🔥target_modules: Suffixes of the modules to apply LoRA to. Default is `['all-linear']`.
 - 🔥target_regex: Regex expression specifying the LoRA modules. Default is `None`. If provided, the `target_modules` parameter is ignored.
+- 🔥modules_to_save: After the tuner is attached, additionally specifies original model modules to participate in training and saving. Default is `[]`.
 - 🔥lora_rank: Default is `8`.
 - 🔥lora_alpha: Default is `32`.
 - lora_dropout: Default is `0.05`.

docs/source_en/BestPractices/Qwen3-Best-Practice.md

Lines changed: 2 additions & 2 deletions
@@ -343,7 +343,7 @@ megatron sft \
     --load Qwen3-30B-A3B-Base-mcore \
     --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
     --split_dataset_ratio 0.01 \
-    --tensor_model_parallel_size 2 \
+    --pipeline_model_parallel_size 2 \
     --expert_model_parallel_size 8 \
     --moe_grouped_gemm true \
     --moe_shared_expert_overlap true \
@@ -370,7 +370,7 @@ megatron sft \
     --no_save_optim true \
     --no_save_rng true \
     --sequence_parallel true \
-    --use_flash_attn true
+    --attention_backend flash
 ```

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 5 additions & 4 deletions
@@ -45,14 +45,14 @@ This section introduces a quick start example for fine-tuning the self-awareness

 First, we need to convert the weights from HF (Hugging Face) format to Megatron format:
 - If you encounter OOM, simply remove `CUDA_VISIBLE_DEVICES=0`.
-- For "ms-swift>=3.6", it is recommended to add the `--test_convert_precision true` parameter to test conversion precision.
 ```shell
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
     --model Qwen/Qwen2.5-7B-Instruct \
     --to_mcore true \
     --torch_dtype bfloat16 \
-    --output_dir Qwen2.5-7B-Instruct-mcore
+    --output_dir Qwen2.5-7B-Instruct-mcore \
+    --test_convert_precision true
 ```

 Next, use the following script to start training. The required GPU memory resources are 2*80GiB:
@@ -94,15 +94,15 @@ megatron sft \
 Finally, convert the Megatron format weights back to HF format:
 - Note: Please point `--mcore_model` to the parent directory of `iter_xxx`. By default, the corresponding checkpoint from `latest_checkpointed_iteration.txt` will be used.
 - If you encounter OOM, simply remove `CUDA_VISIBLE_DEVICES=0`.
-- For "ms-swift>=3.6", it is recommended to add the `--test_convert_precision true` parameter to test conversion precision.

 ```shell
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
     --mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
     --to_hf true \
     --torch_dtype bfloat16 \
-    --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf
+    --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
+    --test_convert_precision true
 ```

 We then perform inference on the generated HF format weights:
@@ -423,6 +423,7 @@ LoRA Training:
 - adapter_load: Path to the adapter weights to be loaded. Default is `None`.
 - 🔥target_modules: Suffixes of modules to apply LoRA to. Default is `['all-linear']`.
 - 🔥target_regex: Regex expression to specify LoRA modules. Default is `None`. If this value is provided, the `target_modules` parameter will be ignored.
+- 🔥modules_to_save: After attaching a tuner, explicitly specifies additional original model modules to participate in training and storage. The default is `[]`.
 - 🔥lora_rank: Default is `8`.
 - 🔥lora_alpha: Default is `32`.
 - lora_dropout: Default is `0.05`.

examples/train/megatron/lora/dpo.sh

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+# 2 * 55GiB; 4.50s/it
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
+NPROC_PER_NODE=2 \
+CUDA_VISIBLE_DEVICES=0,1 \
+megatron rlhf \
+    --rlhf_type dpo \
+    --load Qwen3-30B-A3B-Base-mcore \
+    --dataset 'hjh0119/shareAI-Llama3-DPO-zh-en-emoji#20000' \
+    --train_type lora \
+    --lora_rank 8 \
+    --lora_alpha 32 \
+    --target_modules all-linear \
+    --split_dataset_ratio 0.01 \
+    --expert_model_parallel_size 2 \
+    --moe_grouped_gemm true \
+    --moe_shared_expert_overlap true \
+    --moe_aux_loss_coeff 0.01 \
+    --micro_batch_size 8 \
+    --global_batch_size 16 \
+    --recompute_granularity full \
+    --recompute_method uniform \
+    --recompute_num_layers 1 \
+    --max_epochs 1 \
+    --finetune true \
+    --cross_entropy_loss_fusion true \
+    --lr 1e-4 \
+    --lr_warmup_fraction 0.05 \
+    --min_lr 1e-5 \
+    --save megatron_output/Qwen3-30B-A3B-Base \
+    --eval_interval 100 \
+    --save_interval 100 \
+    --max_length 8192 \
+    --num_workers 8 \
+    --dataset_num_proc 8 \
+    --no_save_optim true \
+    --no_save_rng true \
+    --sequence_parallel true \
+    --attention_backend flash \
+    --beta 0.1 \
+    --rpo_alpha 1 \
+    --loss_type sigmoid
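
As documented in the MCore-to-HF conversion section of this commit, the LoRA checkpoint saved by this script can afterwards be merged with the base weights and exported to HF format. A minimal sketch, where `vx-xxx` stands in for the actual checkpoint directory created under `--save`:

```shell
# Hedged sketch: merge-lora the saved adapter and convert to HF format.
# vx-xxx is a placeholder for the real checkpoint folder name.
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --mcore_adapters megatron_output/Qwen3-30B-A3B-Base/vx-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/Qwen3-30B-A3B-Base/vx-xxx-hf \
    --test_convert_precision true
```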

examples/train/megatron/moe/qwen3_moe.sh

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ megatron sft \
     --load Qwen3-30B-A3B-Base-mcore \
     --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
     --split_dataset_ratio 0.01 \
-    --tensor_model_parallel_size 2 \
+    --pipeline_model_parallel_size 2 \
     --expert_model_parallel_size 8 \
     --moe_grouped_gemm true \
     --moe_shared_expert_overlap true \

swift/megatron/trainers/base.py

Lines changed: 3 additions & 3 deletions
@@ -157,9 +157,9 @@ def load_state_dict(self, state_dict, strict: bool = True, *args, **kwargs):
     def setup_model_and_optimizer(self, model_provider_func, model_type, *_args, **kwargs):

         def new_model_provider_func(*args, **kwargs):
-            model = model_provider_func(*args, **kwargs)
-            prepare_mcore_model(model)
-            return model
+            self.unwrapped_model = model_provider_func(*args, **kwargs)
+            self.peft_model = prepare_mcore_model(self.unwrapped_model)
+            return self.unwrapped_model

         with self._patch_load_state_dict():
             model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(

swift/megatron/trainers/dpo_trainer.py

Lines changed: 27 additions & 12 deletions
@@ -1,5 +1,6 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
 from collections import namedtuple
+from contextlib import contextmanager, nullcontext
 from functools import partial

 import torch
@@ -40,13 +41,16 @@ def __init__(self, args):

     def setup_model_and_optimizer(self, model_provider_func, model_type, *_args, **kwargs):
         args = get_args()
-        ref_model = get_model(model_provider_func, model_type)
-        if args.ref_load is None:
-            args.ref_load = args.load
-        args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
-            ref_model, None, None, load_arg='ref_load')
-        self.ref_model = ref_model[0]
-        self.ref_model.eval()
+        if args.train_type == 'full':
+            ref_model = get_model(model_provider_func, model_type)
+            if args.ref_load is None:
+                args.ref_load = args.load
+            args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
+                ref_model, None, None, load_arg='ref_load')
+            self.ref_model = ref_model[0]
+            self.ref_model.eval()
+        else:
+            self.ref_model = None
         return super().setup_model_and_optimizer(model_provider_func, model_type, *_args, **kwargs)

     @staticmethod
@@ -78,8 +82,7 @@ def _forward_step_helper(model, inputs):

         return output_tensor

-    def ref_forward(self, data_iterator):
-        ref_model = unwrap_model(self.ref_model)
+    def ref_forward(self, ref_model, data_iterator):
         with self.stimer(bdata=True):
             data = get_batch(data_iterator)
             data.pop('loss_scale', None)
@@ -144,13 +147,25 @@ def loss_func(self, output_tensor: torch.Tensor, *, ref_logps: torch.Tensor, lab
         loss = loss / mpu.get_context_parallel_world_size()
         return (loss, reporting_metric)

+    @contextmanager
+    def null_ref_context(self):
+        args = get_args()
+        if args.train_type == 'full':
+            context = nullcontext()
+            ref_model = unwrap_model(self.ref_model)
+        else:
+            context = self.peft_model.disable_adapter()
+            ref_model = self.unwrapped_model
+        with context:
+            yield ref_model
+
     def _replace_data_iterator(self, data_iterator):
         args = get_args()
         num_iters_per_step = args.global_batch_size // (args.micro_batch_size * mpu.get_data_parallel_world_size())
         res = []
-        for i in range(num_iters_per_step):
-            with torch.no_grad():
-                res.append(self.ref_forward(data_iterator))
+        with torch.no_grad(), self.null_ref_context() as ref_model:
+            for i in range(num_iters_per_step):
+                res.append(self.ref_forward(ref_model, data_iterator))
         return iter(res)

     def forward_step(self, data_iterator, model):

swift/megatron/tuners/lora.py

Lines changed: 13 additions & 13 deletions
@@ -225,21 +225,21 @@ def reset_lora_parameters(self, adapter_name, init_lora_weights):

     def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any):
         previous_dtype = x.dtype
-        if self.disable_adapters:
-            if self.merged:
-                self.unmerge()
-            result, bias = self.base_layer(x, *args, **kwargs)
-        elif self.merged:
-            result, bias = self.base_layer(x, *args, **kwargs)
-        else:
-            if isinstance(self.base_layer, TELayerNormColumnParallelLinear):
-                self.base_layer.return_layernorm_output = True
-                result, bias = self.base_layer(x, *args, **kwargs)
-                result, x = result  # ln_out
-            elif isinstance(self.base_layer, (TELinear, TEGroupedLinear)):
+        if self.disable_adapters and self.merged:
+            self.unmerge()
+
+        if isinstance(self.base_layer, TELayerNormColumnParallelLinear):
+            if self.disable_adapters or self.merged:
+                self.base_layer.return_layernorm_output = False
                 result, bias = self.base_layer(x, *args, **kwargs)
             else:
-                raise ValueError(f'Unsupported base layer type: {type(self.base_layer)}')
+                self.base_layer.return_layernorm_output = True
+                (result, x), bias = self.base_layer(x, *args, **kwargs)
+        elif isinstance(self.base_layer, (TELinear, TEGroupedLinear)):
+            result, bias = self.base_layer(x, *args, **kwargs)
+        else:
+            raise ValueError(f'Unsupported base layer type: {type(self.base_layer)}')
+        if not self.disable_adapters and not self.merged:
             for active_adapter in self.active_adapters:
                 if active_adapter not in self.lora_A.keys():
                     continue
