[dpo] support dpo padding_free/logits_to_keep & dpo compat trl==0.18 (#4394)

Jintao-Huang · web-flow · commit e060ad82fc02 · 2025-06-03T16:11:39.000+08:00
diff --git a/docs/source/Instruction/命令行参数.md b/docs/source/Instruction/命令行参数.md
@@ -78,7 +78,7 @@
 - 🔥agent_template: Agent模板，确定如何将工具列表转换成system，如何从模型回复中提取toolcall，以及确定`{"role": "tool_call", "content": "xxx"}`, `{"role": "tool_response", "content": "xxx"}`的模板格式。可选为"react_en", "hermes", "glm4", "qwen_en", "toolbench"等，更多请查看[这里](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/agent_template/__init__.py)。默认为None，根据模型类型进行选择。
 - norm_bbox: 控制如何缩放边界框（bbox）。选项为'norm1000'和'none'。'norm1000'表示将bbox坐标缩放至千分之一，而'none'表示不进行缩放。默认值为None，将根据模型自动选择。
 - use_chat_template: 使用chat模板或generation模板，默认为`True`。`swift pt`会自动设置为generation模板。
-- 🔥padding_free: 将一个batch中的数据进行展平而避免数据padding，从而降低显存占用并加快训练。默认为False。当前支持`swift pt/sft`。
+- 🔥padding_free: 将一个batch中的数据进行展平而避免数据padding，从而降低显存占用并加快训练。默认为False。当前支持CPT/SFT/DPO/GRPO。
   - 注意：使用padding_free请结合`--attn_impl flash_attn`使用且"transformers>=4.44"，具体查看[该PR](https://github.com/huggingface/transformers/pull/31629)。（同packing）
   - 支持的多模态模型与多模态packing支持情况相同。相较于packing，padding_free不额外消耗时间和空间。
   - Megatron-SWIFT默认使用padding_free，即`qkv_format='thd'`，不需要额外设置。
@@ -88,7 +88,7 @@
   - 'all': 计算所有tokens的损失。
   - 'ignore_empty_think': 在`'default'`的基础上，忽略空的`'<think>\n\n</think>\n\n'`损失计算，具体请参考[此issue](https://github.com/modelscope/ms-swift/issues/4030)。
   - 'react', 'hermes', 'qwen': 在`'default'`的基础上，将`tool_call`部分的loss权重调整为2。
-- sequence_parallel_size: 序列并行大小，默认是1。当前支持pt/sft/dpo。训练脚本参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/sequence_parallel.sh)。
+- sequence_parallel_size: 序列并行大小，默认是1。当前支持CPT/SFT/DPO/GRPO。训练脚本参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/sequence_parallel.sh)。
 - response_prefix: response的前缀字符，例如QwQ-32B将response_prefix设置为`'<think>\n'`。默认为None，根据模型自动设置。
   - 注意：若对deepseek-r1/qwq模型使用不包含`<think>...</think>`的数据集进行训练，请在推理训练后模型时额外传入`--response_prefix ''`。
 - template_backend: 选择template后端，可选为'swift'、'jinja'，默认为'swift'。如果使用jinja，则使用transformers的`apply_chat_template`。
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -79,7 +79,7 @@ Hints:
 - 🔥agent_template: Agent template, which determines how to convert the list of tools into a system, how to extract tool calls from the model's response, and specifies the template format for `{"role": "tool_call", "content": "xxx"}` and `{"role": "tool_response", "content": "xxx"}`. Optional values include "react_en", "hermes", "glm4", "qwen_en", "toolbench", etc. For more details, please check [here](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/agent_template/__init__.py). The default value is None, meaning it will be selected based on the model type.
 - norm_bbox: Controls how to scale bounding boxes (bbox). Options are 'norm1000' and 'none'. 'norm1000' represents scaling bbox coordinates to one-thousandths, and 'none' means no scaling. Default is None, automatically selected based on the model.
 - use_chat_template: Use chat template or generation template, default is `True`. `swift pt` is automatically set to the generation template.
-- 🔥padding_free: Flattens the data in a batch to avoid padding, thereby reducing memory usage and accelerating training. Default is False. Currently supports `swift pt/sft`.
+- 🔥padding_free: Flattens the data in a batch to avoid padding, thereby reducing memory usage and accelerating training. Default is False. Currently supported in CPT/SFT/DPO/GRPO.
   - Note: When using `padding_free`, it should be combined with `--attn_impl flash_attn` and "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629). (Same as packing)
   - The supported multimodal models are the same as those supported for multimodal packing. Compared to packing, padding_free does not consume additional time or space.
   - Megatron-SWIFT uses `padding_free` by default, i.e., `qkv_format='thd'`, and no additional configuration is required.
@@ -89,7 +89,7 @@ Hints:
   - 'all': Calculate the loss for all tokens.
   - 'ignore_empty_think': On top of 'default', ignore the loss calculation for empty `'<think>\n\n</think>\n\n'`. See [this issue](https://github.com/modelscope/ms-swift/issues/4030) for more details.
   - `'react'`, `'hermes'`, `'qwen'`: On top of `'default'`, set the loss weight of the `tool_call` part to 2.
-- sequence_parallel_size: Sequence parallelism size, default is 1. Currently supported in pt/sft/dpo. The training script refers to [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/sequence_parallel.sh).
+- sequence_parallel_size: Sequence parallelism size, default is 1. Currently supported in CPT/SFT/DPO/GRPO. For a training script, see [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/sequence_parallel.sh).
 - response_prefix: The prefix character for the response, for example, setting the response_prefix to `'<think>\n'` for QwQ-32B. The default is None, and it is automatically set according to the model.
   - Note: If you are training the deepseek-r1/qwq model with a dataset that does not include `<think>...</think>`, please pass `--response_prefix ''` additionally when inferring after training.
 - template_backend: Selection of the template backend. Options are 'swift' and 'jinja', with 'swift' as the default. If using jinja, it applies transformers' `apply_chat_template`.
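Note on `padding_free`: the docs above describe it as flattening a batch to avoid padding. A minimal sketch of that idea follows, using plain Python lists instead of tensors. The helper names `flatten_batch` and `pad_batch` are hypothetical and are not part of ms-swift; the only detail taken from this commit is that positions restart at 0 for each sequence, which is what `get_cu_seqlens` relies on to find sequence boundaries.

```python
from typing import Dict, List


def flatten_batch(sequences: List[List[int]]) -> Dict[str, List[int]]:
    """Concatenate variable-length sequences into a single row, with no padding.

    Hypothetical helper for illustration only.
    """
    input_ids: List[int] = []
    position_ids: List[int] = []
    for seq in sequences:
        input_ids.extend(seq)
        # Positions restart at 0 for every sequence; the zeros mark boundaries.
        position_ids.extend(range(len(seq)))
    return {"input_ids": input_ids, "position_ids": position_ids}


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Conventional right-padding to the longest sequence, for comparison."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
flat = flatten_batch(batch)
padded = pad_batch(batch)

flat_tokens = len(flat["input_ids"])            # tokens processed when flattened
padded_tokens = sum(len(row) for row in padded)  # tokens processed when padded
print(flat_tokens, padded_tokens)
```

In this toy batch the flattened form holds only the real tokens, while the padded form also carries pad positions; the gap grows with the spread of sequence lengths in a batch.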
diff --git a/examples/train/padding_free/dpo.sh b/examples/train/padding_free/dpo.sh
@@ -0,0 +1,29 @@
+# with padding_free: 4 * 47GiB, 1.90s/it
+# without padding_free: 4 * 57GiB, 3.32s/it
+NPROC_PER_NODE=4 \
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+swift rlhf \
+    --rlhf_type dpo \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --train_type full \
+    --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji \
+    --torch_dtype bfloat16 \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 4 \
+    --learning_rate 1e-5 \
+    --gradient_accumulation_steps 1 \
+    --eval_steps 100 \
+    --save_steps 100 \
+    --save_total_limit 2 \
+    --logging_steps 5 \
+    --max_length 8192 \
+    --output_dir output \
+    --warmup_ratio 0.05 \
+    --save_only_model true \
+    --dataloader_num_workers 4 \
+    --dataset_num_proc 4 \
+    --deepspeed zero3 \
+    --attn_impl flash_attn \
+    --save_only_model true \
+    --padding_free true
diff --git a/examples/train/padding_free/dpo_vlm.sh b/examples/train/padding_free/dpo_vlm.sh
@@ -0,0 +1,31 @@
+# with padding_free: 4 * 53GiB, 3.55s/it
+# without padding_free: 4 * 62GiB, 4.41s/it
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NPROC_PER_NODE=4 \
+MAX_PIXELS=1003520 \
+swift rlhf \
+    --rlhf_type dpo \
+    --model Qwen/Qwen2.5-VL-7B-Instruct \
+    --dataset 'swift/RLAIF-V-Dataset#20000' \
+    --train_type full \
+    --torch_dtype bfloat16 \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 4 \
+    --learning_rate 1e-5 \
+    --freeze_vit true \
+    --gradient_accumulation_steps 1 \
+    --eval_steps 100 \
+    --save_steps 100 \
+    --save_total_limit 2 \
+    --deepspeed zero3 \
+    --logging_steps 5 \
+    --max_length 4096 \
+    --output_dir output \
+    --warmup_ratio 0.05 \
+    --dataloader_num_workers 4 \
+    --dataset_num_proc 4 \
+    --attn_impl flash_attn \
+    --save_only_model true \
+    --padding_free true
diff --git a/examples/train/rlhf/dpo/lora.sh b/examples/train/rlhf/dpo/lora.sh
@@ -1,4 +1,6 @@
 # 24GiB
+# It is recommended to use padding_free. For more details, please refer to:
+# https://github.com/modelscope/ms-swift/blob/main/examples/train/padding_free/dpo.sh
 CUDA_VISIBLE_DEVICES=0 \
 swift rlhf \
     --rlhf_type dpo \
diff --git a/swift/llm/argument/infer_args.py b/swift/llm/argument/infer_args.py
@@ -158,6 +158,7 @@ def _init_ddp(self):
         if not is_dist():
             return
         assert not self.eval_human and not self.stream, (
+            'In DDP scenarios, interactive interfaces and streaming output are not supported. '
             f'args.eval_human: {self.eval_human}, args.stream: {self.stream}')
         self._init_device()
         init_process_group(backend=self.ddp_backend, timeout=self.ddp_timeout)
diff --git a/swift/llm/train/sft.py b/swift/llm/train/sft.py
@@ -81,9 +81,10 @@ def _get_data_collator(self):
         padding_to = args.max_length if args.train_type == 'longlora' else None
         return partial(template.data_collator, padding_to=padding_to)
 
-    @staticmethod
-    def _save_val_dataset(output_dir: str, val_dataset):
-        if is_master() and isinstance(val_dataset, HfDataset):
+    def _save_val_dataset(self, val_dataset):
+        args = self.args
+        output_dir = getattr(args, 'output_dir', None) or getattr(args, 'save')
+        if is_master() and isinstance(val_dataset, HfDataset) and not args.val_dataset:
             os.makedirs(output_dir, exist_ok=True)
             val_dataset_path = os.path.join(output_dir, 'val_dataset.jsonl')
             append_to_jsonl(val_dataset_path, val_dataset.to_list())
@@ -216,8 +217,7 @@ def _stat_dataset(self, dataset: Union[HfDataset, PackingDataset]):
     def _encode_dataset(self, train_dataset, val_dataset):
         template = self.template
         args = self.args
-        output_dir = getattr(args, 'output_dir', None) or getattr(args, 'save')
-        self._save_val_dataset(output_dir, val_dataset)
+        self._save_val_dataset(val_dataset)
         is_grpo = hasattr(args, 'rlhf_type') and args.rlhf_type == 'grpo'
         predict_with_generate = getattr(args, 'predict_with_generate', False)
         if not is_grpo:
diff --git a/swift/trainers/arguments.py b/swift/trainers/arguments.py
@@ -87,9 +87,26 @@ def _new_checkpoint(*args, use_reentrant=None, **kwargs):
         except (ImportError, AttributeError):
             pass
 
+    @staticmethod
+    def _patch_liger_kernel():
+        # fix logits_to_keep
+        from liger_kernel.transformers.model import loss_utils
+        origin_LigerForCausalLMLoss = loss_utils.LigerForCausalLMLoss
+
+        def LigerForCausalLMLoss(hidden_states, *args, **kwargs):
+            hidden_states = hidden_states.contiguous()
+            return origin_LigerForCausalLMLoss(hidden_states, *args, **kwargs)
+
+        loss_utils.LigerForCausalLMLoss = LigerForCausalLMLoss
+        logger.info('Patch liger_kernel successfully.')
+
     def _init_liger(self):
         if self.use_liger_kernel:
             assert is_liger_available(), 'use_liger_kernel requires liger_kernels, try `pip install liger-kernel`'
+            try:
+                self._patch_liger_kernel()
+            except Exception:
+                pass
 
     def __post_init__(self):
         if is_mp() and self.use_liger_kernel:
diff --git a/swift/trainers/mixin.py b/swift/trainers/mixin.py
@@ -34,7 +34,7 @@
 from swift.llm import BatchSamplerShard, DataLoaderDispatcher, DataLoaderShard, Template
 from swift.plugin import MeanMetric, compute_acc, extra_tuners
 from swift.tuners import SwiftModel
-from swift.utils import get_logger, is_mp_ddp, ms_logger_context, seed_worker, use_torchacc
+from swift.utils import get_logger, is_mp, is_mp_ddp, ms_logger_context, seed_worker, use_torchacc
 from swift.utils.torchacc_utils import ta_trim_graph
 from ..utils.torch_utils import get_device_count
 from .arguments import TrainingArguments
@@ -484,6 +484,36 @@ def _evalscope_eval(self):
         self.model.train()
         return eval_dict
 
+    def get_logits_to_keep(self, labels):
+        if labels.shape[0] == 1 and not is_mp():
+            # device_map may encounter device mismatch issues.
+            loss_mask = (labels != -100)[0]
+            labels = labels[:, loss_mask]
+            labels = nn.functional.pad(labels, (1, 0), value=-100)
+            logits_to_keep = nn.functional.pad(loss_mask[1:], (0, 1), value=True)
+        else:
+            logits_to_keep = labels.shape[-1] - ((labels != -100).int().argmax(-1).min().item()) + 1
+            assert logits_to_keep > 0
+            labels = labels[:, -logits_to_keep:]
+        return labels, logits_to_keep
+
+    def get_cu_seqlens(self, position_ids, logits_to_keep) -> torch.Tensor:
+        assert position_ids.shape[0] == 1
+        position_ids = position_ids[0]
+        indices = torch.arange(position_ids.shape[0], device=position_ids.device)
+        cu_seqlens = torch.concat([
+            indices[position_ids == 0],
+            torch.tensor(position_ids.shape, device=position_ids.device),
+        ])
+        res_cu_seqlens = cu_seqlens.clone()
+        if isinstance(logits_to_keep, torch.Tensor):
+            for i in range(cu_seqlens.shape[0] - 1):
+                start, end = cu_seqlens[i], cu_seqlens[i + 1]
+                res_cu_seqlens[i + 1:] -= (~logits_to_keep[start:end]).sum()
+        elif isinstance(logits_to_keep, int):
+            res_cu_seqlens[1:] -= position_ids.shape[0] + 1 - logits_to_keep
+        return res_cu_seqlens
+
     def get_batch_samples(self, *args, **kwargs):
         res = super().get_batch_samples(*args, **kwargs)
         from swift.trainers.sequence_parallel import sequence_parallel
diff --git a/swift/trainers/rlhf_trainer/dpo_trainer.py b/swift/trainers/rlhf_trainer/dpo_trainer.py
diff --git a/swift/trainers/rlhf_trainer/rlhf_mixin.py b/swift/trainers/rlhf_trainer/rlhf_mixin.py
diff --git a/swift/trainers/sequence_parallel/ulysses.py b/swift/trainers/sequence_parallel/ulysses.py
diff --git a/swift/trainers/trainers.py b/swift/trainers/trainers.py