
Commit cb64cb7
Merge branch 'main' into release/3.5
2 parents: f958295 + 4dfbf44

13 files changed: +73 -39 lines

docs/source/Instruction/命令行参数.md
Lines changed: 1 addition & 0 deletions

@@ -359,6 +359,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
 - packing_cache: Specifies the packing cache directory. Defaults to `None`, meaning the cache is stored under the path given by the `$MODELSCOPE_CACHE` environment variable. When using packing across nodes, make sure all nodes share the same packing cache path, either by setting the `MODELSCOPE_CACHE` environment variable or by passing `--packing_cache <shared_path>` on the command line.
 - 🔥lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models, this includes reading images from disk). Defaults to False for LLM training and True for MLLM training, to save memory.
 - use_logits_to_keep: Pass `logits_to_keep` in `forward` based on the labels to avoid computing and storing unneeded logits, reducing memory usage and speeding up training. Defaults to None, which selects automatically.
+  - Note: for stability, this value defaults to False for multimodal models and must be enabled manually.
 - acc_strategy: Strategy for computing accuracy during training and validation. Options are `seq`-level and `token`-level accuracy; defaults to `token`.
 - max_new_tokens: Generation parameter override. Maximum number of generated tokens when predict_with_generate=True; defaults to 64.
 - temperature: Generation parameter override. Temperature when predict_with_generate=True; defaults to 0.

docs/source/Instruction/支持的模型和数据集.md
Lines changed: 2 additions & 0 deletions

@@ -438,6 +438,8 @@
 |[OpenBMB/MiniCPM-2B-dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|minicpm|minicpm|transformers>=4.36.0|✘|-|[openbmb/MiniCPM-2B-dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|
 |[OpenBMB/MiniCPM-1B-sft-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16)|minicpm|minicpm|transformers>=4.36.0|✘|-|[openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)|
 |[OpenBMB/MiniCPM-2B-128k](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-128k)|minicpm_chatml|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k)|
+|[OpenBMB/MiniCPM4-0.5B](https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B)|minicpm_chatml|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B)|
+|[OpenBMB/MiniCPM4-8B](https://modelscope.cn/models/OpenBMB/MiniCPM4-8B)|minicpm_chatml|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B)|
 |[OpenBMB/MiniCPM3-4B](https://modelscope.cn/models/OpenBMB/MiniCPM3-4B)|minicpm3|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B)|
 |[OpenBMB/MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B)|minicpm_moe|minicpm|transformers>=4.36|✘|-|[openbmb/MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B)|
 |[TeleAI/TeleChat-7B](https://modelscope.cn/models/TeleAI/TeleChat-7B)|telechat|telechat|-|✘|-|[Tele-AI/telechat-7B](https://huggingface.co/Tele-AI/telechat-7B)|

docs/source_en/Instruction/Command-line-parameters.md
Lines changed: 1 addition & 0 deletions

@@ -368,6 +368,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 - packing_cache: Specifies the directory for the packing cache. The default value is `None`, which means the cache is stored in the path defined by the `$MODELSCOPE_CACHE` environment variable. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory, either by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument on the command line.
 - 🔥lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models, this includes reading images from disk). This parameter defaults to False for LLM training and True for MLLM training, to save memory.
 - use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on the labels to avoid computing and storing unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
+  - Note: For stability, this value defaults to False for multimodal models and needs to be enabled manually.
 - acc_strategy: Strategy for calculating accuracy during training and validation. Options are `seq`-level and `token`-level accuracy, with `token` as the default.
 - max_new_tokens: Generation parameter override. The maximum number of tokens to generate when `predict_with_generate=True`; defaults to 64.
 - temperature: Generation parameter override. The temperature setting when `predict_with_generate=True`; defaults to 0.
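The mechanism behind use_logits_to_keep is worth spelling out. The sketch below is not ms-swift's actual implementation; it is a minimal illustration of the idea, assuming a recent transformers version whose causal-LM forward accepts a `logits_to_keep` argument, and computing the loss by hand so the trimmed logits stay aligned with the labels.

    import torch
    import torch.nn.functional as F

    def loss_with_logits_to_keep(model, input_ids, labels):
        # Positions labeled -100 contribute nothing to the loss, so logits
        # for the leading ignored prefix (e.g. the prompt) can be skipped.
        first = (labels != -100).float().argmax(dim=-1).min().item()
        keep = labels.shape[-1] - max(first - 1, 0)  # logits at i predict token i + 1
        logits = model(input_ids=input_ids, logits_to_keep=keep).logits
        shift_logits = logits[:, :-1, :]      # drop the final position
        shift_labels = labels[:, -keep + 1:]  # align labels with the kept logits
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100)

For a long prompt with a short completion, this avoids materializing the full [batch, seq_len, vocab] logits tensor, which is usually the single largest activation during fine-tuning.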

docs/source_en/Instruction/Supported-models-and-datasets.md
Lines changed: 2 additions & 0 deletions

@@ -438,6 +438,8 @@ The table below introduces the models integrated with ms-swift:
 |[OpenBMB/MiniCPM-2B-dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|minicpm|minicpm|transformers>=4.36.0|✘|-|[openbmb/MiniCPM-2B-dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|
 |[OpenBMB/MiniCPM-1B-sft-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16)|minicpm|minicpm|transformers>=4.36.0|✘|-|[openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)|
 |[OpenBMB/MiniCPM-2B-128k](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-128k)|minicpm_chatml|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k)|
+|[OpenBMB/MiniCPM4-0.5B](https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B)|minicpm_chatml|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B)|
+|[OpenBMB/MiniCPM4-8B](https://modelscope.cn/models/OpenBMB/MiniCPM4-8B)|minicpm_chatml|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B)|
 |[OpenBMB/MiniCPM3-4B](https://modelscope.cn/models/OpenBMB/MiniCPM3-4B)|minicpm3|chatml|transformers>=4.36|✘|-|[openbmb/MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B)|
 |[OpenBMB/MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B)|minicpm_moe|minicpm|transformers>=4.36|✘|-|[openbmb/MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B)|
 |[TeleAI/TeleChat-7B](https://modelscope.cn/models/TeleAI/TeleChat-7B)|telechat|telechat|-|✘|-|[Tele-AI/telechat-7B](https://huggingface.co/Tele-AI/telechat-7B)|

swift/llm/dataset/dataset/llm.py
Lines changed: 21 additions & 2 deletions

@@ -325,13 +325,20 @@ def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:


 class StsbPreprocessor(ResponsePreprocessor):

+    def __init__(self, sim_threshold: Optional[float] = None):
+        self.sim_threshold = sim_threshold
+        super().__init__()
+
     def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
         row = {
             'query': row['sentence1'],
             'response': row['sentence2'],
             'label': row['score'],
         }
-        return super().preprocess(row)
+        # Keep only pairs whose similarity score clears the threshold (if set).
+        if self.sim_threshold is None or float(row['label']) >= self.sim_threshold:
+            return super().preprocess(row)
+        else:
+            return None


 class StsbGeneratePreprocessor(ResponsePreprocessor):

@@ -364,6 +371,7 @@ def preprocess(self, row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
         hf_dataset_id='sentence-transformers/stsb',
         subsets=[
             SubsetDataset('default', preprocess_func=StsbPreprocessor()),  # embedding
+            SubsetDataset('positive', preprocess_func=StsbPreprocessor(sim_threshold=0.75)),  # infonce
             SubsetDataset('generate', preprocess_func=StsbGeneratePreprocessor()),
             SubsetDataset('reg', preprocess_func=StsbRegressionPreprocessor()),
         ],

@@ -676,11 +684,22 @@ def repair_conversations(s: Union[str, Any]) -> Any:
         preprocess_func=MessagesPreprocessor(repair_messages=repair_conversations),
         tags=['chat', 'em']))


+class EmojiPreprocessor(ResponsePreprocessor):
+
+    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
+        # Remove dirty characters (stray U+FE0F emoji variation selectors)
+        row['query'] = row['query'].replace('️', '')
+        row['response'] = row['response'].replace('️', '')
+        row['rejected_response'] = row['rejected_response'].replace('️', '')
+        return super().preprocess(row)
+
+
 register_dataset(
     DatasetMeta(
         ms_dataset_id='hjh0119/shareAI-Llama3-DPO-zh-en-emoji',
         hf_dataset_id='shareAI/DPO-zh-en-emoji',
-        preprocess_func=ResponsePreprocessor(columns={
+        preprocess_func=EmojiPreprocessor(columns={
             'answer_zh': 'response',
             'answer_en': 'rejected_response'
         }),
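The new 'positive' subset exists because InfoNCE-style contrastive training treats every (query, response) pair as a true positive, so weakly related pairs must be dropped outright rather than kept with a low score. A quick illustration of the filter with made-up rows (assuming ResponsePreprocessor's default column handling):

    pre = StsbPreprocessor(sim_threshold=0.75)

    pre.preprocess({'sentence1': 'A man plays guitar.',
                    'sentence2': 'Someone is playing a guitar.',
                    'score': 0.95})  # kept: returns a processed row

    pre.preprocess({'sentence1': 'A man plays guitar.',
                    'sentence2': 'A dog runs in a field.',
                    'score': 0.10})  # dropped: returns None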

swift/llm/model/model/minicpm.py
Lines changed: 4 additions & 0 deletions

@@ -183,6 +183,10 @@ def get_model_tokenizer_minicpmv_2_x(model_dir: str,
         ModelGroup([
             Model('OpenBMB/MiniCPM-2B-128k', 'openbmb/MiniCPM-2B-128k'),
         ]),
+        ModelGroup([
+            Model('OpenBMB/MiniCPM4-0.5B', 'openbmb/MiniCPM4-0.5B'),
+            Model('OpenBMB/MiniCPM4-8B', 'openbmb/MiniCPM4-8B'),
+        ]),
     ],
     TemplateType.chatml,
     get_model_tokenizer_with_flash_attn,

swift/llm/model/register.py
Lines changed: 15 additions & 13 deletions

@@ -255,6 +255,21 @@ def get_model_tokenizer_from_local(model_dir: str,
         InitModelStrategy.init_parameters(model, init_strategy)

     model_info.config = model_config if model is None else model.config
+
+    pad_token = tokenizer.pad_token_id
+    if pad_token is None:
+        pad_token = tokenizer.eos_token_id
+    if tokenizer.eos_token_id is None:
+        tokenizer.eos_token_id = pad_token
+    if tokenizer.pad_token_id is None:
+        tokenizer.pad_token_id = pad_token
+    assert tokenizer.eos_token_id is not None
+    assert tokenizer.pad_token_id is not None
+
+    if model is not None:
+        # fix seq classification task
+        HfConfigFactory.set_model_config_attr(model, 'pad_token_id', pad_token)
+
     return model, tokenizer


@@ -583,20 +598,7 @@ def get_model_tokenizer(
     tokenizer.model_info = model_info
     tokenizer.model_meta = model_meta

-    pad_token = tokenizer.pad_token_id
-    if pad_token is None:
-        pad_token = tokenizer.eos_token_id
-    if tokenizer.eos_token_id is None:
-        tokenizer.eos_token_id = pad_token
-    if tokenizer.pad_token_id is None:
-        tokenizer.pad_token_id = pad_token
-    assert tokenizer.eos_token_id is not None
-    assert tokenizer.pad_token_id is not None
-
     if model is not None:
-        # fix seq classification task
-        HfConfigFactory.set_model_config_attr(model, 'pad_token_id', pad_token)
-
         model.model_info = model_info
         model.model_meta = model_meta
         model.model_dir = model_dir
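This change moves the pad/eos fallback from get_model_tokenizer into get_model_tokenizer_from_local, so any tokenizer loaded through the local path leaves with both ids set. A minimal standalone demonstration of the fallback itself, using GPT-2 (which ships with an eos token but no pad token); this mirrors the logic above rather than calling swift's code path:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained('gpt2')
    assert tok.pad_token_id is None          # GPT-2 has no pad token...
    assert tok.eos_token_id == 50256         # ...but does have an eos token

    pad_token = tok.pad_token_id
    if pad_token is None:
        pad_token = tok.eos_token_id         # fall back to eos
    if tok.pad_token_id is None:
        tok.pad_token_id = pad_token         # reuse eos as pad for batching

    assert tok.pad_token_id == 50256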

swift/llm/template/template/emu3.py
Lines changed: 4 additions & 2 deletions

@@ -27,8 +27,10 @@ class Emu3GenTemplate(Template):
         'lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, '
         'worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.')

-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
+    def init_processor(self, processor) -> None:
+        if processor is None:
+            return
+        super().init_processor(processor)
         self.bov = self.processor.tokenizer.encode(self.processor.visual_template[0].format(token_id=0))[0]
         self.eov = self.processor.tokenizer.encode(self.processor.visual_template[0].format(token_id=self.COOKBOOK_SIZE
                                                                                             - 1))[0]

swift/llm/template/template/qwen.py
Lines changed: 4 additions & 2 deletions

@@ -408,8 +408,10 @@ class Qwen2_5OmniTemplate(Qwen2_5VLTemplate):
     version = 'omni'
     placeholder_tokens = ['<|IMAGE|>', '<|AUDIO|>', '<|VIDEO|>']

-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
+    def init_processor(self, processor) -> None:
+        if processor is None:
+            return
+        super().init_processor(processor)
         from transformers.models.qwen2_5_omni.processing_qwen2_5_omni import Qwen2_5OmniProcessorKwargs
         default = Qwen2_5OmniProcessorKwargs._defaults
         self.seconds_per_chunk = default['videos_kwargs']['seconds_per_chunk']
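Both templates follow the same refactor: processor-dependent attributes move out of __init__ and into init_processor, so a template can be constructed before any processor exists and bound to one later. A minimal sketch of the pattern (hypothetical base class, not swift's actual Template API):

    class LazyBoundTemplate:
        def __init__(self):
            self.processor = None            # nothing processor-derived here

        def init_processor(self, processor) -> None:
            if processor is None:            # tolerate deferred binding
                return
            self.processor = processor
            # Attributes that need the processor are resolved only now.
            self.bos = processor.tokenizer.bos_token_id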

swift/megatron/train/utils.py
Lines changed: 1 addition & 1 deletion

@@ -205,7 +205,7 @@ def get_batch(data_iterator):

     # TODO: this is pretty hacky, find a better way
     if (not mpu.is_pipeline_first_stage()) and (not mpu.is_pipeline_last_stage()):
-        return None, None, None, None, None
+        return {key: None for key in ['input_ids', 'attention_mask', 'position_ids']}

     # get batches based on the TP rank you are on
     batch = get_batch_on_this_tp_rank(data_iterator)
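Returning a dict of None values instead of a bare 5-tuple lets intermediate pipeline stages be handled by key rather than by position, so callers no longer have to match an arity that only the first and last stages actually fill. A toy caller (hypothetical, not Megatron's real consumer):

    def step(batch):
        input_ids = batch['input_ids']    # key access works on every stage
        if input_ids is None:
            # Intermediate stage: activations arrive via pipeline
            # communication, so there is nothing to embed or score here.
            return
        ...                               # first/last stage: use the batch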
