[megatron] Support megatron glm4.5v (#5661)

Jintao-Huang · web-flow · commit c8c6881dc64f · 2025-09-06T20:17:35.000+08:00
diff --git a/docs/source/Instruction/命令行参数.md b/docs/source/Instruction/命令行参数.md
@@ -148,6 +148,7 @@
   - 注意：多模态模型且是LoRA训练时，当设置了`--freeze_vit false`，且命令行中出现以下警告：`UserWarning: None of the inputs have requires_grad=True. Gradients will be None`，请设置`--vit_gradient_checkpointing false`，或提相关issue。全参数训练则不会出现该问题。
 - 🔥deepspeed: 默认为None。可以设置为'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', 'zero3_offload'来使用ms-swift内置的deepspeed配置文件。你也可以传入自定义deepspeed配置文件的路径。
 - zero_hpz_partition_size: 默认为None，这个参数是ZeRO++的特性，即node内模型分片，node间数据分片，如果遇到grad_norm NaN，请尝试使用`--torch_dtype float16`。
+- deepspeed_autotp_size: DeepSpeed张量并行大小，默认为1。使用DeepSpeed AutoTP时需将参数`--deepspeed`设置为'zero0'、'zero1'或'zero2'。（注意：该功能只支持全参数）
 - 🔥per_device_train_batch_size: 默认值1。
 - 🔥per_device_eval_batch_size: 默认值1。
 - 🔥gradient_accumulation_steps: 梯度累加，默认为None，即设置gradient_accumulation_steps使得total_batch_size>=16。total_batch_size等于`per_device_train_batch_size * gradient_accumulation_steps * world_size`, 在GRPO训练中，默认为1。
diff --git a/docs/source/Instruction/支持的模型和数据集.md b/docs/source/Instruction/支持的模型和数据集.md
@@ -709,7 +709,7 @@
 |[ZhipuAI/cogagent-9b-20241220](https://modelscope.cn/models/ZhipuAI/cogagent-9b-20241220)|glm4v|glm4v|transformers>=4.42|&#x2718;|-|[zai-org/cogagent-9b-20241220](https://huggingface.co/zai-org/cogagent-9b-20241220)|
 |[ZhipuAI/GLM-4.1V-9B-Base](https://modelscope.cn/models/ZhipuAI/GLM-4.1V-9B-Base)|glm4_1v|glm4_1v|transformers>=4.53|&#x2718;|-|[zai-org/GLM-4.1V-9B-Base](https://huggingface.co/zai-org/GLM-4.1V-9B-Base)|
 |[ZhipuAI/GLM-4.1V-9B-Thinking](https://modelscope.cn/models/ZhipuAI/GLM-4.1V-9B-Thinking)|glm4_1v|glm4_1v|transformers>=4.53|&#x2718;|-|[zai-org/GLM-4.1V-9B-Thinking](https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking)|
-|[ZhipuAI/GLM-4.5V](https://modelscope.cn/models/ZhipuAI/GLM-4.5V)|glm4_5v|glm4_5v|transformers>=4.56.0.dev|&#x2718;|-|[zai-org/GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)|
+|[ZhipuAI/GLM-4.5V](https://modelscope.cn/models/ZhipuAI/GLM-4.5V)|glm4_5v|glm4_5v|transformers>=4.56.0.dev|&#x2714;|-|[zai-org/GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)|
 |[ZhipuAI/GLM-4.5V-FP8](https://modelscope.cn/models/ZhipuAI/GLM-4.5V-FP8)|glm4_5v|glm4_5v|transformers>=4.56.0.dev|&#x2718;|-|[zai-org/GLM-4.5V-FP8](https://huggingface.co/zai-org/GLM-4.5V-FP8)|
 |[ZhipuAI/glm-edge-v-2b](https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b)|glm_edge_v|glm_edge_v|transformers>=4.46|&#x2718;|vision|[zai-org/glm-edge-v-2b](https://huggingface.co/zai-org/glm-edge-v-2b)|
 |[ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat)|glm_edge_v|glm_edge_v|transformers>=4.46|&#x2718;|vision|[zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat)|
diff --git a/docs/source/Megatron-SWIFT/多模态模型.md b/docs/source/Megatron-SWIFT/多模态模型.md
@@ -1,6 +1,6 @@
 # 多模态模型
 
-ms-swift引入了Megatron的并行技术来加速多模态大模型的训练。目前支持Qwen2.5-VL, Qwen2.5-Omni等模型的CPT/SFT/DPO。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。
+ms-swift引入了Megatron的并行技术来加速多模态大模型的训练。目前支持Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v等模型的CPT/SFT/DPO。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。
 
 环境准备请参考Megatron-SWIFT的[快速开始文档](./快速开始.md)。
 
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -150,7 +150,8 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
 - 🔥vit_gradient_checkpointing: Whether to enable gradient_checkpointing for the vit part during multi-modal model training. Defaults to None, meaning it is set to `gradient_checkpointing`. For an example, please refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
   - Note: For multimodal models using LoRA training, when `--freeze_vit false` is set and the following warning appears in the command line: `UserWarning: None of the inputs have requires_grad=True. Gradients will be None`, please set `--vit_gradient_checkpointing false`, or raise a related issue. This problem does not occur during full-parameter training.
 - 🔥deepspeed: Defaults to None. It can be set to 'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', 'zero3_offload' to use the built-in deepspeed configuration file of ms-swift. You can also provide a path to a custom DeepSpeed configuration file.
-- zero_hpz_partition_size: Default is `None`. This parameter is a feature of `ZeRO++`, which implements model sharding within nodes and data sharding between nodes. If you encounter grad_norm `NaN` issues, please try using `--torch_dtype float16`
+- zero_hpz_partition_size: Default is `None`. This parameter is a feature of `ZeRO++`, which implements model sharding within nodes and data sharding between nodes. If you encounter grad_norm `NaN` issues, please try using `--torch_dtype float16`.
+- deepspeed_autotp_size: DeepSpeed tensor parallelism size, default is 1. When using DeepSpeed AutoTP, the argument `--deepspeed` must be set to 'zero0', 'zero1', or 'zero2'. (Note: This feature only supports full-parameter training.)
 - 🔥per_device_train_batch_size: Default is 1.
 - 🔥per_device_eval_batch_size: Default is 1.
 - 🔥gradient_accumulation_steps: Gradient accumulation, default is None, meaning set gradient_accumulation_steps such that total_batch_size >= 16. The total_batch_size equals `per_device_train_batch_size * gradient_accumulation_steps * world_size`. In GRPO Training, the default is 1.
diff --git a/docs/source_en/Instruction/Supported-models-and-datasets.md b/docs/source_en/Instruction/Supported-models-and-datasets.md
@@ -709,7 +709,7 @@ The table below introduces the models integrated with ms-swift:
 |[ZhipuAI/cogagent-9b-20241220](https://modelscope.cn/models/ZhipuAI/cogagent-9b-20241220)|glm4v|glm4v|transformers>=4.42|&#x2718;|-|[zai-org/cogagent-9b-20241220](https://huggingface.co/zai-org/cogagent-9b-20241220)|
 |[ZhipuAI/GLM-4.1V-9B-Base](https://modelscope.cn/models/ZhipuAI/GLM-4.1V-9B-Base)|glm4_1v|glm4_1v|transformers>=4.53|&#x2718;|-|[zai-org/GLM-4.1V-9B-Base](https://huggingface.co/zai-org/GLM-4.1V-9B-Base)|
 |[ZhipuAI/GLM-4.1V-9B-Thinking](https://modelscope.cn/models/ZhipuAI/GLM-4.1V-9B-Thinking)|glm4_1v|glm4_1v|transformers>=4.53|&#x2718;|-|[zai-org/GLM-4.1V-9B-Thinking](https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking)|
-|[ZhipuAI/GLM-4.5V](https://modelscope.cn/models/ZhipuAI/GLM-4.5V)|glm4_5v|glm4_5v|transformers>=4.56.0.dev|&#x2718;|-|[zai-org/GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)|
+|[ZhipuAI/GLM-4.5V](https://modelscope.cn/models/ZhipuAI/GLM-4.5V)|glm4_5v|glm4_5v|transformers>=4.56.0.dev|&#x2714;|-|[zai-org/GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)|
 |[ZhipuAI/GLM-4.5V-FP8](https://modelscope.cn/models/ZhipuAI/GLM-4.5V-FP8)|glm4_5v|glm4_5v|transformers>=4.56.0.dev|&#x2718;|-|[zai-org/GLM-4.5V-FP8](https://huggingface.co/zai-org/GLM-4.5V-FP8)|
 |[ZhipuAI/glm-edge-v-2b](https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b)|glm_edge_v|glm_edge_v|transformers>=4.46|&#x2718;|vision|[zai-org/glm-edge-v-2b](https://huggingface.co/zai-org/glm-edge-v-2b)|
 |[ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat)|glm_edge_v|glm_edge_v|transformers>=4.46|&#x2718;|vision|[zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat)|
diff --git a/docs/source_en/Megatron-SWIFT/Multimodal-Model.md b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md
@@ -1,6 +1,6 @@
 # Multimodal Models
 
-ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO for models such as Qwen2.5-VL, Qwen2.5-Omni. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md).
+ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO for models such as Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md).
 
 For environment setup, please refer to the Megatron-SWIFT [Quick Start guide](./Quick-start.md).
 
diff --git a/swift/llm/template/template/glm.py b/swift/llm/template/template/glm.py
@@ -262,7 +262,7 @@ def _jinja_encode(self, inputs: StdTemplateInputs):
 register_template(GLM4_1VTemplateMeta(MLLMTemplateType.glm4_1v, template_cls=GLM4_1VTemplate))
 
 
-class GLM4_5VTemplate(Template):
+class GLM4_5VTemplate(GLM4_5Template):
     placeholder_tokens = ['<|image|>']
     support_padding_free = True  # https://github.com/huggingface/transformers/issues/39685
     use_model = True
diff --git a/swift/megatron/model/constant.py b/swift/megatron/model/constant.py
@@ -7,3 +7,4 @@ class MegatronModelType:
     qwen2_5_omni = 'qwen2_5_omni'
 
     internvl3 = 'internvl3'
+    glm4_5v = 'glm4_5v'
diff --git a/swift/megatron/model/gpt/config.py b/swift/megatron/model/gpt/config.py
@@ -42,10 +42,10 @@ def convert_gpt_hf_config(config) -> Dict[str, Any]:
         n_shared_experts = res.pop('n_shared_experts')
     elif llm_architectures in {'Ernie4_5_ForCausalLM', 'Ernie4_5_MoeForCausalLM'}:
         res['rotary_interleaved'] = True
-    elif llm_architectures == 'Glm4MoeForCausalLM':
+    elif llm_architectures == 'Glm4MoeForCausalLM' or architectures == 'Glm4vMoeForConditionalGeneration':
         res['moe_router_score_function'] = 'sigmoid'
 
-    if architectures in {'Qwen2VLForConditionalGeneration', 'Qwen2_5_VLForConditionalGeneration', 'Qwen2_5OmniModel'}:
+    if (res.get('rope_scaling') or {}).get('mrope_section') is not None:
         res['position_embedding_type'] = 'mrope'
         res['mrope_section'] = res['rope_scaling']['mrope_section']
 
diff --git a/swift/megatron/model/mm_gpt/__init__.py b/swift/megatron/model/mm_gpt/__init__.py
@@ -1 +1 @@
-from . import internvl, qwen
+from . import glm, internvl, qwen
diff --git a/swift/megatron/model/mm_gpt/glm.py b/swift/megatron/model/mm_gpt/glm.py
@@ -0,0 +1,57 @@
+from megatron.training import get_args
+
+from swift.llm import ModelType, Template
+from ..constant import MegatronModelType
+from ..gpt.hf2mcore import set_layer_state as set_layer_state_hf2mcore
+from ..gpt.mcore2hf import set_layer_state as set_layer_state_mcore2hf
+from ..register import register_megatron_model
+from .utils import HuggingFaceModule, MMGPTMegatronModelMeta
+
+
+def convert_hf2mcore_glm4_5v(hf_model, mg_model):
+    language_model = hf_model.model.language_model
+    mg_language_model = mg_model.language_model
+    args = get_args()
+    mg_language_model.embedding.word_embeddings.weight.data.copy_(language_model.embed_tokens.weight)
+    if args.untie_embeddings_and_output_weights:
+        mg_language_model.output_layer.weight.data.copy_(hf_model.lm_head.weight)
+    mg_language_model.decoder.final_layernorm.weight.data.copy_(language_model.norm.weight)
+    for layer_idx in range(args.num_layers):
+        set_layer_state_hf2mcore(args, mg_language_model, language_model, layer_idx)
+    mg_model.visual.visual.load_state_dict(hf_model.model.visual.state_dict())
+
+
+def convert_mcore2hf_glm4_5v(hf_model, mg_model):
+    language_model = hf_model.model.language_model
+    mg_language_model = mg_model.language_model
+    args = get_args()
+    language_model.embed_tokens.weight.data.copy_(mg_language_model.embedding.word_embeddings.weight)
+    if args.untie_embeddings_and_output_weights:
+        hf_model.lm_head.weight.data.copy_(mg_language_model.output_layer.weight)
+    language_model.norm.weight.data.copy_(mg_language_model.decoder.final_layernorm.weight)
+    for layer_idx in range(args.num_layers):
+        set_layer_state_mcore2hf(args, mg_language_model, language_model, layer_idx)
+    hf_model.model.visual.load_state_dict(mg_model.visual.visual.state_dict())
+
+
+class Glm4_5vVit(HuggingFaceModule):
+    module_mapping = {'model.visual': 'visual'}
+    vision_tower = ['visual']
+    aligner = ['visual.merger']
+
+    def __init__(self, config):
+        from transformers.models.glm4v_moe import Glm4vMoeTextModel
+        super().__init__(config, Glm4vMoeTextModel)
+
+    def get_inputs_embeds(self, inputs_embeds, **kwargs):
+        return Template._get_inputs_embeds_hf(inputs_embeds, kwargs, self.visual, self.processor, self.model_config)
+
+
+register_megatron_model(
+    MMGPTMegatronModelMeta(
+        MegatronModelType.glm4_5v, [
+            ModelType.glm4_5v,
+        ],
+        convert_hf2mcore=convert_hf2mcore_glm4_5v,
+        convert_mcore2hf=convert_mcore2hf_glm4_5v,
+        visual_cls=Glm4_5vVit))
diff --git a/swift/megatron/utils/convert.py b/swift/megatron/utils/convert.py
@@ -146,7 +146,9 @@ def test_convert_precision(hf_model, mg_model, template, torch_dtype=torch.float
     ignore_modules = (model_arch.vision_tower + model_arch.aligner) if is_multimodal else []
 
     hf_modules = _find_modules(hf_model, ignore_modules=ignore_modules)
-    with torch.inference_mode(), _model_cpu_forward_context(hf_modules, torch_dtype, share_embedding=share_embedding):
+    with torch.inference_mode(), _model_cpu_forward_context(
+            hf_modules, torch_dtype, share_embedding=share_embedding), template.forward_context(hf_model, inputs):
+        inputs.pop('text_position_ids', None)
         hf_logits = hf_model(**inputs).logits
     hf_model.to('cpu')
 
@@ -210,7 +212,7 @@ def convert_hf2mcore(args: ExportArguments) -> None:
 
     megatron_model_meta = get_megatron_model_meta(args.model_type)
     assert megatron_model_meta is not None, f'Model: {args.model} is not supported.'
-    kwargs = megatron_model_meta.convert_hf_config(hf_model.config)
+    kwargs = megatron_model_meta.convert_hf_config(processor.model_info.config)
     logger.info(f'megatron_config: {kwargs}')
     _check_megatron_kwargs(kwargs)
     current_convert_kwargs = convert_kwargs.copy()
@@ -246,7 +248,7 @@ def convert_mcore2hf(args: ExportArguments) -> None:
 
     megatron_model_meta = get_megatron_model_meta(args.model_type)
     assert megatron_model_meta is not None, f'Model: {args.model} is not supported.'
-    kwargs = megatron_model_meta.convert_hf_config(hf_model.config)
+    kwargs = megatron_model_meta.convert_hf_config(processor.model_info.config)
     logger.info(f'megatron_config: {kwargs}')
     _check_megatron_kwargs(kwargs)
     current_convert_kwargs = convert_kwargs.copy()
diff --git a/tests/megatron/test_align/test_llm.py b/tests/megatron/test_align/test_llm.py
@@ -155,6 +155,10 @@ def test_internvl3_5_moe():
     _test_model('OpenGVLab/InternVL3_5-30B-A3B')
 
 
+def test_glm4_5v():
+    _test_model('ZhipuAI/GLM-4.5V')
+
+
 if __name__ == '__main__':
     # test_qwen2()
     # test_llama2()
@@ -185,4 +189,5 @@ def test_internvl3_5_moe():
     # test_qwen2_5_omni()
     # test_internvl3()
     # test_internvl3_5()
-    test_internvl3_5_moe()
+    # test_internvl3_5_moe()
+    test_glm4_5v()
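
Note on the `swift/megatron/model/gpt/config.py` hunk: the mrope branch no longer depends on a hard-coded set of architecture names. It now fires whenever the converted config carries a non-None `rope_scaling['mrope_section']`. The sketch below isolates that lookup so the None-safe behavior is easy to see. It is a minimal illustration, not the repository's code: `apply_mrope_settings` is a hypothetical helper name, and the `mrope_section` values are made-up sample data.

```python
from typing import Any, Dict


def apply_mrope_settings(res: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the mrope check from the diff above."""
    # `rope_scaling` may be absent or explicitly None; `or {}` covers both,
    # so the chained .get() never raises AttributeError on None.
    rope_scaling = res.get('rope_scaling') or {}
    if rope_scaling.get('mrope_section') is not None:
        res['position_embedding_type'] = 'mrope'
        res['mrope_section'] = rope_scaling['mrope_section']
    return res


# Sample configs (illustrative values only).
with_mrope = apply_mrope_settings({'rope_scaling': {'mrope_section': [16, 24, 24]}})
explicit_none = apply_mrope_settings({'rope_scaling': None})
missing_key = apply_mrope_settings({})
other_scaling = apply_mrope_settings({'rope_scaling': {'rope_type': 'linear'}})

print(with_mrope.get('position_embedding_type'))
print(explicit_none.get('position_embedding_type'))
print(missing_key.get('position_embedding_type'))
print(other_scaling.get('position_embedding_type'))
```

Only the first config gets `position_embedding_type = 'mrope'`; the other three are returned unchanged. Keying on the config field rather than the architecture name means a new mrope model (such as GLM-4.5V here) should not need another entry in an allow-list, provided its config exposes `mrope_section` the same way.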
