Commit 90c98a6

[megatron] support GLM-5 megatron (#8085)
1 parent: d0eedd5

14 files changed: +301 −17 lines

docs/source/Instruction/Supported-models-and-datasets.md (1 addition, 1 deletion)

@@ -414,7 +414,7 @@
 |[ZhipuAI/GLM-4.7](https://modelscope.cn/models/ZhipuAI/GLM-4.7)|glm4_moe|glm4_7|transformers>=4.54|✔|-|[zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7)|
 |[ZhipuAI/GLM-4.7-FP8](https://modelscope.cn/models/ZhipuAI/GLM-4.7-FP8)|glm4_moe|glm4_7|transformers>=4.54|✘|-|[zai-org/GLM-4.7-FP8](https://huggingface.co/zai-org/GLM-4.7-FP8)|
 |[ZhipuAI/GLM-4.7-Flash](https://modelscope.cn/models/ZhipuAI/GLM-4.7-Flash)|glm4_moe_lite|glm4_7|transformers>=5.0.0.dev|✔|-|[zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)|
-|[ZhipuAI/GLM-5](https://modelscope.cn/models/ZhipuAI/GLM-5)|glm_moe_dsa|glm4_7|transformers>=5.2.0|✘|-|[zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)|
+|[ZhipuAI/GLM-5](https://modelscope.cn/models/ZhipuAI/GLM-5)|glm_moe_dsa|glm4_7|transformers>=5.2.0|✔|-|[zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)|
 |[ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat)|glm_edge|chatglm4|transformers>=4.46|✘|-|[zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat)|
 |[ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat)|glm_edge|chatglm4|transformers>=4.46|✘|-|[zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat)|
 |[codefuse-ai/CodeFuse-CodeGeeX2-6B](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeGeeX2-6B)|codefuse_codegeex2|codefuse|transformers<4.34|✘|coding|[codefuse-ai/CodeFuse-CodeGeeX2-6B](https://huggingface.co/codefuse-ai/CodeFuse-CodeGeeX2-6B)|

docs/source/Megatron-SWIFT/Command-line-parameters.md (5 additions, 0 deletions)

@@ -187,6 +187,11 @@
 - moe_pad_expert_input_to_capacity: Pad each expert's input so that its length aligns with the expert capacity length. Default is False. This option only takes effect when `--moe_expert_capacity_factor` is set.
 - moe_token_drop_policy: Options are 'probs' and 'position'. Default is 'probs'.
 
+**DSA Parameters**
+- dsa_indexer_loss_coeff: Coefficient of the DSA indexer KL-divergence loss. Set to 0 to disable the indexer loss. Default is None.
+- dsa_indexer_use_sparse_loss: Whether to use the sparse DSA indexer loss. If True, the indexer loss is computed over the top-k indices. Default is False.
+
+
 **MTP Parameters**
 - mtp_num_layers: Number of multi-token prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to predict D additional tokens in turn. Default is None. (requires "megatron-core>=0.14")
 - Note: the value of mtp_num_layers is not read automatically from config.json and must be set manually; you can fill it in by referring to the `num_nextn_predict_layers` field in config.json. When using mcore-bridge, MTP weights are loaded from the safetensors files first; if they cannot be found, they are randomly initialized. (To use blockwise fp8 + mtp, use mcore>=0.15.)
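The two DSA parameters above control an auxiliary objective: the indexer's score distribution is pulled toward the true attention distribution via a KL term, optionally restricted to the top-k indices. A toy sketch of that idea (hypothetical helper, not the actual Megatron-LM implementation; here a coefficient of `None` is also treated as disabled for simplicity):

```python
import torch
import torch.nn.functional as F


def indexer_kl_loss(index_scores, attn_scores, loss_coeff, topk_indices=None):
    """Toy KL(attn || indexer) loss over each query's key distribution.

    index_scores, attn_scores: [batch, seqlen, seqlen] raw scores.
    topk_indices: [batch, seqlen, k]; if given, restrict the loss to the
    top-k keys (the sparse variant toggled by dsa_indexer_use_sparse_loss).
    """
    if not loss_coeff:  # None or 0 disables the loss (toy simplification)
        return index_scores.new_zeros(())
    if topk_indices is not None:
        index_scores = index_scores.gather(-1, topk_indices)
        attn_scores = attn_scores.gather(-1, topk_indices)
    log_p = F.log_softmax(index_scores, dim=-1)  # indexer distribution
    q = F.softmax(attn_scores, dim=-1)           # target attention distribution
    kl = (q * (q.clamp_min(1e-9).log() - log_p)).sum(-1)  # per query position
    return loss_coeff * kl.mean()
```

When the indexer scores already match the attention scores, the loss is near zero; the coefficient simply scales the gradient pressure on the indexer.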

docs/source/Megatron-SWIFT/Quick-start.md (1 addition, 1 deletion)

@@ -66,7 +66,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
 | python | >=3.9 | 3.10/3.11 | |
 | cuda | | cuda12 | |
 | torch | >=2.0 | 2.8.0 | |
-| transformer_engine | >=2.3 | 2.10.0 | |
+| transformer_engine | >=2.3 | 2.12.0 | |
 | apex | | 0.1 | |
 | megatron_core | >=0.12,<0.16 | 0.15 | |
 | flash_attn | | 2.8.3/3.0.0b1 | |
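The minimum/recommended pairs in the table can be checked mechanically. A small convenience sketch (not part of ms-swift; assumes the third-party `packaging` library is installed):

```python
from packaging.version import Version


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if an installed version satisfies a table row's minimum."""
    return Version(installed) >= Version(minimum)


# Examples against the table above:
print(meets_minimum('2.12.0', '2.3'))   # transformer_engine recommended vs >=2.3
print(meets_minimum('0.15', '0.12'))    # megatron_core recommended vs >=0.12
```

`packaging` handles pre-release tags like `3.0.0b1` correctly, which naive tuple comparison of dotted strings does not.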

docs/source_en/Instruction/Supported-models-and-datasets.md (1 addition, 1 deletion)

@@ -415,7 +415,7 @@ The table below introduces the models integrated with ms-swift:
 |[ZhipuAI/GLM-4.7](https://modelscope.cn/models/ZhipuAI/GLM-4.7)|glm4_moe|glm4_7|transformers>=4.54|✔|-|[zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7)|
 |[ZhipuAI/GLM-4.7-FP8](https://modelscope.cn/models/ZhipuAI/GLM-4.7-FP8)|glm4_moe|glm4_7|transformers>=4.54|✘|-|[zai-org/GLM-4.7-FP8](https://huggingface.co/zai-org/GLM-4.7-FP8)|
 |[ZhipuAI/GLM-4.7-Flash](https://modelscope.cn/models/ZhipuAI/GLM-4.7-Flash)|glm4_moe_lite|glm4_7|transformers>=5.0.0.dev|✔|-|[zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)|
-|[ZhipuAI/GLM-5](https://modelscope.cn/models/ZhipuAI/GLM-5)|glm_moe_dsa|glm4_7|transformers>=5.2.0|✘|-|[zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)|
+|[ZhipuAI/GLM-5](https://modelscope.cn/models/ZhipuAI/GLM-5)|glm_moe_dsa|glm4_7|transformers>=5.2.0|✔|-|[zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)|
 |[ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat)|glm_edge|chatglm4|transformers>=4.46|✘|-|[zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat)|
 |[ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat)|glm_edge|chatglm4|transformers>=4.46|✘|-|[zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat)|
 |[codefuse-ai/CodeFuse-CodeGeeX2-6B](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeGeeX2-6B)|codefuse_codegeex2|codefuse|transformers<4.34|✘|coding|[codefuse-ai/CodeFuse-CodeGeeX2-6B](https://huggingface.co/codefuse-ai/CodeFuse-CodeGeeX2-6B)|

docs/source_en/Megatron-SWIFT/Command-line-parameters.md (5 additions, 0 deletions)

@@ -198,6 +198,11 @@ For guidance on selecting parallelization strategies, please refer to the [Train
 - moe_pad_expert_input_to_capacity: Pad the input of each expert so that its length aligns with the expert capacity length. Default is `False`. This option only takes effect if `--moe_expert_capacity_factor` is set.
 - moe_token_drop_policy: Options are 'probs' and 'position'. Default is 'probs'.
 
+**DSA Parameters**
+- dsa_indexer_loss_coeff: Coefficient for the DSA indexer KL divergence loss. Set to 0 to disable indexer loss. Default is None.
+- dsa_indexer_use_sparse_loss: Whether to use sparse DSA indexer loss. If True, the indexer loss will be computed using the top-k indices. Default is False.
+
+
 **MTP Parameters**
 - mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None. (requires "megatron-core>=0.14")

docs/source_en/Megatron-SWIFT/Quick-start.md (1 addition, 1 deletion)

@@ -66,7 +66,7 @@ Recommended Operating Environment:
 | python | >=3.9 | 3.10/3.11 | |
 | cuda | | cuda12 | |
 | torch | >=2.0 | 2.8.0 | |
-| transformer_engine | >=2.3 | 2.10.0 | |
+| transformer_engine | >=2.3 | 2.12.0 | |
 | apex | | 0.1 | |
 | megatron_core | >=0.12,<0.16 | 0.15 | |
 | flash_attn | | 2.8.3/3.0.0b1 | |

swift/megatron/arguments/megatron_args.py (4 additions, 0 deletions)

@@ -528,6 +528,10 @@ class MegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin):
     attn_impl: Optional[str] = None
     gradient_checkpointing_kwargs: Optional[Union[dict, str]] = None
 
+    # dsa
+    dsa_indexer_loss_coeff: Optional[float] = None
+    dsa_indexer_use_sparse_loss: bool = False
+
     # other
     check_model: bool = True
     torch_dtype: Optional[Union[torch.dtype, str]] = None
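The `Optional[float] = None` typing of the new coefficient is deliberate: `None` leaves the model default in place, while an explicit `0` disables the loss. A standalone sketch of that three-way convention (hypothetical class, mirroring only the two new fields):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DSAArgs:
    # Mirrors the two new MegatronArguments fields above.
    dsa_indexer_loss_coeff: Optional[float] = None
    dsa_indexer_use_sparse_loss: bool = False

    def indexer_loss_enabled(self) -> bool:
        # None -> defer to the model's default coefficient (loss stays on);
        # 0 -> explicitly disabled; any other float -> enabled at that weight.
        return self.dsa_indexer_loss_coeff != 0
```

This is the same pattern used by other optional overrides in the arguments class: only a concrete value changes behavior, and zero is the explicit off switch.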

swift/megatron/init.py (159 additions, 0 deletions)

@@ -136,13 +136,22 @@ def forward(
             core_attn_out = self._checkpointed_attention_forward(
                 query, key, value, attention_mask, packed_seq_params=packed_seq_params)
         else:
+            extra_kwargs = {}
+            if self.config.experimental_attention_variant == 'dsa':
+                # For dsa we need to pass in the original hidden states and the compressed
+                # query representation.
+                extra_kwargs['x'] = hidden_states
+                extra_kwargs['qr'] = q_compressed
+                # for easy injection of rotary_pos_emb (patch)
+                packed_seq_params = (packed_seq_params, rotary_pos_emb)
             core_attn_out = self.core_attention(
                 query,
                 key,
                 value,
                 attention_mask,
                 packed_seq_params=packed_seq_params,
                 attn_mask_type=attn_mask_type,
+                **extra_kwargs,
             )
         if thd_qkv_format:
             if core_attn_out.ndim == 2:
@@ -789,6 +798,152 @@ def _new_load_inline(*args, **kwargs):
     cpp_extension.load_inline = load_inline
 
 
+def _patch_dsa():
+    from megatron.core.models.common.embeddings.rope_utils import apply_rotary_pos_emb
+    from megatron.core.models.gpt import experimental_attention_variant_module_specs
+    from megatron.core.packed_seq_params import PackedSeqParams
+    from megatron.core.tensor_parallel.mappings import gather_from_sequence_parallel_region
+    from megatron.core.transformer.experimental_attention_variant.dsa import rotate_activation
+    DSAIndexer = experimental_attention_variant_module_specs.DSAIndexer
+
+    class NewDSAIndexer(DSAIndexer):
+
+        def forward_before_topk(
+            self,
+            x: torch.Tensor,
+            qr: torch.Tensor,
+            packed_seq_params: Optional[PackedSeqParams] = None,
+        ):
+            """All computations before topk."""
+            # =========================================
+            # Gather inputs if sp is enabled
+            # =========================================
+            packed_seq_params, rotary_pos_emb = packed_seq_params  # patch
+            assert packed_seq_params is None, 'Packed sequence is not supported for DSAttention'
+
+            if self.config.sequence_parallel and self.pg_collection.tp.size() > 1:
+                x = gather_from_sequence_parallel_region(x, group=self.pg_collection.tp)
+                qr = gather_from_sequence_parallel_region(qr, group=self.pg_collection.tp)
+
+            # =========================================
+            # Get sequence length and batch size
+            # =========================================
+            seqlen, bsz, _ = x.size()
+
+            # =========================================
+            # q linear and apply rope to q
+            # =========================================
+            # [seqlen, batch, q_lora_rank] -> [seqlen, batch, index_n_heads * index_head_dim]
+            q, _ = self.linear_wq_b(qr)
+            # [seqlen, batch, index_n_heads * index_head_dim]
+            # -> [seqlen, batch, index_n_heads, index_head_dim]
+            q = q.reshape(seqlen, bsz, self.index_n_heads, self.index_head_dim)
+            q = self._apply_rope(q, rotary_pos_emb)  # mscale will be passed in by patch
+
+            # =========================================
+            # k linear and apply rope to k
+            # =========================================
+            # [seqlen, batch, hidden_size] -> [seqlen, batch, index_head_dim]
+            k, _ = self.linear_wk(x)
+            k = self.k_norm(k)
+            # [seqlen, batch, index_head_dim] -> [seqlen, batch, 1, index_head_dim]
+            k = k.reshape(seqlen, bsz, 1, self.index_head_dim)
+            k = self._apply_rope(k, rotary_pos_emb)
+            # [seqlen, batch, 1, index_head_dim] -> [seqlen, batch, index_head_dim]
+            k = k.reshape(seqlen, bsz, self.index_head_dim)
+
+            # =========================================
+            # Rotate activation
+            # =========================================
+            q = rotate_activation(q)
+            k = rotate_activation(k)
+
+            # =========================================
+            # Prepare weights for index scores
+            # =========================================
+            # [seqlen, batch, hidden_size] -> [seqlen, batch, index_n_heads]
+            weights, _ = self.linear_weights_proj(x)
+            weights = weights * (self.index_n_heads**-0.5) * self.softmax_scale
+
+            return q, k, weights
+
+        def _apply_rope(self, x: torch.Tensor, rotary_pos_emb: torch.Tensor):
+            """Apply RoPE to the input tensor."""
+            # x_nope [seqlen, batch, *, index_head_dim - qk_pos_emb_head_dim]
+            # x_pe [seqlen, batch, *, qk_pos_emb_head_dim]
+            x_pe, x_nope = torch.split(
+                x, [self.index_head_dim - self.qk_pos_emb_head_dim, self.qk_pos_emb_head_dim], dim=-1)
+            x_pe = apply_rotary_pos_emb(
+                x_pe,
+                rotary_pos_emb,
+                config=self.config,
+                cu_seqlens=None,
+                cp_group=self.pg_collection.cp,
+            )
+            # [seqlen, batch, *, index_head_dim]
+            x = torch.cat([x_pe, x_nope], dim=-1)
+            return x
+
+        def forward_with_scores(
+            self,
+            x: torch.Tensor,
+            qr: torch.Tensor,
+            mask: Optional[torch.Tensor] = None,
+            packed_seq_params: Optional[PackedSeqParams] = None,
+        ) -> Tuple[torch.Tensor, torch.Tensor]:
+            """
+            Forward pass for DSA Indexer that returns both index scores and top-k indices.
+
+            This is used when KL loss is enabled to compare indexer scores with true attention scores.
+
+            Args:
+                x: hidden states [seqlen, batch, hidden_size].
+                qr: Low-rank query tensor [seqlen, batch, q_lora_rank].
+                mask: Attention mask [batch, seqlen, seqlen].
+                packed_seq_params: Packed sequence parameters for variable length sequences.
+
+            Returns:
+                index_scores: Index scores [batch, seqlen, seqlen].
+                topk_indices: Top-k indices [batch, seqlen, index_topk].
+            """
+            try:
+                from megatron.core.transformer.experimental_attention_variant.dsa import fused_qk_topk_naive
+            except ImportError:
+                raise ImportError('fused_qk_topk_naive is not available. Please install megatron-core from source. '
+                                  '`pip install git+https://github.com/NVIDIA/Megatron-LM.git`')
+            # [seqlen, batch, index_n_heads * index_head_dim]
+            # [seqlen, batch, index_head_dim]
+            # [seqlen, batch, index_n_heads]
+            q, k, weights = self.forward_before_topk(x, qr, packed_seq_params)
+
+            # [batch, seqlen, seqlen], [batch, seqlen, index_topk]
+            index_scores, topk_indices = fused_qk_topk_naive(q, k, weights, self.index_topk, mask)
+
+            return index_scores, topk_indices
+
+        def forward(self,
+                    x: torch.Tensor,
+                    qr: torch.Tensor,
+                    mask: Optional[torch.Tensor] = None,
+                    packed_seq_params: Optional[PackedSeqParams] = None):
+            """
+            Forward pass for DSA Indexer.
+
+            Args:
+                x: hidden states [seqlen, batch, hidden_size].
+                qr: Low-rank query tensor [seqlen, batch, q_lora_rank].
+                mask: Attention mask [batch, seqlen, seqlen].
+                packed_seq_params: Packed sequence parameters for variable length sequences.
+
+            Returns:
+                topk_indices: Top-k indices for sparse attention [batch, seqlen, index_topk].
+            """
+            _, topk_indices = self.forward_with_scores(x, qr, mask, packed_seq_params)
+            return topk_indices
+
+    experimental_attention_variant_module_specs.DSAIndexer = NewDSAIndexer
+
+
 def init_megatron_env():
     os.environ.pop('VLLM_USE_MODELSCOPE', None)
     logging_level = logging.root.level
@@ -804,6 +959,10 @@ def init_megatron_env():
     _patch_mrope()
     _patch__write_item()
     _patch_mtp()
+    try:
+        _patch_dsa()
+    except ImportError:
+        pass
     logging.root.setLevel(logging_level)  # revert logger level
     from swift.megatron import tuners  # patch lora
     try:
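The indexer's scoring step that the patch delegates to megatron-core's `fused_qk_topk_naive` reduces to: per-head q·k similarities, a learned per-head mixing weight, then a top-k over key positions. A toy, unfused sketch of that step (hypothetical function; shapes follow the comments in the patched code above):

```python
import torch


def naive_qk_topk(q, k, weights, topk, mask=None):
    """Toy version of the indexer's score + top-k step.

    q: [seqlen_q, batch, heads, dim]; k: [seqlen_k, batch, dim];
    weights: [seqlen_q, batch, heads] per-head mixing weights.
    Returns scores [batch, seqlen_q, seqlen_k] and top-k key indices.
    """
    # Per-head similarities: sim[b, h, s, t] = <q[s, b, h], k[t, b]>
    sim = torch.einsum('sbhd,tbd->bhst', q, k)
    # Combine heads with the learned weights -> [batch, seqlen_q, seqlen_k]
    scores = torch.einsum('bhst,sbh->bst', sim, weights)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    topk_indices = scores.topk(min(topk, scores.size(-1)), dim=-1).indices
    return scores, topk_indices
```

The fused kernel computes the same quantities in one pass; the sparse attention then only attends to the returned `topk_indices` per query position.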

swift/megatron/model/gpt_bridge.py (29 additions, 6 deletions)

@@ -737,7 +737,7 @@ def _get_hf_grouped(self):
         if self.model_type in {
                 'qwen2_moe', 'qwen3_moe', 'deepseek_v2', 'deepseek_v3', 'dots1', 'ernie4_5_moe', 'glm4_moe',
                 'glm4_moe_lite', 'glm4v_moe', 'minimax_m2', 'olmoe', 'qwen3_next', 'kimi_vl', 'qwen3_omni_moe',
-                'qwen3_5_moe'
+                'qwen3_5_moe', 'glm_moe_dsa'
         }:
             return False, False
         return None, None
@@ -1257,6 +1257,22 @@ def _set_mlp_state(
         hf_state_dict = self._add_prefix(hf_state_dict, hf_prefix)
         return hf_state_dict
 
+    def _set_indexer(self, mg_indexer, hf_state_dict, hf_prefix: str, to_mcore: bool):
+        if to_mcore:
+            hf_state_dict = self._remove_prefix(hf_state_dict, hf_prefix)
+        else:
+            hf_state_dict = {}
+        self._set_state_dict(mg_indexer, 'linear_wq_b.weight', hf_state_dict, 'wq_b.weight', to_mcore)
+        self._set_state_dict(mg_indexer, 'linear_wk.weight', hf_state_dict, 'wk.weight', to_mcore)
+        self._set_state_dict(mg_indexer, 'k_norm.weight', hf_state_dict, 'k_norm.weight', to_mcore)
+        self._set_state_dict(mg_indexer, 'k_norm.bias', hf_state_dict, 'k_norm.bias', to_mcore)
+        self._set_state_dict(mg_indexer, 'linear_weights_proj.weight', hf_state_dict, 'weights_proj.weight', to_mcore)
+        if to_mcore:
+            hf_state_dict = {}
+        else:
+            hf_state_dict = self._add_prefix(hf_state_dict, hf_prefix)
+        return hf_state_dict
+
     def _set_mla_attn_state(
         self,
         mg_attn,
@@ -1279,11 +1295,18 @@ def _set_mla_attn_state(
                             to_mcore)
         self._set_state_dict(mg_attn, 'linear_kv_up_proj.weight', hf_state_dict, 'kv_b_proj.weight', to_mcore)
         if self.config.qk_layernorm:
-            if self.config.q_lora_rank is not None:
-                self._set_state_dict(mg_attn, 'linear_q_up_proj.layer_norm_weight', hf_state_dict,
-                                     'q_a_layernorm.weight', to_mcore)
-            self._set_state_dict(mg_attn, 'linear_kv_up_proj.layer_norm_weight', hf_state_dict, 'kv_a_layernorm.weight',
-                                 to_mcore)
+            if self.config.experimental_attention_variant == 'dsa':
+                if self.config.q_lora_rank is not None:
+                    self._set_state_dict(mg_attn, 'q_layernorm.weight', hf_state_dict, 'q_a_layernorm.weight', to_mcore)
+                self._set_state_dict(mg_attn, 'kv_layernorm.weight', hf_state_dict, 'kv_a_layernorm.weight', to_mcore)
+            else:
+                if self.config.q_lora_rank is not None:
+                    self._set_state_dict(mg_attn, 'linear_q_up_proj.layer_norm_weight', hf_state_dict,
+                                         'q_a_layernorm.weight', to_mcore)
+                self._set_state_dict(mg_attn, 'linear_kv_up_proj.layer_norm_weight', hf_state_dict,
+                                     'kv_a_layernorm.weight', to_mcore)
+        if self.config.experimental_attention_variant == 'dsa':
+            hf_state_dict.update(self._set_indexer(mg_attn.core_attention.indexer, hf_state_dict, 'indexer.', to_mcore))
         if to_mcore:
             hf_state_dict = {}
         else:
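The new `_set_indexer` follows the bridge's general pattern: on the HF→mcore direction, strip the sub-module prefix before mapping tensors; on the way back, rebuild the HF keys by re-attaching it. A standalone sketch of just that prefix handling (hypothetical helper names, illustrating the mechanism rather than the bridge's actual methods):

```python
def remove_prefix(state_dict, prefix):
    """Keep only keys under `prefix`, with the prefix stripped (HF -> mcore)."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


def add_prefix(state_dict, prefix):
    """Re-attach `prefix` to every key (mcore -> HF export)."""
    return {prefix + k: v for k, v in state_dict.items()}
```

For example, an HF checkpoint key `indexer.wk.weight` becomes `wk.weight` inside the indexer scope, gets assigned to the mcore module's `linear_wk.weight`, and is exported back under `indexer.wk.weight`.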
