
Training a MoE model with transformers 5.2.0: loss becomes NaN when using torch_npu's fused MoE operators #10248

@piekey1994

Description


Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.5.dev0
  • Platform: Linux-5.4.119-19.0009.44-aarch64-with-glibc2.35
  • Python version: 3.11.13
  • PyTorch version: 2.8.0+cpu (NPU)
  • Transformers version: 5.2.0
  • Datasets version: 3.2.0
  • Accelerate version: 1.12.0
  • PEFT version: 0.18.1
  • NPU type: Ascend910B3
  • CANN version: 8.3.RC2
  • TRL version: 0.24.0
  • DeepSpeed version: 0.18.6+unknown
  • vLLM version: 0.12.0
  • Default data directory: detected

Reproduction

Referring to the earlier implementation in src/llamafactory/v1/plugins/model_plugins/kernels/ops/mlp/npu_fused_moe.py, I adapted the operator-replacement code for qwen3-moe under 5.2.0 as follows:

import torch
import torch_npu

# GmmFunction is the grouped-matmul autograd function from npu_fused_moe.py.


class NpuMoeFused5_2:
    """Container for NPU fused MoE forward functions."""

    @staticmethod
    def npu_moe_experts_forward(
        self,
        hidden_states: torch.Tensor,
        top_k_index: torch.Tensor,
        top_k_weights: torch.Tensor,
    ) -> torch.Tensor:
        # Sort tokens by assigned expert so each expert sees a contiguous slice.
        permuted_hidden_states, row_ids_map = torch_npu.npu_moe_token_permute(
            hidden_states, top_k_index.to(torch.int32)
        )
        # Per-expert token counts for the grouped matmul.
        tokens_per_expert = torch.histc(
            top_k_index, bins=self.num_experts, min=0, max=self.num_experts
        )
        # Grouped matmul for the fused gate/up projection, then SwiGLU activation.
        intermediate_hidden_states = GmmFunction.apply(
            permuted_hidden_states, self.gate_up_proj.transpose(1, 2), tokens_per_expert
        )
        intermediate_activations = torch_npu.npu_swiglu(intermediate_hidden_states, dim=-1)
        # Grouped matmul for the down projection.
        output = GmmFunction.apply(
            intermediate_activations, self.down_proj.transpose(1, 2), tokens_per_expert
        )
        # Restore the original token order and apply the routing weights.
        next_states = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=top_k_weights)
        return next_states

# Before transformers 5.0 the whole sparse MoE block is patched;
# from 5.x on, only the experts module is replaced.
if not is_transformers_version_greater_than("5.0.0"):
    kernel_moe_mapping["Qwen3MoeForCausalLM"] = {
        "Qwen3MoeSparseMoeBlock": Qwen3NpuMoeFused.qwen3moe_sparse_moe_block_forward
    }
else:
    kernel_moe_mapping["Qwen3MoeForCausalLM"] = {
        "Qwen3MoeExperts": NpuMoeFused5_2.npu_moe_experts_forward
    }

Then I ran SFT with FSDP + LoRA. The first step trains normally, but at the second step the loss is NaN. While debugging I also found that at the second step the hidden_states are already NaN before entering the MLP, i.e. during attention. My gradient accumulation is 2, so in theory the first step should not have updated the model at all, and I don't understand why this happens. If I disable this fused MoE operator, training runs fine, just very slowly.
Has anyone run into this problem? I also replaced the MoE operator in 3.5 the same way, and the loss becomes NaN there too.
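To tell whether the NaN really originates in attention or is merely propagated from the previous backward pass, one option is to abort at the first non-finite activation and scan gradients after each backward. This is my own debugging sketch, not from the report; install_nan_hooks and check_grads are hypothetical helper names.

import torch

def install_nan_hooks(model):
    # Raise at the first module whose output contains NaN/Inf, naming the culprit.
    def make_hook(name):
        def hook(module, args, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite output in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

def check_grads(model):
    # Call after loss.backward(); non-finite grads here implicate the fused backward.
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print("non-finite grad:", name)

If check_grads already flags non-finite gradients after the first backward, the NaN likely originates in GmmFunction's backward and is baked into the weights at the first optimizer step, which would be consistent with the second step's forward pass failing as early as attention.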

Others

No response

Labels: bug (Something isn't working), npu (This problem is related to NPU devices), pending (This problem is yet to be addressed)
