Reminder
- I have read the above rules and searched the existing issues.
System Info
- `llamafactory` version: 0.9.5.dev0
- Platform: Linux-5.4.119-19.0009.44-aarch64-with-glibc2.35
- Python version: 3.11.13
- PyTorch version: 2.8.0+cpu (NPU)
- Transformers version: 5.2.0
- Datasets version: 3.2.0
- Accelerate version: 1.12.0
- PEFT version: 0.18.1
- NPU type: Ascend910B3
- CANN version: 8.3.RC2
- TRL version: 0.24.0
- DeepSpeed version: 0.18.6+unknown
- vLLM version: 0.12.0
- Default data directory: detected
Reproduction
Referring to the earlier implementation in src/llamafactory/v1/plugins/model_plugins/kernels/ops/mlp/npu_fused_moe.py, I adapted the fused-MoE replacement code for qwen3-moe under transformers 5.2.0 as follows:
```python
class NpuMoeFused5_2:
    """Container for NPU fused MoE forward functions."""

    @staticmethod
    def npu_moe_experts_forward(
        self,
        hidden_states: torch.Tensor,
        top_k_index: torch.Tensor,
        top_k_weights: torch.Tensor,
    ) -> torch.Tensor:
        # Permute tokens so that tokens routed to the same expert are contiguous.
        permuted_hidden_states, row_ids_map = torch_npu.npu_moe_token_permute(
            hidden_states, top_k_index.to(torch.int32)
        )
        tokens_per_expert = torch.histc(
            top_k_index, bins=self.num_experts, min=0, max=self.num_experts
        )
        # Grouped matmul per expert: gate/up projection, SwiGLU activation, down projection.
        intermediate_hidden_states = GmmFunction.apply(
            permuted_hidden_states, self.gate_up_proj.transpose(1, 2), tokens_per_expert
        )
        intermediate_activations = torch_npu.npu_swiglu(intermediate_hidden_states, dim=-1)
        output = GmmFunction.apply(
            intermediate_activations, self.down_proj.transpose(1, 2), tokens_per_expert
        )
        # Un-permute back to the original token order, weighted by the router probabilities.
        next_states = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=top_k_weights)
        return next_states


if not is_transformers_version_greater_than("5.0.0"):
    kernel_moe_mapping["Qwen3MoeForCausalLM"] = {
        "Qwen3MoeSparseMoeBlock": Qwen3NpuMoeFused.qwen3moe_sparse_moe_block_forward
    }
else:
    kernel_moe_mapping["Qwen3MoeForCausalLM"] = {
        "Qwen3MoeExperts": NpuMoeFused5_2.npu_moe_experts_forward
    }
```
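One way to isolate the kernel is to compare the patched forward against the stock eager implementation on identical inputs, for both the outputs and the input gradients. A minimal sketch, assuming the Qwen3MoeExperts forward takes (hidden_states, top_k_index, top_k_weights) as in the patch above; `compare_moe_paths` is a hypothetical helper, not part of LLaMA-Factory:

```python
import torch

def compare_moe_paths(experts_module, fused_forward, hidden_states, top_k_index, top_k_weights):
    """Hypothetical helper: run the stock experts forward and the fused NPU
    forward on identical inputs and report forward/backward discrepancies."""
    ref_in = hidden_states.detach().clone().requires_grad_(True)
    fused_in = hidden_states.detach().clone().requires_grad_(True)

    # Reference path: the unpatched transformers implementation.
    ref_out = experts_module(ref_in, top_k_index, top_k_weights)
    # Fused path: the patched forward, called as an unbound method.
    fused_out = fused_forward(experts_module, fused_in, top_k_index, top_k_weights)

    print("forward max abs diff:", (ref_out - fused_out).abs().max().item())

    ref_out.float().sum().backward()
    fused_out.float().sum().backward()
    print("input-grad max abs diff:", (ref_in.grad - fused_in.grad).abs().max().item())
```

If the forward matches but the input gradient diverges or contains NaN, the custom GmmFunction backward is the first place to look.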
I then ran SFT with FSDP + LoRA. The first step is normal, but at the second step the loss becomes NaN. While debugging I found that at the second step the hidden_states already contain NaN before entering the MLP, i.e. during attention. My gradient accumulation is 2, so in theory the first step applies no update to the model at all, and I don't understand why this happens. Training runs if I disable this fused MoE kernel, but it is very slow.
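To pin down where the first non-finite value appears (the report above points at attention, before the MLP), a forward-hook probe can raise as soon as any module emits NaN/Inf. A minimal sketch in plain PyTorch; `install_nan_probes` is a hypothetical name:

```python
import torch

def install_nan_probes(model):
    """Hypothetical debugging helper: raise on the first module whose output
    contains a non-finite value, so the faulty layer is named in the traceback."""
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            outs = output if isinstance(output, tuple) else (output,)
            for t in outs:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite output first produced by: {name}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # each handle supports .remove() to uninstall the probe
```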
Has anyone run into this problem? I replaced the 3.5 MoE kernel in the same way and the loss also becomes NaN.
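Since the corruption shows up one micro-batch after the first backward pass even though no optimizer step should have happened yet, it may also be worth checking whether that first backward already produces non-finite gradients (e.g. from the custom GmmFunction backward). A small check to run right after loss.backward(); `first_nonfinite_grad` is a hypothetical helper:

```python
import torch

def first_nonfinite_grad(model):
    """Hypothetical check: return the name of the first parameter whose
    gradient contains NaN/Inf after backward, or None if all are finite."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            return name
    return None
```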
Others
No response
Metadata
Labels: bug (Something isn't working), npu (This problem is related to NPU devices), pending (This problem is yet to be addressed)