🚀 The feature, motivation and pitch
Description
When MTP (Multi-Token Prediction) is enabled, the original fused kernel fused_sigmoid_gating_delta_rule_update_kernel_0 is split into two separate kernels: fused_recurrent_gated_delta_rule_fwd_kernel_1 and fused_gdn_gating_kernel_1.
However, the combined latency of the two split kernels is about 1.76x that of the original fused kernel (15.61 ms vs. 8.88 ms in total), which is a severe performance regression.
Performance Numbers
Baseline (MTP disabled):
Kernel: fused_sigmoid_gating_delta_rule_update_kernel_0
Total duration: 8.876613 ms
Avg duration: 0.184929 ms
With MTP enabled (split into two kernels):
Kernels: fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1
Total duration (both kernels combined): 15.61422 ms
Avg duration (both kernels combined): 0.325296 ms
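To make the size of the regression explicit, here is a small arithmetic sketch over the four figures reported above. The variable names are mine, and the invocation count assumes that the average duration is the total divided by the number of kernel launches, which this issue does not state.

```python
# Figures copied from the profiler numbers reported above (milliseconds).
fused_total_ms = 8.876613   # fused_sigmoid_gating_delta_rule_update_kernel_0
fused_avg_ms = 0.184929
split_total_ms = 15.61422   # fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1
split_avg_ms = 0.325296

# Relative and absolute cost of the split path versus the fused kernel.
slowdown = split_total_ms / fused_total_ms
extra_total_ms = split_total_ms - fused_total_ms
extra_avg_ms = split_avg_ms - fused_avg_ms

# Assumption: avg = total / number of invocations, so the ratio
# recovers how many times each path was launched in the trace.
fused_calls = round(fused_total_ms / fused_avg_ms)
split_calls = round(split_total_ms / split_avg_ms)

print(f"slowdown: {slowdown:.2f}x")
print(f"extra latency: {extra_total_ms:.6f} ms total, {extra_avg_ms:.6f} ms per invocation")
print(f"invocations: fused={fused_calls}, split={split_calls}")
```

On these numbers the split path costs about 1.76x the fused kernel, roughly 0.14 ms extra per invocation, and both averages are consistent with 48 invocations, so the two measurements appear to cover the same number of calls.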
Expected Behavior
The decomposed operators (fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1) should have comparable or improved total latency relative to the original fused_sigmoid_gating_delta_rule_update_kernel_0 when MTP is enabled, rather than introducing substantial overhead.
Environment
Hardware: Ascend NPU
Framework: vLLM-Ascend / Ascend CANN
Model: Qwen3.5-27B BF16
I can't attach images to this issue. If you have any questions or need more details, please contact me.
Looking forward to official optimization suggestions or fixes. Thanks!
Alternatives
No response
Additional context
No response