[Feature]: Performance Regression: Operator fused_sigmoid_gating_delta_rule_update_kernel_0 split into fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1 with MTP enabled introduces significant latency increase #7900

@yuxingcyx

🚀 The feature, motivation and pitch

Description

When MTP (Multi-Token Prediction) is enabled, the original fused kernel fused_sigmoid_gating_delta_rule_update_kernel_0 is split into two separate kernels: fused_recurrent_gated_delta_rule_fwd_kernel_1 and fused_gdn_gating_kernel_1.
However, the combined latency of the two split kernels is about 1.76x that of the original fused kernel (15.61 ms vs. 8.88 ms total), which is a severe performance regression.

Performance Numbers

Baseline (MTP disabled):
    Kernel: fused_sigmoid_gating_delta_rule_update_kernel_0
    Total duration: 8.876613 ms
    Avg duration: 0.184929 ms

With MTP enabled (split into two kernels):
    Kernels: fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1
    Total duration: 15.61422 ms
    Avg duration: 0.325296 ms
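
For reference, the arithmetic behind the numbers above can be sketched as follows. This is only a sanity check on the reported figures, not profiler output; the variable and function names are made up for illustration, and it assumes "Avg duration" is the total divided by the number of invocations (with the MTP figures being the sum over both split kernels).

```python
# Figures copied from the profiling numbers above (all in milliseconds).
# Variable names are illustrative only, not from any profiler API.
baseline_total_ms = 8.876613
baseline_avg_ms = 0.184929
mtp_total_ms = 15.61422
mtp_avg_ms = 0.325296


def implied_calls(total_ms: float, avg_ms: float) -> int:
    """Invocation count implied by total / average (assumes avg = total / calls)."""
    return round(total_ms / avg_ms)


baseline_calls = implied_calls(baseline_total_ms, baseline_avg_ms)
mtp_calls = implied_calls(mtp_total_ms, mtp_avg_ms)

# Relative slowdown and absolute overhead of the split kernels.
slowdown = mtp_total_ms / baseline_total_ms
extra_total_ms = mtp_total_ms - baseline_total_ms
extra_per_call_ms = mtp_avg_ms - baseline_avg_ms

print(f"implied calls: baseline={baseline_calls}, mtp={mtp_calls}")
print(f"slowdown: {slowdown:.2f}x")
print(f"extra latency: {extra_total_ms:.3f} ms total, {extra_per_call_ms:.6f} ms per call")
```

Under that assumption, both configurations imply the same number of invocations, so the two totals cover the same amount of work and the slowdown is not caused by additional launches in the MTP run.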

Expected Behavior

The decomposed operators (fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1) should have comparable or improved total latency relative to the original fused_sigmoid_gating_delta_rule_update_kernel_0 when MTP is enabled, rather than introducing substantial overhead.

Environment

Hardware: Ascend NPU
Framework: vLLM-Ascend / Ascend CANN
Model: Qwen3.5-27B BF16

I can't add photos to this issue. If you have any questions, please contact me.
Looking forward to official optimization suggestions or fixes. Thanks!

Alternatives

No response

Additional context

No response
