[Feature]: Performance Regression: Operator fused_sigmoid_gating_delta_rule_update_kernel_0 split into fused_recurrent_gated_delta_rule_fwd_kernel_1 + fused_gdn_gating_kernel_1 with MTP enabled introduces significant latency increase #7900

### 🚀 The feature, motivation and pitch

### Description

When **MTP (Multi-Token Prediction)** is enabled, the original fused kernel `fused_sigmoid_gating_delta_rule_update_kernel_0` is replaced by two separate kernels: `fused_recurrent_gated_delta_rule_fwd_kernel_1` and `fused_gdn_gating_kernel_1`.
The combined latency of the two split kernels is about **1.76x** that of the original fused kernel (15.61 ms vs. 8.88 ms total), which is a **severe performance regression**.

### Performance Numbers

**Baseline (MTP disabled):**

- Kernel: `fused_sigmoid_gating_delta_rule_update_kernel_0`
- Total duration: **8.876613 ms**
- Avg duration: **0.184929 ms**

**With MTP enabled (split into two kernels):**

- Kernels: `fused_recurrent_gated_delta_rule_fwd_kernel_1` + `fused_gdn_gating_kernel_1`
- Total duration: **15.61422 ms**
- Avg duration: **0.325296 ms**
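
For reference, the slowdown implied by these figures can be checked with a few lines of Python. This is only a sketch over the numbers reported above; the variable names are illustrative, and the invocation count rests on the assumption that "Avg duration" means total duration divided by the number of kernel invocations.

```python
# Figures copied from the measurements above, in milliseconds.
baseline_total_ms = 8.876613   # fused kernel, MTP disabled
baseline_avg_ms = 0.184929
mtp_total_ms = 15.61422        # two split kernels combined, MTP enabled
mtp_avg_ms = 0.325296

# Relative slowdown of the split kernels versus the fused kernel.
slowdown_total = mtp_total_ms / baseline_total_ms
slowdown_avg = mtp_avg_ms / baseline_avg_ms

# Extra time added per invocation by the split.
extra_per_call_ms = mtp_avg_ms - baseline_avg_ms

# Assumption: avg = total / invocations, so the count can be recovered.
implied_calls = round(baseline_total_ms / baseline_avg_ms)

print(f"slowdown (total): {slowdown_total:.2f}x")
print(f"slowdown (avg):   {slowdown_avg:.2f}x")
print(f"extra per call:   {extra_per_call_ms:.6f} ms")
print(f"implied calls:    {implied_calls}")
```

Both ratios come out at roughly 1.76x. Under the stated assumption, both configurations were measured over the same number of invocations, so the totals are directly comparable.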

### Expected Behavior
With MTP enabled, the decomposed operators (`fused_recurrent_gated_delta_rule_fwd_kernel_1` + `fused_gdn_gating_kernel_1`) should have a total latency comparable to or better than that of the original `fused_sigmoid_gating_delta_rule_update_kernel_0`, rather than adding substantial overhead.

### Environment
- Hardware: Ascend NPU
- Framework: vLLM-Ascend / Ascend CANN
- Model: Qwen3.5-27B BF16

I can't add photos to this issue. If you have any questions, please contact me.
Looking forward to official optimization suggestions or fixes. Thanks!



### Alternatives

_No response_

### Additional context

_No response_
