[NPU] Add optimized NPU mhc #1173

Open
lowdy1 wants to merge 1 commit into linkedin:main from lowdy1:mhc_npu

lowdy1 commented Mar 28, 2026

Add Ascend NPU Triton kernels for the mHC sub-operators:

  • Fused matmul + RMS normalization (forward/backward)
  • Sinkhorn routing with split pre/post/residual coefficients (forward/backward)
  • Pre-aggregate weighted sum (forward/backward)
  • Post + residual mixing (forward/backward)
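For readers unfamiliar with the routing step, here is a minimal pure-Python reference for Sinkhorn normalization: exponentiate the logits, then alternately normalize rows and columns until the matrix is approximately doubly stochastic. This is only a sketch of the general technique. The function name `sinkhorn`, the iteration count, and the exponential parametrization are assumptions for illustration and may not match what the kernels in this PR do.

```python
import math


def sinkhorn(logits, num_iters=20):
    """Hypothetical reference: alternate row/column normalization of exp(logits)."""
    n = len(logits)
    # Subtract the global max before exponentiating for numerical stability.
    m = max(max(row) for row in logits)
    mat = [[math.exp(x - m) for x in row] for row in logits]
    for _ in range(num_iters):
        # Row step: scale each row so it sums to 1.
        mat = [[x / sum(row) for x in row] for row in mat]
        # Column step: scale each column so it sums to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


# Small 3x3 example; the values are arbitrary.
routing = sinkhorn([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
```

A reference like this can be handy for checking a kernel's forward output on small inputs, since row and column sums should both be close to 1 after enough iterations.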

NPU optimizations applied:

  • Unified UB tiling via compute_default_tiling_strategy for matrix
  • Persistent grid-stride loops (tl.range + num_programs)
  • Adaptive BLOCK_N/BLOCK_M for core utilisation at small seq_len
  • Fused backward coefficient assembly kernel
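To illustrate the persistent grid-stride pattern, the sketch below simulates on the CPU how a fixed number of programs would share a list of tiles: program `pid` handles tiles `pid`, `pid + num_programs`, `pid + 2 * num_programs`, and so on. In a Triton kernel the same loop would typically be written with `tl.program_id`, `tl.num_programs`, and `tl.range`. The helper names `tiles_for_program` and `launch` are hypothetical and are not taken from this PR.

```python
def tiles_for_program(pid, num_programs, num_tiles):
    """Tiles visited by one persistent program, striding by the grid size."""
    return list(range(pid, num_tiles, num_programs))


def launch(num_programs, num_tiles):
    """Simulate a launch: map each program id to the tiles it would process."""
    return {
        pid: tiles_for_program(pid, num_programs, num_tiles)
        for pid in range(num_programs)
    }


# Example: 4 persistent programs covering 10 tiles.
schedule = launch(num_programs=4, num_tiles=10)
```

With this scheme the grid size can be pinned to the number of available cores while the tile count varies with the input shape, which is the usual motivation for persistent kernels.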

Hardware Type: Atlas 800I A2

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
