[NPU] Add optimized NPU mhc #1173

Open

lowdy1 wants to merge 1 commit into linkedin:main from

Contributor

lowdy1 commented Mar 28, 2026

Add Ascend NPU Triton kernels for the mHC sub-operators:

Fused matmul + RMS normalization (forward/backward)
Sinkhorn routing with split pre/post/residual coefficients (forward/backward)
Pre-aggregate weighted sum (forward/backward)
Post + residual mixing (forward/backward)
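
For reviewers who want a reference point for the routing step, here is a minimal pure-Python sketch. It assumes "Sinkhorn" means the standard Sinkhorn-Knopp iteration: alternating row and column normalization of a positive matrix until it is approximately doubly stochastic. It is not the Triton kernel from this PR, and the names `sinkhorn_reference` and `num_iters` are hypothetical.

```python
import math


def sinkhorn_reference(logits, num_iters=20):
    """Illustrative Sinkhorn-Knopp normalization of a square matrix.

    Assumption: the kernel in the PR follows the same basic scheme;
    this function is only a readable reference, not the PR's code.
    """
    n = len(logits)
    # Exponentiate so every entry is strictly positive; subtracting the
    # global max keeps exp() from overflowing.
    peak = max(max(row) for row in logits)
    mat = [[math.exp(v - peak) for v in row] for row in logits]

    for _ in range(num_iters):
        # Row step: make every row sum to 1.
        mat = [[v / sum(row) for v in row] for row in mat]
        # Column step: make every column sum to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


example_logits = [
    [0.1, 0.5, -0.3, 0.2],
    [0.0, -0.4, 0.7, 0.1],
    [0.3, 0.2, 0.1, -0.5],
    [-0.2, 0.6, 0.0, 0.4],
]
mixing = sinkhorn_reference(example_logits, num_iters=50)
```

After enough iterations, each row and each column of `mixing` sums to approximately 1, which gives a simple property to check a kernel's output against.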

NPU optimizations applied:

Unified UB tiling via compute_default_tiling_strategy for matrix
Persistent grid-stride loops (tl.range + num_programs)
Adaptive BLOCK_N/BLOCK_M for core utilisation at small seq_len
Fused backward coefficient assembly kernel
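
The scheduling ideas in the list above can be sketched as a plain-Python simulation. It is not Triton code and not the implementation in this PR. `grid_stride_tiles` mimics how a persistent program would walk tiles with a stride equal to the number of launched programs. `pick_block_size` shows one possible way to shrink a block size so that a small problem still produces enough tiles to occupy every core. Both function names, the halving rule, and the default limits are assumptions made for illustration.

```python
def grid_stride_tiles(pid, num_programs, num_tiles):
    """Tiles visited by program `pid` in a persistent grid-stride loop.

    In a Triton kernel the equivalent loop would typically be written with
    tl.range(pid, num_tiles, num_programs); here it is simulated on the host.
    """
    return list(range(pid, num_tiles, num_programs))


def pick_block_size(n_rows, num_cores, max_block=64, min_block=1):
    """Hypothetical heuristic: halve the block until there are at least
    as many tiles as cores, or the minimum block size is reached."""
    block = max_block
    while block > min_block and -(-n_rows // block) < num_cores:
        block //= 2
    return block


# Simulate 4 persistent programs sharing 10 tiles.
num_programs, num_tiles = 4, 10
schedule = {
    pid: grid_stride_tiles(pid, num_programs, num_tiles)
    for pid in range(num_programs)
}

# A small problem gets a smaller block; a large one keeps the maximum.
small_block = pick_block_size(n_rows=128, num_cores=20)
large_block = pick_block_size(n_rows=100000, num_cores=20)
```

In this simulation every tile is visited exactly once across programs, and the launch size stays fixed regardless of problem size, which is the property the persistent-loop pattern relies on.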

Hardware Type: Atlas 800I A2

run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence


          add mhc npu

d0bf52a

