
[NPU] Add NPU optimized mHC #1168

Draft
noemotiovon wants to merge 1 commit into linkedin:main from noemotiovon:mhc

Conversation

@noemotiovon
Contributor

@noemotiovon noemotiovon commented Mar 26, 2026

Add Ascend NPU Triton kernels for the four mHC sub-operators:

  • Fused matmul + RMS normalization (forward/backward)
  • Sinkhorn routing with split pre/post/residual coefficients (forward/backward)
  • Pre-aggregate weighted sum (forward/backward)
  • Post + residual mixing (forward/backward)
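The Sinkhorn routing step alternately normalizes the rows and columns of the routing score matrix until it is approximately doubly stochastic. A minimal pure-Python sketch of that iteration (illustrative only; the PR implements it as a Triton kernel with split pre/post/residual coefficients, which this sketch does not model):

```python
import math

def sinkhorn(scores, n_iters=20):
    """Alternate row/column normalization so the positive matrix
    approaches a doubly stochastic routing matrix (illustrative sketch)."""
    # Exponentiate logits to get a strictly positive matrix.
    m = [[math.exp(s) for s in row] for row in scores]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Normalize each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
        m = [[m[i][j] / col_sums[j] for j in range(n_cols)] for i in range(n_rows)]
    return m
```

After enough iterations both row and column sums are close to 1, which is what makes the result usable as a balanced routing plan.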

NPU optimizations applied:

  • Unified UB tiling via compute_default_tiling_strategy
  • Persistent grid-stride loops (tl.range + num_programs)
  • Cache modifiers (.cg/.ca/.cs) on all loads/stores
  • Adaptive BLOCK_N/BLOCK_M for core utilisation at small seq_len
  • Fused backward coefficient assembly kernel (replaces ~20 PyTorch ops with 1 Triton kernel)
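The persistent grid-stride pattern and the adaptive block size can be illustrated outside Triton. The sketch below is pure Python with hypothetical helper names and a hypothetical shrinking heuristic (the PR's actual logic lives in `compute_default_tiling_strategy` and is not reproduced here): `adaptive_block_n` halves the tile size until small `seq_len` still yields enough tiles to occupy every core, and `persistent_tiles` lists the tiles one persistent program visits, mirroring `tl.range(pid, num_tiles, num_programs)`:

```python
def adaptive_block_n(seq_len, num_cores, max_block=128, min_block=16):
    """Pick a BLOCK_N that keeps all cores busy at small seq_len
    (hypothetical heuristic, not the PR's exact formula)."""
    block = max_block
    # Shrink the block while it produces fewer tiles than cores.
    while block > min_block and (seq_len + block - 1) // block < num_cores:
        block //= 2
    return block

def persistent_tiles(num_tiles, num_programs, pid):
    """Tiles visited by persistent program `pid` in a grid-stride loop."""
    return list(range(pid, num_tiles, num_programs))
```

Every tile is visited exactly once across the programs, so a fixed-size persistent grid covers an arbitrarily large tile count without relaunching kernels.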

Hardware Type: Atlas 800I A2

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@noemotiovon
Contributor Author

Hi @lowdy1, could you continue my work?

@lowdy1
Contributor

lowdy1 commented Mar 27, 2026

> Hi @lowdy1, could you continue my work?

If you don’t have the bandwidth right now, I'm glad to continue working on this.

@noemotiovon
Contributor Author

> Hi @lowdy1, could you continue my work?
>
> If you don't have the bandwidth right now, I'm glad to continue working on this.

Thank you!
