Commit 5057ca8


[NPU]: fused_add_rms_norm kernel distinguish the chunking strategy (#1100)

## Summary

Based on #1070.

The original kernel uses `n_cols` as `BLOCK_SIZE`. Because `n_cols` is small in the tests, they pass normally. In the benchmark, however, `n_cols` is larger, and running on the NPU causes a UB (unified buffer) overflow. This change therefore processes each row in chunks of `BLOCK_SIZE`. The goal is to maintain high performance for the smaller hidden sizes used by most models while also supporting larger hidden sizes.

## Testing Done

<img width="1564" height="439" alt="image" src="https://github.com/user-attachments/assets/9de1c501-db2f-4dc1-9808-f3bf6e5abd75" />

- Hardware Type: Atlas 800I A2
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
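The chunking idea from the summary can be sketched in plain Python for a single row. This is only an illustration, not the Triton kernel from this commit: the function names, the `max_block_size` limit, and the dispatch rule are hypothetical, and details of the real kernel (casting modes, weight offset, backward pass) are left out. It assumes the standard definition `y = s / sqrt(mean(s^2) + eps) * weight` with `s = x + residual`.

```python
import math


def fused_add_rms_norm_row(x, residual, weight, eps=1e-6, block_size=None):
    """Reference for one row, processed in chunks of block_size columns.

    Returns (y, s) where s = x + residual and y = s * rstd * weight.
    block_size=None means "one block covering the whole row".
    """
    n_cols = len(x)
    if block_size is None:
        block_size = max(n_cols, 1)

    s = [0.0] * n_cols
    sum_sq = 0.0
    # Pass 1: add the residual and accumulate the sum of squares per chunk,
    # so only block_size elements need to be resident at a time.
    for start in range(0, n_cols, block_size):
        end = min(start + block_size, n_cols)
        for i in range(start, end):
            s[i] = x[i] + residual[i]
            sum_sq += s[i] * s[i]

    rstd = 1.0 / math.sqrt(sum_sq / n_cols + eps)

    # Pass 2: revisit each chunk and apply the row-wide scale and the weight.
    y = [0.0] * n_cols
    for start in range(0, n_cols, block_size):
        end = min(start + block_size, n_cols)
        for i in range(start, end):
            y[i] = s[i] * rstd * weight[i]
    return y, s


def choose_block_size(n_cols, max_block_size):
    """Hypothetical dispatch: one block if the row fits, chunks otherwise."""
    return n_cols if n_cols <= max_block_size else max_block_size


x = [1.0, -2.0, 3.0, 0.5, 4.0, -1.5, 2.5]
residual = [0.5, 0.5, -1.0, 1.0, 0.0, 2.0, -0.5]
weight = [1.0] * len(x)

single, _ = fused_add_rms_norm_row(x, residual, weight)
chunked, _ = fused_add_rms_norm_row(
    x, residual, weight, block_size=choose_block_size(len(x), max_block_size=3)
)
```

The chunked path walks the row twice, which is the cost of never holding the full row at once; rows that fit in one block keep the single-pass behaviour, which matches the stated goal of keeping small hidden sizes fast.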

1 parent 2b63217 commit 5057ca8

1 file changed, +342 -50 lines changed

src/liger_kernel/ops/backends/_ascend/ops/fused_add_rms_norm.py
