Commit 5057ca8

[NPU]: fused_add_rms_norm kernel distinguish the chunking strategy (#1100)
## Summary

Based on #1070.

The original kernel uses `n_cols` as `BLOCK_SIZE`. Because `n_cols` is small in the tests, they pass. In the benchmark, however, `n_cols` is larger, and running on the NPU causes a UB (unified buffer) overflow. This change therefore processes each row in chunks of `BLOCK_SIZE`. It keeps performance high for the smaller hidden sizes used by most models while still supporting larger hidden sizes.

## Testing Done

<img width="1564" height="439" alt="image" src="https://github.com/user-attachments/assets/9de1c501-db2f-4dc1-9808-f3bf6e5abd75" />

- Hardware Type: Atlas 800I A2
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
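
The chunking idea in the Summary can be illustrated with a minimal pure-Python sketch. This is not the Triton kernel from this commit: the function names, the `max_block_size` threshold, and the two-pass layout are assumptions made for illustration. The sketch shows that accumulating the sum of squares chunk by chunk gives the same result as treating the whole row as one block, while only ever touching `block_size` elements at a time.

```python
import math


def fused_add_rms_norm_whole_row(x_row, r_row, weight, eps=1e-6):
    """Whole-row variant: the entire row is handled as a single block."""
    # Residual add: s = x + r
    s = [x + r for x, r in zip(x_row, r_row)]
    # Reciprocal RMS over the full row
    rstd = 1.0 / math.sqrt(sum(v * v for v in s) / len(s) + eps)
    y = [v * rstd * w for v, w in zip(s, weight)]
    return y, s


def fused_add_rms_norm_chunked(x_row, r_row, weight, block_size, eps=1e-6):
    """Chunked variant: each row is processed block_size columns at a time."""
    n_cols = len(x_row)
    s = [0.0] * n_cols
    sum_sq = 0.0
    # Pass 1: residual add and sum-of-squares accumulation, chunk by chunk
    for start in range(0, n_cols, block_size):
        end = min(start + block_size, n_cols)  # last chunk may be partial
        for i in range(start, end):
            s[i] = x_row[i] + r_row[i]
            sum_sq += s[i] * s[i]
    rstd = 1.0 / math.sqrt(sum_sq / n_cols + eps)
    # Pass 2: normalize and scale, again chunk by chunk
    y = [0.0] * n_cols
    for start in range(0, n_cols, block_size):
        end = min(start + block_size, n_cols)
        for i in range(start, end):
            y[i] = s[i] * rstd * weight[i]
    return y, s


def fused_add_rms_norm(x_row, r_row, weight, max_block_size=4, eps=1e-6):
    """Hypothetical dispatcher: pick the strategy from the row width.

    max_block_size stands in for whatever limit the on-chip buffer
    imposes; the value 4 is arbitrary and chosen only for the demo.
    """
    if len(x_row) <= max_block_size:
        return fused_add_rms_norm_whole_row(x_row, r_row, weight, eps)
    return fused_add_rms_norm_chunked(x_row, r_row, weight, max_block_size, eps)
```

In this sketch the whole-row path avoids the second loop over chunks for narrow rows, and the chunked path bounds the working set for wide rows. Whether the real kernel uses the same two-pass structure is not stated in the description above.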
1 parent 2b63217 commit 5057ca8

1 file changed: +342 -50 lines changed