Commit 5057ca8
[NPU]: fused_add_rms_norm kernel distinguish the chunking strategy (#1100)
## Summary
Based on #1070
The original kernel uses n_cols as BLOCK_SIZE. Because n_cols is small
in the tests, they pass; in the benchmark, however, n_cols is larger and
the kernel overflows the UB (unified buffer) when running on the NPU.
This change therefore processes each row in chunks of BLOCK_SIZE. It
keeps performance high for the smaller hidden sizes used by most models
while still supporting larger hidden sizes.
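
The chunking idea can be illustrated with a plain-Python reference for a
single row. This is only a sketch, not the NPU kernel from this commit:
the function name, the `block_size` default, and the two-pass layout are
assumptions made for illustration, and it ignores dtype casting and any
weight offset the real op may apply.

```python
import math


def fused_add_rms_norm_row(x, residual, weight, eps=1e-6, block_size=4):
    """Hypothetical reference: add the residual, then RMS-normalize one row.

    The row is walked in chunks of `block_size`, so only one chunk needs
    to be resident at a time (the role BLOCK_SIZE plays in the kernel).
    """
    n_cols = len(x)
    summed = [0.0] * n_cols
    sum_sq = 0.0

    # Pass 1: per chunk, store x + residual and accumulate the sum of squares.
    for start in range(0, n_cols, block_size):
        end = min(start + block_size, n_cols)  # last chunk may be partial
        for i in range(start, end):
            s = x[i] + residual[i]
            summed[i] = s
            sum_sq += s * s

    # Reciprocal RMS over the whole row, computed once after all chunks.
    rstd = 1.0 / math.sqrt(sum_sq / n_cols + eps)

    # Pass 2: per chunk, normalize and scale by the weight.
    out = [0.0] * n_cols
    for start in range(0, n_cols, block_size):
        end = min(start + block_size, n_cols)
        for i in range(start, end):
            out[i] = summed[i] * rstd * weight[i]

    return out, summed
```

Setting `block_size` to `n_cols` reproduces the single-block behaviour
described for the original kernel; a smaller `block_size` gives the same
result while bounding how much of the row is handled at once.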
## Testing Done
<img width="1564" height="439" alt="image"
src="https://github.com/user-attachments/assets/9de1c501-db2f-4dc1-9808-f3bf6e5abd75"
/>
- Hardware Type: Atlas 800I A2
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

1 parent 2b63217 commit 5057ca8
1 file changed (+342 -50 lines) in src/liger_kernel/ops/backends/_ascend/ops