Commit a6b6383
[ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (pytorch#158250)
This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch, boosting LLM prefill and decode performance.
**This change improves decode throughput by ~15% and reduces the memory required to run inference on the model by 50%.**
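For context, here is a minimal sketch of what an INT4 symmetric channelwise matmul with BF16 activations computes, written in plain eager PyTorch rather than through the KleidiAI-backed kernels this PR adds; the layer shape, helper names, and int8 storage of the 4-bit values are illustrative assumptions, not the actual kernel path.
```python
import torch

def quantize_int4_symmetric_channelwise(w: torch.Tensor):
    """Quantize an [out_features, in_features] FP32 weight to INT4 values in [-8, 7]
    with one symmetric scale per output channel (stored as int8 here for clarity;
    the real kernels pack two 4-bit values per byte)."""
    max_abs = w.abs().amax(dim=1, keepdim=True)      # per-output-channel range
    scale = max_abs / 7.0                            # symmetric scheme: zero-point is 0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def int4_linear_bf16(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reference matmul: dequantize, multiply, and return BF16 activations.
    The KleidiAI kernels are intended to produce the same BF16 result without
    materializing the dequantized weight matrix."""
    w_bf16 = (q.to(torch.float32) * scale).to(torch.bfloat16)
    return x @ w_bf16.t()

# Illustrative layer shape; any Linear weight works the same way.
w = torch.randn(4096, 4096)
q, scale = quantize_int4_symmetric_channelwise(w)
x = torch.randn(1, 4096, dtype=torch.bfloat16)
y = int4_linear_bf16(x, q, scale)                    # BF16 in, BF16 out
```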
### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```
### Detailed Results
| Metric | With `--compile` | Without `--compile` |
|----------------------------------|---------------------------|---------------------------|
| Quantization Scheme | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision | BF16 | BF16 |
| Number of Layers Quantized | 32 | 32 |
| Average Compression Ratio | 87.49% | 87.49% |
| Total Quantization Time (s) | 9.62 | 10.32 |
| Compile Time (First) (s) | 134.48 | 1.69 |
| Compile Time (Second) (s) | 80.44 | 1.60 |
| Compile Time (Subsequent) (s) | 0.19 | 0.22 |
| Prefill Tokens | 54 | 54 |
| Decoded Tokens | 33 | 33 |
| Prefill Time (s) | 0.19 | 0.22 |
| Decode Time (s) | 0.76 | 1.38 |
| E2E Generation Time (s) | 0.95 | 1.60 |
| Prefill Throughput (tokens/s) | 288.13 | 249.91 |
| Decode Throughput (tokens/s) | 43.42 | 23.83 |
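As a sanity check on the reported 87.49% average compression ratio, the back-of-the-envelope arithmetic below reproduces it; the FP32 baseline, the 4096x4096 layer shape, and one BF16 scale per output channel are assumptions of this sketch, not figures taken from the PR.
```python
# Back-of-the-envelope storage for one [out, in] = [4096, 4096] linear weight.
out_features, in_features = 4096, 4096

fp32_bytes  = out_features * in_features * 4    # 32-bit baseline weights (assumed)
int4_bytes  = out_features * in_features // 2   # two 4-bit values per byte
scale_bytes = out_features * 2                  # one BF16 scale per output channel (assumed)

compression = 1 - (int4_bytes + scale_bytes) / fp32_bytes
print(f"compression = {compression:.2%}")        # ~87.49%
```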
Pull Request resolved: pytorch#158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01
Co-authored-by: Nikhil Gupta <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
File tree (9 files changed: +664, −137 lines):
- aten/src/ATen/native
  - cpu
  - kleidiai
- test/inductor
- torch