arm64: optimize q4_k_q8_k kernel with i8mm #13886

Conversation
This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm (SMMLA) instruction.
Tested on Neoverse N2 with a Llama 3 8B Q4_K_M quantized model:
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
|  PP   |   TG   |  B   |      S_PP t/s       |      S_TG t/s       |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
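For readers unfamiliar with i8mm, below is a minimal, self-contained sketch of what the SMMLA instruction computes and why the kernel wants two output rows at a time. It is illustrative only, not the kernel code from this PR.

```
// Illustrative sketch of SMMLA (vmmlaq_s32): it multiplies a 2x8 int8 tile
// by another 2x8 int8 tile (transposed) and accumulates into a 2x2 int32
// tile, i.e. it produces dot products for two output rows per instruction.
// Build with: gcc -O2 -march=armv8.2-a+i8mm smmla_demo.c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // two rows of quantized weights (a) and two rows of quantized
    // activations (b), 16 int8 values each
    int8_t a[2][16], b[2][16];
    for (int i = 0; i < 16; ++i) {
        a[0][i] = (int8_t)(i - 8);      a[1][i] = (int8_t)(i);
        b[0][i] = (int8_t)(2 * i - 16); b[1][i] = (int8_t)(-i);
    }

    int32x4_t acc = vdupq_n_s32(0);
    for (int k = 0; k < 16; k += 8) {
        // pack rows a0,a1 and b0,b1 into 2x8 tiles (row 0 in the low 64 bits)
        int8x16_t ta = vcombine_s8(vld1_s8(&a[0][k]), vld1_s8(&a[1][k]));
        int8x16_t tb = vcombine_s8(vld1_s8(&b[0][k]), vld1_s8(&b[1][k]));
        acc = vmmlaq_s32(acc, ta, tb);  // acc += ta * tb^T (2x2 int32)
    }

    // acc lanes hold [a0.b0, a0.b1, a1.b0, a1.b1]; verify against scalar code
    int32_t ref[4] = {0, 0, 0, 0};
    for (int i = 0; i < 16; ++i) {
        ref[0] += a[0][i] * b[0][i]; ref[1] += a[0][i] * b[1][i];
        ref[2] += a[1][i] * b[0][i]; ref[3] += a[1][i] * b[1][i];
    }
    printf("smmla: %d %d %d %d\n", vgetq_lane_s32(acc, 0), vgetq_lane_s32(acc, 1),
           vgetq_lane_s32(acc, 2), vgetq_lane_s32(acc, 3));
    printf("ref  : %d %d %d %d\n", ref[0], ref[1], ref[2], ref[3]);
    return 0;
}
```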
Review comment thread on this hunk of the patch:

```
uint32_t utmp[4];

#if defined(__ARM_FEATURE_MATMUL_INT8)
    if (nrc == 2) {
```
@cyb70289: Naive question: if I understand correctly, this is the number of rows, and it has to be 2 to use SMMLA. How come we see gains with batch size 1 in prompt prefilling?
Prompt prefill is different from token generation. In PP, all the tokens are processed at once, so the activation shape is [batch_size, prompt_tokens, embedding_size]. I8MM is therefore useful for PP even if batch=1 (unless the prompt has only one token). For TG, the activation shape is [batch_size, 1, embedding_size], so I8MM only helps for batch > 1.
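As a concrete illustration of those shapes (a minimal sketch with arbitrary example numbers, not code from llama.cpp):

```
// Illustrative only: with batch_size = 1 and a 128-token prompt, the prefill
// matmul has 128 activation rows, so a 2-row SMMLA micro-kernel still applies;
// token generation contributes one activation row per batch element, so it
// needs batch_size >= 2 to pair rows.
#include <stdio.h>

int main(void) {
    const int batch_size = 1, prompt_tokens = 128;
    const int pp_rows = batch_size * prompt_tokens; // prefill: 128 rows
    const int tg_rows = batch_size * 1;             // generation: 1 row
    printf("prefill rows = %d (i8mm usable), generation rows = %d (needs batch >= 2)\n",
           pp_rows, tg_rows);
    return 0;
}
```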
Thank you very much @cyb70289 for taking the time to respond. That makes sense.
May I ask: what is nrc in the context of this micro-kernel? Is it the row count of the tile this micro-kernel is processing? So, if I understand correctly, the I8MM path is triggered only when the row count in the tile is == 2?
IIUC, this nrc is a constant, either 1 or 2, as set in the updated type_traits_cpu[] in this patch. It indicates the maximum number of rows this kernel can handle in one shot. It's not related to the tensor shape, but it can be reduced to 1 when the tensor is just a vector, even if the kernel can handle 2.
The framework will feed the kernel the appropriate number of rows (nrc) based on its reported capability and the actual data shape.
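To make that contract concrete, here is a hedged sketch in plain C; the names (vec_dot_fn, kernel_traits, q4_K_q8_K_stub) are hypothetical and only mirror the structure described above, not the actual ggml code:

```
// Hypothetical sketch of the nrc contract: the kernel advertises how many
// output rows it can produce per call (nrows), and the caller passes
// nrc = 1 or nrc = nrows depending on the actual tensor shape.
#include <stddef.h>
#include <stdio.h>

typedef void (*vec_dot_fn)(int n, float * s, const void * x, const void * y, int nrc);

static void q4_K_q8_K_stub(int n, float * s, const void * x, const void * y, int nrc) {
    (void)n; (void)x; (void)y;
    // a real kernel would take the SMMLA path when nrc == 2, 1-row path otherwise
    for (int r = 0; r < nrc; ++r) {
        s[r] = 0.0f; // placeholder result for output row r
    }
}

typedef struct {
    vec_dot_fn vec_dot;
    int        nrows;   // maximum rows per call, 2 when i8mm is available
} kernel_traits;

int main(void) {
    kernel_traits traits = { q4_K_q8_K_stub, 2 };

    // single-row activation (e.g. token generation, batch 1): fall back to nrc = 1
    int activation_rows = 1;
    int nrc = activation_rows >= traits.nrows ? traits.nrows : 1;

    float s[2] = { 0 };
    traits.vec_dot(256, s, NULL, NULL, nrc);
    printf("kernel invoked with nrc = %d\n", nrc);
    return 0;
}
```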
Got it, thanks. So basically SMMLA is in action only when nrc is exactly 2.
@cyb70289 Hi, I'm testing this patch on an N2 machine with a DeepSeek Q4_K model. It seems that sometimes it goes into your optimized branch and sometimes it falls back to the SVE branch; is this normal?
What's the batch size? For single batch, only the prompt prefill stage may enter the optimized path.