ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) #16739
Conversation
Signed-off-by: Alberto Cabrera <[email protected]>
```c
        q4sb_scales[i] = vmovl_s8(vld1_s8(aux_q4sb));
    }

    const uint8_t *q4_base = q4_ptr[b].qs + sb * QK_K;
```
Fix a few instances of this code style:
```diff
-const uint8_t *q4_base = q4_ptr[b].qs + sb * QK_K;
+const uint8_t * q4_base = q4_ptr[b].qs + sb * QK_K;
```
Applied clang-format. Sorry about that!
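For reference, the requested `type * name` spacing corresponds to clang-format's "middle" pointer alignment. A minimal sketch of the relevant option, assuming the repository's `.clang-format` configures it this way (worth verifying against the actual file):

```yaml
# Assumed .clang-format excerpt, not copied from the repo:
PointerAlignment: Middle
```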
@ggerganov is there something else needed from my side, or are we waiting for another review?
There seems to be a bug somewhere. Here is a repro on M4 Max:

```sh
../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none
```

```
...
# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C
```
I was able to replicate the PPL skyrocketing with the generic implementation as well. I'll try to figure out what is going on.

Edit: It also happens with Q4_0 repack. Interesting that it happens from the second chunk onwards. I'll try to run on an AVX machine and see if it's something totally unrelated to the GEMMs themselves. I also compared the tensor outputs of all mul mats for a couple of llama-eval-callback runs and the results were practically identical, except for a 0.0001 deviation here and there. What I don't understand is how I was able to run the PPL with LFM correctly before; I may have messed up GGML_CPU_REPACK in the build, sorry about that.
Hm yes -
This PR improves q4_k_q8_k gemm and gemv on arm64 using i8mm and vecdot instructions.
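As a rough illustration of the two building blocks involved (not the PR's actual kernels, which also handle q4_K's sub-block scales and the repacked, row-interleaved layout), here is a minimal standalone sketch of `vmmlaq_s32` (i8mm SMMLA, a 2x2 int32 GEMM tile) and `vdotq_s32` (SDOT, the 4-way dot product used for GEMV); the build line is illustrative:

```c
// build (assumed flags): cc -O2 -march=armv8.2-a+dotprod+i8mm smmla_demo.c
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    int8_t a_bytes[16], b_bytes[16];
    for (int i = 0; i < 16; i++) {
        a_bytes[i] = (int8_t)(i + 1);
        b_bytes[i] = 1;
    }

    int8x16_t a = vld1q_s8(a_bytes);
    int8x16_t b = vld1q_s8(b_bytes);

    // GEMM building block. SMMLA views `a` as a 2x8 int8 matrix and `b` as
    // the transpose of one, producing: acc (2x2 int32, row-major in the four
    // lanes) += A (2x8) * B^T (8x2).
    int32x4_t acc = vdupq_n_s32(0);
    acc = vmmlaq_s32(acc, a, b);

    // GEMV building block. SDOT: lane i of `dot` += a[4i..4i+3] . b[4i..4i+3].
    int32x4_t dot = vdupq_n_s32(0);
    dot = vdotq_s32(dot, a, b);

    int32_t out[4];
    vst1q_s32(out, acc);
    printf("smmla 2x2 tile: %d %d %d %d\n", out[0], out[1], out[2], out[3]);
    vst1q_s32(out, dot);
    printf("sdot lanes:     %d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

The point of the repack layout is that several weight rows are interleaved, so a single vector load pulls bytes from multiple rows and can feed these instructions directly.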
Tested on an Apple M4 with the Liquid LFM2-1.2B model:
Master build: 8cf6b42 (6824)
This PR: c4f1358
Perplexity remains unchanged (tested current build vs master):
As for test-backend-ops, I've checked the output of the layer tensors by manually comparing REPACK vs master, since #16182 is still ongoing.
Any suggestions on how to better test the PR are welcome.
Edit: CI failures seem completely unrelated.