ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemm_q4_0_4x4_q8_0() #10624
base: master
Conversation
Force-pushed from 34e3241 to a42dca4.
Looks good to me.
Adding the missing diff:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 11152385e..675c5d8e9 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3808,7 +3808,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
         return GGML_TYPE_Q4_0_8_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-        return GGML_TYPE_Q4_0_4_8;
+        //return GGML_TYPE_Q4_0_4_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
         return GGML_TYPE_Q4_0_4_4;
```

```shell
make -j llama-bench && ./bin/llama-bench -m ../models/llama-3.2-1b-instruct/ggml-model-q4_0.gguf -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -t 2,4,8,16 -p 128 -n 32 -r 10
```

build: a42dca4 (4240)

Not sure if these PP speed variations are anything other than noise. Perplexity remains the same.
Force-pushed from a42dca4 to 1f6855f.
The inline asm version looks much more unrolled than the one we get with this PR and
How is the performance before and after on your end?
…q4_0_4x4_q8_0()
Signed-off-by: Adrien Gallouët <[email protected]>
Force-pushed from 1f6855f to c0df25a.
@angt let me double-check the latest on the X-Elite real quick and we'll merge it.
I confirm that the t/s is better in the inline asm version (
@max-krasnyansky we can wait for a better version of the PR, I need to dig into the PP case.
Hi!
It's the same kind of PR as the one for ggml_gemv_q4_0_4x4_q8_0().