
Conversation

@angt (Collaborator) commented on Dec 2, 2024

Hi!
It's the same kind of PR as for ggml_gemv_q4_0_4x4_q8_0().
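
For readers following along, here is a minimal, self-contained sketch of the general technique: computing the int8 dot-product building block with NEON intrinsics (the SDOT instruction via `vdotq_s32`) instead of hand-written asm. This is illustrative only and not the PR's code; the real `ggml_gemm_q4_0_4x4_q8_0()` works on 4x4-interleaved Q4_0 blocks against Q8_0 activations. It assumes a dotprod-capable target (e.g. built with `-march=armv8.2-a+dotprod`).

```c
// Toy illustration (not the PR's kernel): the int8 dot product that the
// q4_0 x q8_0 GEMM/GEMV kernels build on, written with NEON intrinsics
// rather than inline asm. Requires a dotprod-capable AArch64 target.
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

// Sums a[i]*b[i] over 16 signed bytes using SDOT via vdotq_s32.
static int32_t dot16_i8(const int8_t * a, const int8_t * b) {
    int8x16_t va  = vld1q_s8(a);
    int8x16_t vb  = vld1q_s8(b);
    int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);
    return vaddvq_s32(acc); // horizontal add of the 4 partial sums
}

int main(void) {
    int8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = (int8_t) i; b[i] = (int8_t) (i + 1); }
    printf("%d\n", dot16_i8(a, b)); // prints 1360
    return 0;
}
```

With intrinsics, register allocation and instruction scheduling are left to the compiler, which is the usual argument for dropping inline asm.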

@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Dec 2, 2024
@angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics-in-ggml_gemm_q4_0_4x4_q8_0 branch from 34e3241 to a42dca4 on December 2, 2024 at 21:27
@max-krasnyansky (Collaborator)

Looks good to me.

@ggerganov (Member)

After adding the missing return statement and applying the patch below to force the Q4_0_4_4 packing, I get these results on M2 Ultra:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 11152385e..675c5d8e9 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3808,7 +3808,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             return GGML_TYPE_Q4_0_4_4;
```

```sh
make -j llama-bench && ./bin/llama-bench -m ../models/llama-3.2-1b-instruct/ggml-model-q4_0.gguf -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -t 2,4,8,16 -p 128 -n 32 -r 10
```
| model | size | backend | threads | test | master t/s | PR t/s |
| --- | --- | --- | ---: | --- | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | CPU | 2 | pp128 | 274.96 ± 4.16 | 245.61 ± 3.91 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 2 | tg32 | 80.76 ± 0.11 | 79.67 ± 0.23 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 4 | pp128 | 537.08 ± 9.66 | 479.83 ± 3.89 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 4 | tg32 | 145.58 ± 0.39 | 145.09 ± 0.50 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 8 | pp128 | 899.55 ± 12.69 | 875.52 ± 7.99 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 8 | tg32 | 207.08 ± 0.36 | 203.30 ± 0.54 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 16 | pp128 | 1106.11 ± 81.65 | 1499.54 ± 52.45 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 16 | tg32 | 211.66 ± 5.73 | 213.04 ± 2.08 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 2 | pp128 | 96.39 ± 0.87 | 85.91 ± 1.27 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 2 | tg32 | 35.39 ± 0.27 | 35.00 ± 0.19 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 4 | pp128 | 194.96 ± 0.30 | 166.19 ± 0.77 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 4 | tg32 | 65.25 ± 0.22 | 64.41 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 8 | pp128 | 332.51 ± 5.40 | 315.59 ± 2.74 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 8 | tg32 | 91.37 ± 0.17 | 91.83 ± 0.21 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 16 | pp128 | 504.89 ± 4.86 | 589.85 ± 3.01 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 16 | tg32 | 89.63 ± 1.62 | 92.72 ± 1.03 |

build: a42dca4 (4240)

Not sure if these PP speed variations are anything other than noise.

Perplexity remains the same.

@angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics-in-ggml_gemm_q4_0_4x4_q8_0 branch from a42dca4 to 1f6855f on December 3, 2024 at 13:25
@angt (Collaborator, Author) commented on Dec 3, 2024

The inline asm version looks much more unrolled than what we get with this PR and `-O3`.
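
If it helps to experiment, here is a hedged sketch of one way to push the compiler toward similar unrolling; the loop and names are placeholders, not the actual kernel. GCC honors `#pragma GCC unroll N` on the following loop (clang has `#pragma clang loop unroll_count(N)`).

```c
// Illustrative only: a stand-in for the inner accumulation loop, with an
// explicit unroll hint. Not the real ggml_gemm_q4_0_4x4_q8_0() code.
#include <arm_neon.h>
#include <stdint.h>

int32_t sum_blocks(const int8_t * a, const int8_t * b, int nb) {
    int32x4_t acc = vdupq_n_s32(0);
    #pragma GCC unroll 4                   // ask GCC to unroll 4 iterations
    for (int i = 0; i < nb; ++i) {
        int8x16_t va = vld1q_s8(a + 16*i);
        int8x16_t vb = vld1q_s8(b + 16*i);
        acc = vdotq_s32(acc, va, vb);      // SDOT accumulate (needs +dotprod)
    }
    return vaddvq_s32(acc);
}
```

Comparing the disassembly (`objdump -d`, or building with `-S`) of the old asm path and the intrinsics path is probably the quickest way to see where the compiler stops unrolling.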

@ggerganov (Member)

How is the performance before and after on your end?

@angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics-in-ggml_gemm_q4_0_4x4_q8_0 branch from 1f6855f to c0df25a on December 9, 2024 at 19:34
@max-krasnyansky (Collaborator)

@angt let me double check the latest on the X-Elite real quick and we'll merge it.

@angt (Collaborator, Author) commented on Dec 9, 2024

I confirm that the t/s is better with the inline asm version (b4239 is 991f8aab) for PP :(

| Model | Threads | Test | t/s 1f6855f | t/s 991f8aa | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0_4_4 | 2 | pp512 | 119.48 | 142.03 | 1.19 |
| llama 1B Q4_0_4_4 | 2 | tg128 | 39.80 | 39.97 | 1.00 |
| llama 1B Q4_0_4_4 | 4 | pp512 | 241.24 | 282.65 | 1.17 |
| llama 1B Q4_0_4_4 | 4 | tg128 | 74.77 | 75.40 | 1.01 |
| llama 1B Q4_0_4_4 | 8 | pp512 | 451.38 | 530.47 | 1.18 |
| llama 1B Q4_0_4_4 | 8 | tg128 | 124.83 | 127.18 | 1.02 |
| llama 1B Q4_0_4_4 | 16 | pp512 | 821.19 | 962.76 | 1.17 |
| llama 1B Q4_0_4_4 | 16 | tg128 | 188.16 | 186.23 | 0.99 |
| llama 3B Q4_0_4_4 | 2 | pp512 | 43.25 | 52.40 | 1.21 |
| llama 3B Q4_0_4_4 | 2 | tg128 | 18.00 | 17.91 | 0.99 |
| llama 3B Q4_0_4_4 | 4 | pp512 | 86.82 | 103.99 | 1.20 |
| llama 3B Q4_0_4_4 | 4 | tg128 | 33.98 | 33.77 | 0.99 |
| llama 3B Q4_0_4_4 | 8 | pp512 | 164.33 | 195.36 | 1.19 |
| llama 3B Q4_0_4_4 | 8 | tg128 | 57.56 | 56.85 | 0.99 |
| llama 3B Q4_0_4_4 | 16 | pp512 | 299.43 | 354.50 | 1.18 |
| llama 3B Q4_0_4_4 | 16 | tg128 | 85.20 | 83.95 | 0.99 |
| qwen2 3B Q4_0_4_4 | 2 | pp512 | 44.64 | 54.15 | 1.21 |
| qwen2 3B Q4_0_4_4 | 2 | tg128 | 18.73 | 18.41 | 0.98 |
| qwen2 3B Q4_0_4_4 | 4 | pp512 | 89.75 | 107.05 | 1.19 |
| qwen2 3B Q4_0_4_4 | 4 | tg128 | 34.68 | 35.23 | 1.02 |
| qwen2 3B Q4_0_4_4 | 8 | pp512 | 169.23 | 201.48 | 1.19 |
| qwen2 3B Q4_0_4_4 | 8 | tg128 | 57.42 | 58.53 | 1.02 |
| qwen2 3B Q4_0_4_4 | 16 | pp512 | 307.74 | 366.04 | 1.19 |
| qwen2 3B Q4_0_4_4 | 16 | tg128 | 83.97 | 83.80 | 1.00 |

@angt (Collaborator, Author) commented on Dec 9, 2024

@max-krasnyansky we can wait for a better version of the PR; I need to dig into the PP case.
