
Conversation

@angt (Collaborator) commented on Dec 2, 2024

Hi!
It's the same kind of PR as for ggml_gemv_q4_0_4x4_q8_0().
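
For readers following along, here is a minimal, self-contained sketch of the general technique: computing the int8 dot-product building block with NEON intrinsics (the SDOT instruction via `vdotq_s32`) instead of hand-written asm. This is illustrative only and not the PR's code; the real `ggml_gemm_q4_0_4x4_q8_0()` works on 4x4-interleaved Q4_0 blocks against Q8_0 activations. It assumes a dotprod-capable target (e.g. built with `-march=armv8.2-a+dotprod`).

```c
// Toy illustration (not the PR's kernel): the int8 dot product that the
// q4_0 x q8_0 GEMM/GEMV kernels build on, written with NEON intrinsics
// rather than inline asm. Requires a dotprod-capable AArch64 target.
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

// Sums a[i]*b[i] over 16 signed bytes using SDOT via vdotq_s32.
static int32_t dot16_i8(const int8_t * a, const int8_t * b) {
    int8x16_t va  = vld1q_s8(a);
    int8x16_t vb  = vld1q_s8(b);
    int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);
    return vaddvq_s32(acc); // horizontal add of the 4 partial sums
}

int main(void) {
    int8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = (int8_t) i; b[i] = (int8_t) (i + 1); }
    printf("%d\n", dot16_i8(a, b)); // prints 1360
    return 0;
}
```

With intrinsics, register allocation and instruction scheduling are left to the compiler, which is the usual argument for dropping inline asm.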

@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Dec 2, 2024
@angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics-in-ggml_gemm_q4_0_4x4_q8_0 branch from 34e3241 to a42dca4 on December 2, 2024 at 21:27
@max-krasnyansky (Collaborator)

Looks good to me.

@ggerganov (Member)

After adding the missing return statement and applying the patch below to force the Q4_0_4_4 packing, I get these results on M2 Ultra:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 11152385e..675c5d8e9 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3808,7 +3808,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             return GGML_TYPE_Q4_0_4_4;
```

```sh
make -j llama-bench && ./bin/llama-bench -m ../models/llama-3.2-1b-instruct/ggml-model-q4_0.gguf -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -t 2,4,8,16 -p 128 -n 32 -r 10
```
| model | size | backend | threads | test | master t/s | PR t/s |
| --- | --- | --- | ---: | --- | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | CPU | 2 | pp128 | 274.96 ± 4.16 | 245.61 ± 3.91 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 2 | tg32 | 80.76 ± 0.11 | 79.67 ± 0.23 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 4 | pp128 | 537.08 ± 9.66 | 479.83 ± 3.89 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 4 | tg32 | 145.58 ± 0.39 | 145.09 ± 0.50 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 8 | pp128 | 899.55 ± 12.69 | 875.52 ± 7.99 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 8 | tg32 | 207.08 ± 0.36 | 203.30 ± 0.54 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 16 | pp128 | 1106.11 ± 81.65 | 1499.54 ± 52.45 |
| llama 1B Q4_0 | 727.75 MiB | CPU | 16 | tg32 | 211.66 ± 5.73 | 213.04 ± 2.08 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 2 | pp128 | 96.39 ± 0.87 | 85.91 ± 1.27 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 2 | tg32 | 35.39 ± 0.27 | 35.00 ± 0.19 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 4 | pp128 | 194.96 ± 0.30 | 166.19 ± 0.77 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 4 | tg32 | 65.25 ± 0.22 | 64.41 ± 0.13 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 8 | pp128 | 332.51 ± 5.40 | 315.59 ± 2.74 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 8 | tg32 | 91.37 ± 0.17 | 91.83 ± 0.21 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 16 | pp128 | 504.89 ± 4.86 | 589.85 ± 3.01 |
| llama 3B Q4_0 | 1.78 GiB | CPU | 16 | tg32 | 89.63 ± 1.62 | 92.72 ± 1.03 |

build: a42dca4 (4240)

Not sure if these PP speed variations are anything other than noise.

Perplexity remains the same.

@angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics-in-ggml_gemm_q4_0_4x4_q8_0 branch from a42dca4 to 1f6855f on December 3, 2024 at 13:25
@angt (Collaborator, Author) commented on Dec 3, 2024

The inline asm version looks much more unrolled than what we get with this PR and `-O3`.
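
If it helps to experiment, here is a hedged sketch of one way to push the compiler toward similar unrolling; the loop and names are placeholders, not the actual kernel. GCC honors `#pragma GCC unroll N` on the following loop (clang has `#pragma clang loop unroll_count(N)`).

```c
// Illustrative only: a stand-in for the inner accumulation loop, with an
// explicit unroll hint. Not the real ggml_gemm_q4_0_4x4_q8_0() code.
#include <arm_neon.h>
#include <stdint.h>

int32_t sum_blocks(const int8_t * a, const int8_t * b, int nb) {
    int32x4_t acc = vdupq_n_s32(0);
    #pragma GCC unroll 4                   // ask GCC to unroll 4 iterations
    for (int i = 0; i < nb; ++i) {
        int8x16_t va = vld1q_s8(a + 16*i);
        int8x16_t vb = vld1q_s8(b + 16*i);
        acc = vdotq_s32(acc, va, vb);      // SDOT accumulate (needs +dotprod)
    }
    return vaddvq_s32(acc);
}
```

Comparing the disassembly (`objdump -d`, or building with `-S`) of the old asm path and the intrinsics path is probably the quickest way to see where the compiler stops unrolling.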

@ggerganov (Member)

How is the performance before and after on your end?

@angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics-in-ggml_gemm_q4_0_4x4_q8_0 branch from 1f6855f to c0df25a on December 9, 2024 at 19:34
@max-krasnyansky (Collaborator)

@angt let me double check the latest on the X-Elite real quick and we'll merge it.

@angt (Collaborator, Author) commented on Dec 9, 2024

I confirm that the t/s is better with the inline asm version (b4239 is 991f8aab) for PP :(

| Model | Threads | Test | t/s 1f6855f | t/s 991f8aa | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0_4_4 | 2 | pp512 | 119.48 | 142.03 | 1.19 |
| llama 1B Q4_0_4_4 | 2 | tg128 | 39.80 | 39.97 | 1.00 |
| llama 1B Q4_0_4_4 | 4 | pp512 | 241.24 | 282.65 | 1.17 |
| llama 1B Q4_0_4_4 | 4 | tg128 | 74.77 | 75.40 | 1.01 |
| llama 1B Q4_0_4_4 | 8 | pp512 | 451.38 | 530.47 | 1.18 |
| llama 1B Q4_0_4_4 | 8 | tg128 | 124.83 | 127.18 | 1.02 |
| llama 1B Q4_0_4_4 | 16 | pp512 | 821.19 | 962.76 | 1.17 |
| llama 1B Q4_0_4_4 | 16 | tg128 | 188.16 | 186.23 | 0.99 |
| llama 3B Q4_0_4_4 | 2 | pp512 | 43.25 | 52.40 | 1.21 |
| llama 3B Q4_0_4_4 | 2 | tg128 | 18.00 | 17.91 | 0.99 |
| llama 3B Q4_0_4_4 | 4 | pp512 | 86.82 | 103.99 | 1.20 |
| llama 3B Q4_0_4_4 | 4 | tg128 | 33.98 | 33.77 | 0.99 |
| llama 3B Q4_0_4_4 | 8 | pp512 | 164.33 | 195.36 | 1.19 |
| llama 3B Q4_0_4_4 | 8 | tg128 | 57.56 | 56.85 | 0.99 |
| llama 3B Q4_0_4_4 | 16 | pp512 | 299.43 | 354.50 | 1.18 |
| llama 3B Q4_0_4_4 | 16 | tg128 | 85.20 | 83.95 | 0.99 |
| qwen2 3B Q4_0_4_4 | 2 | pp512 | 44.64 | 54.15 | 1.21 |
| qwen2 3B Q4_0_4_4 | 2 | tg128 | 18.73 | 18.41 | 0.98 |
| qwen2 3B Q4_0_4_4 | 4 | pp512 | 89.75 | 107.05 | 1.19 |
| qwen2 3B Q4_0_4_4 | 4 | tg128 | 34.68 | 35.23 | 1.02 |
| qwen2 3B Q4_0_4_4 | 8 | pp512 | 169.23 | 201.48 | 1.19 |
| qwen2 3B Q4_0_4_4 | 8 | tg128 | 57.42 | 58.53 | 1.02 |
| qwen2 3B Q4_0_4_4 | 16 | pp512 | 307.74 | 366.04 | 1.19 |
| qwen2 3B Q4_0_4_4 | 16 | tg128 | 83.97 | 83.80 | 1.00 |

@angt (Collaborator, Author) commented on Dec 9, 2024

@max-krasnyansky we can wait for a better version of the PR; I need to dig into the PP case.
