
Conversation

@Alcpz (Contributor) commented Oct 23, 2025

This PR improves the q4_k_q8_k GEMM and GEMV implementations on arm64 using the i8mm and vecdot instructions.
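For context, here is a minimal sketch of the two instruction families involved. This is not the PR's actual kernel: function and variable names are illustrative, and the Q4_K scale/min handling is omitted.

```c
// Build with e.g. -march=armv8.2-a+dotprod+i8mm
#include <arm_neon.h>

// GEMV-style path: vdotq_s32 folds 16 int8 products into 4 int32 lanes,
// each lane accumulating a 4-element dot product. Requires +dotprod.
static inline int32x4_t acc_vecdot(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);
}

// GEMM-style path: vmmlaq_s32 multiplies a 2x8 int8 block by an 8x2 int8
// block and accumulates the resulting 2x2 int32 tile in one instruction.
// Requires +i8mm.
static inline int32x4_t acc_i8mm(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b);
}
```

In repack kernels of this kind, rows are typically interleaved so that each vmmlaq_s32 produces a full 2x2 output tile, with the integer accumulators scaled afterwards by the Q4_K block scales.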

Tested on an Apple M4 with the Liquid LFM2-1.2B model:

./bin/llama-bench -p 256 -n 128 -pg 0,0 -t 8 -m models/LFM2-1.2B-Q4_K_M.gguf,models/LFM2-1.2B-Q4_K_pure.gguf
| model | backend | test | t/s (master) | t/s (this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K - Medium | CPU | pp256 | 436.57 ± 0.40 | 673.30 ± 2.56 | 1.54 |
| lfm2 1.2B Q4_K - Medium | CPU | tg128 | 217.84 ± 8.17 | 229.91 ± 1.22 | 1.06 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | pp256 | 462.25 ± 0.67 | 800.99 ± 3.61 | 1.73 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | tg128 | 241.74 ± 1.47 | 254.42 ± 2.42 | 1.05 |
| llama 8B Q4_K - Medium | CPU | pp256 | 62.43 ± 1.19 | 99.52 ± 0.11 | 1.54 |
| llama 8B Q4_K - Medium | CPU | tg128 | 36.70 ± 0.70 | 42.47 ± 0.32 | 1.15 |

Master build: 8cf6b42 (6824)
This PR: c4f1358

Perplexity remains unchanged (tested current build vs master):

Llama3.1: 7.8861 +/- 0.11849 
LFM2 1.2B: 16.9954 +/- 0.97671

As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.

Any suggestions on how to better test the PR are welcome.

Edit: CI failures seem completely unrelated.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Oct 23, 2025
Alcpz changed the title from "ggml-cpu: arm64: q4_K repack gemm and gemv implementations" to "ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm)" Oct 27, 2025
Signed-off-by: Alberto Cabrera <[email protected]>
ggerganov previously approved these changes Oct 27, 2025

```c
    q4sb_scales[i] = vmovl_s8(vld1_s8(aux_q4sb));
}

const uint8_t *q4_base = q4_ptr[b].qs + sb * QK_K;
```
Review comment (Member):

Fix a few instances of this code style:

Suggested change:

```diff
-const uint8_t *q4_base = q4_ptr[b].qs + sb * QK_K;
+const uint8_t * q4_base = q4_ptr[b].qs + sb * QK_K;
```

Alcpz (Contributor Author) replied:

Applied clang-format. Sorry about that!

@Alcpz (Contributor Author) commented Oct 31, 2025

@ggerganov is there something else needed from my side, or are we waiting for another review?

@ggerganov (Member) commented:

There seems to be a bug somewhere. Here is a repro on an M4 Max:

../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none

...

# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C

@Alcpz (Contributor Author) commented Oct 31, 2025

I was able to replicate the PPL skyrocketing with the generic implementation as well:

# ggml_gemm_q4_K_8x8_q8_K_generic
perplexity: 34.48 seconds per pass - ETA 1.43 minutes
[1]9.6770,[2]1762.7802,[3]9505.4348,[4]22802.6452,[5]5311.2750,[6]10333.9703,[7]16582.8044,[8]23315.3388,[9]11093.7993,[10]14942.7293,

# ggml_gemm_q4_K_8x8_q8_K
perplexity: 2.71 seconds per pass - ETA 0.10 minutes
[1]9.7353,[2]1764.9156,[3]9519.3014,[4]22839.7651,[5]5320.7637,[6]10348.6530,[7]16591.6868,[8]23311.9378

I'll try to figure out what is going on.

Edit:

# Q4_0 Model
perplexity: 1.84 seconds per pass - ETA 0.07 minutes
[1]9.9763,[2]1820.5697,[3]9757.8288,[4]23501.0590,[5]5479.2732,[6]10610.0991,[7]17050.2390,[8]23943.4191,[9]11327.5779,[10]15263.4054,

Also happens with Q4_0 repack. Interesting that it happens from the second chunk onwards. I'll try to run on an AVX machine and see if it's something totally unrelated to the GEMMs themselves.

I also compared the tensor outputs of all mul_mat ops for a couple of llama-eval-callback runs, and the results were nearly identical, except for a 0.0001 deviation here and there.
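For reference, a sketch of that kind of comparison (the model path, prompt, and log names are illustrative; one build with GGML_CPU_REPACK enabled, one master/non-repack build):

./bin/llama-eval-callback -m models/LFM2-1.2B-Q4_K_M.gguf -p "Hello" > repack.log 2>&1
# repeat with the master / non-repack build, then compare the dumped tensors
diff repack.log master.log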

What I don't understand is how I was able to run the PPL with LFM correctly earlier; I may have messed up GGML_CPU_REPACK in the build. Sorry about that.

@ggerganov (Member) commented:

Hm yes - Q4_0 with LFM is indeed also problematic. However, Q4_0 with llama 3.1 8B is good, so this means there is a bug that occurs only for certain shapes.

