
Conversation

@Alcpz (Contributor) commented Oct 23, 2025

This PR improves the q4_k_q8_k GEMM and GEMV implementations on arm64 using the i8mm and vecdot instructions.
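For context, here is a minimal sketch of the two instruction families involved. This is not the PR's actual kernel: function and variable names are illustrative, and the Q4_K scale/min handling is omitted.

```c
// Build with e.g. -march=armv8.2-a+dotprod+i8mm
#include <arm_neon.h>

// GEMV-style path: vdotq_s32 folds 16 int8 products into 4 int32 lanes,
// each lane accumulating a 4-element dot product. Requires +dotprod.
static inline int32x4_t acc_vecdot(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);
}

// GEMM-style path: vmmlaq_s32 multiplies a 2x8 int8 block by an 8x2 int8
// block and accumulates the resulting 2x2 int32 tile in one instruction.
// Requires +i8mm.
static inline int32x4_t acc_i8mm(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b);
}
```

In repack kernels of this kind, rows are typically interleaved so that each vmmlaq_s32 produces a full 2x2 output tile, with the integer accumulators scaled afterwards by the Q4_K block scales.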

Tested on an Apple M4 with the Liquid LFM2-1.2B model:

./bin/llama-bench -p 256 -n 128 -pg 0,0 -t 8 -m models/LFM2-1.2B-Q4_K_M.gguf,models/LFM2-1.2B-Q4_K_pure.gguf
| model | backend | test | t/s (master) | t/s (this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K - Medium | CPU | pp256 | 436.57 ± 0.40 | 673.30 ± 2.56 | 1.54 |
| lfm2 1.2B Q4_K - Medium | CPU | tg128 | 217.84 ± 8.17 | 229.91 ± 1.22 | 1.06 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | pp256 | 462.25 ± 0.67 | 800.99 ± 3.61 | 1.73 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | tg128 | 241.74 ± 1.47 | 254.42 ± 2.42 | 1.05 |
| llama 8B Q4_K - Medium | CPU | pp256 | 62.43 ± 1.19 | 99.52 ± 0.11 | 1.54 |
| llama 8B Q4_K - Medium | CPU | tg128 | 36.70 ± 0.70 | 42.47 ± 0.32 | 1.15 |

Master build: 8cf6b42 (6824)
This PR: c4f1358

Perplexity remains unchanged (tested current build vs master):

Llama3.1: 7.8861 +/- 0.11849 
LFM2 1.2B: 16.9954 +/- 0.97671

As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.

Any suggestions on how to better test the PR are welcome.

Edit: CI failures seem completely unrelated.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Oct 23, 2025
Alcpz changed the title from "ggml-cpu: arm64: q4_K repack gemm and gemv implementations" to "ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm)" Oct 27, 2025
Signed-off-by: Alberto Cabrera <[email protected]>
ggerganov previously approved these changes Oct 27, 2025

```c
    q4sb_scales[i] = vmovl_s8(vld1_s8(aux_q4sb));
}

const uint8_t *q4_base = q4_ptr[b].qs + sb * QK_K;
```
Review comment (Member):

Fix a few instances of this code style:

Suggested change:

```diff
-const uint8_t *q4_base = q4_ptr[b].qs + sb * QK_K;
+const uint8_t * q4_base = q4_ptr[b].qs + sb * QK_K;
```

Alcpz (Contributor Author) replied:

Applied clang-format. Sorry about that!

@Alcpz (Contributor Author) commented Oct 31, 2025

@ggerganov is there something else needed from my side, or are we waiting for another review?

@ggerganov (Member) commented:

There seems to be a bug somewhere. Here is a repro on an M4 Max:

../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none

...

# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C

@Alcpz (Contributor Author) commented Oct 31, 2025

I was able to replicate the PPL skyrocketing with the generic implementation as well:

# ggml_gemm_q4_K_8x8_q8_K_generic
perplexity: 34.48 seconds per pass - ETA 1.43 minutes
[1]9.6770,[2]1762.7802,[3]9505.4348,[4]22802.6452,[5]5311.2750,[6]10333.9703,[7]16582.8044,[8]23315.3388,[9]11093.7993,[10]14942.7293,

# ggml_gemm_q4_K_8x8_q8_K
perplexity: 2.71 seconds per pass - ETA 0.10 minutes
[1]9.7353,[2]1764.9156,[3]9519.3014,[4]22839.7651,[5]5320.7637,[6]10348.6530,[7]16591.6868,[8]23311.9378

I'll try to figure out what is going on.

Edit:

# Q4_0 Model
perplexity: 1.84 seconds per pass - ETA 0.07 minutes
[1]9.9763,[2]1820.5697,[3]9757.8288,[4]23501.0590,[5]5479.2732,[6]10610.0991,[7]17050.2390,[8]23943.4191,[9]11327.5779,[10]15263.4054,

Also happens with Q4_0 repack. Interesting that it happens from the second chunk onwards. I'll try to run on an AVX machine and see if it's something totally unrelated to the GEMMs themselves.

I also compared the tensor outputs of all mul_mat ops for a couple of llama-eval-callback runs, and the results were nearly identical, except for a 0.0001 deviation here and there.
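For reference, a sketch of that kind of comparison (the model path, prompt, and log names are illustrative; one build with GGML_CPU_REPACK enabled, one master/non-repack build):

./bin/llama-eval-callback -m models/LFM2-1.2B-Q4_K_M.gguf -p "Hello" > repack.log 2>&1
# repeat with the master / non-repack build, then compare the dumped tensors
diff repack.log master.log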

What I don't understand is how I was able to run the PPL with LFM correctly earlier; I may have messed up GGML_CPU_REPACK in the build. Sorry about that.

@ggerganov (Member) commented:

Hm yes - Q4_0 with LFM is indeed also problematic. However, Q4_0 with llama 3.1 8B is good, so this means there is a bug that occurs only for certain shapes.

