
@hongyang-7

This PR improves the q4_K_q8_K kernel with block-repacking support for the AArch64 architecture, based on NEON.

The following structures and functions are implemented:

- new quant type: `block_q4_Kx4`, built from four `q4_K` blocks, along with an offline repacking function (a layout sketch follows this list)
- new quantize path: a NEON implementation for `block_q8_Kx4` in `ggml_quantize_mat_q8_K_4x8()`
- new GEMV kernel: `ggml_gemv_q4_K_4x8_q8_K()`, a NEON kernel for the `GGML_OP_MUL_MAT_ID`/`GGML_OP_MUL_MAT` ops
- new GEMM kernel: `ggml_gemm_q4_K_4x8_q8_K()`, a NEON kernel for the `GGML_OP_MUL_MAT_ID`/`GGML_OP_MUL_MAT` ops
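For orientation, here is a minimal sketch of what the interleaved four-block layout and the NEON primitive behind the new kernels could look like. The field sizes follow the upstream `block_q4_K` definition (QK_K = 256, 12 bytes of packed 6-bit sub-block scales per block); the exact `block_q4_Kx4` layout and interleaving order in this PR may differ.

```c
#include <arm_neon.h>
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half;  // fp16 storage type, as in ggml

// Hypothetical sketch of the interleaved layout (the PR's actual struct
// may differ): four q4_K super-blocks are fused so that a single NEON
// load fetches matching bytes from four rows at once.
typedef struct {
    ggml_half d[4];       // super-block scales of the four source blocks
    ggml_half dmin[4];    // super-block mins of the four source blocks
    uint8_t scales[48];   // 4 x 12 bytes of packed 6-bit sub-block scales/mins
    uint8_t qs[QK_K * 2]; // 4 x 128 bytes of 4-bit quants, byte-interleaved
} block_q4_Kx4;

// The GEMV/GEMM inner loops reduce to this DOTPROD primitive (the system
// info below reports DOTPROD = 1): vdotq_s32 multiplies 16 int8 pairs and
// accumulates them into 4 int32 lanes.
static inline int32x4_t dot16_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);  // requires __ARM_FEATURE_DOTPROD
}
```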

## Test environment

- Server: Neoverse-N2
- System info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
- Models: two models of different scales

| model | storage size | param size | quant |
|---|---|---|---|
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

## Bench results

Good gains were observed with this PR for both S_PP and S_TG:

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

```sh
./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
```
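In the tables below, B is the number of parallel sequences (set via `-npl`), S_PP is prompt-processing throughput, and S_TG is text-generation throughput, both in tokens per second.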
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 168.99 | 258.42 | 152.9% | 36.34 | 35.91 | 98.8% |
| 4 | 178.88 | 273.85 | 153.1% | 76.84 | 95.93 | 124.8% |
| 8 | 180.94 | 280.88 | 155.2% | 102.88 | 125.94 | 122.4% |
| 16 | 180.77 | 280.69 | 155.3% | 127.70 | 174.44 | 136.6% |
| 32 | 180.65 | 280.71 | 155.4% | 139.46 | 194.32 | 139.3% |
| geomean | | | 154.4% | | | 123.5% |

(2) DeepSeek-V3-Q4_k_M.gguf

```sh
./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
```
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 24.17 | 30.13 | 124.7% | 6.52 | 6.46 | 99.1% |
| 4 | 25.36 | 33.13 | 130.6% | 12.18 | 12.65 | 103.9% |
| 8 | 25.43 | 33.15 | 130.4% | 14.85 | 15.41 | 103.8% |
| 16 | 25.41 | 33.12 | 130.3% | 16.76 | 17.72 | 105.7% |
| 32 | 25.40 | 33.10 | 130.3% | 18.19 | 19.82 | 109.0% |
| geomean | | | 129.2% | | | 104.2% |

## Perplexity

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

| model | perplexity (Final estimate PPL) | commit id |
|---|---|---|
| original | 3.7533 +/- 0.14294 | 77dee9d |
| this PR | 3.7589 +/- 0.14312 | 543e8eb |

(2) DeepSeek-V3-Q4_k_M.gguf

| model | perplexity (Final estimate PPL) | commit id |
|---|---|---|
| original | 1.0396 +/- 0.00654 | 77dee9d |
| this PR | 1.0370 +/- 0.00611 | 543e8eb |
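The evaluation text used for these runs is not stated in the PR. Numbers of this form typically come from the standard `llama-perplexity` tool; a representative invocation (a sketch only, with `eval-text.txt` as a placeholder corpus) would be:

```sh
./bin/llama-perplexity -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -f eval-text.txt -t 64
```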

## References

1. Similar repack patch for q4_K on x86: Block interleaving support for Q4_K quantization for x86 AVX2 architecture #12332
   Note: the x86 patch shares the `block_q8_Kx4` structure with this patch, but the detailed layout is different.
2. Similar repack idea for q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization #5780


Co-authored-by: yuanjia111 <[email protected]>