
@hongyang-7

This PR improves the q4_K_q8_K kernel with block-repacking support for the AArch64 architecture, based on NEON.

The following structures and functions are implemented:

- new quant type: `block_q4_Kx4`, built from four `q4_K` blocks, along with an offline repacking function (a layout sketch follows this list)
- new quantize path: a NEON implementation for `block_q8_Kx4` in `ggml_quantize_mat_q8_K_4x8()`
- new GEMV kernel: `ggml_gemv_q4_K_4x8_q8_K()`, a NEON kernel for the `GGML_OP_MUL_MAT_ID`/`GGML_OP_MUL_MAT` ops
- new GEMM kernel: `ggml_gemm_q4_K_4x8_q8_K()`, a NEON kernel for the `GGML_OP_MUL_MAT_ID`/`GGML_OP_MUL_MAT` ops
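For orientation, here is a minimal sketch of what the interleaved four-block layout and the NEON primitive behind the new kernels could look like. The field sizes follow the upstream `block_q4_K` definition (QK_K = 256, 12 bytes of packed 6-bit sub-block scales per block); the exact `block_q4_Kx4` layout and interleaving order in this PR may differ.

```c
#include <arm_neon.h>
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half;  // fp16 storage type, as in ggml

// Hypothetical sketch of the interleaved layout (the PR's actual struct
// may differ): four q4_K super-blocks are fused so that a single NEON
// load fetches matching bytes from four rows at once.
typedef struct {
    ggml_half d[4];       // super-block scales of the four source blocks
    ggml_half dmin[4];    // super-block mins of the four source blocks
    uint8_t scales[48];   // 4 x 12 bytes of packed 6-bit sub-block scales/mins
    uint8_t qs[QK_K * 2]; // 4 x 128 bytes of 4-bit quants, byte-interleaved
} block_q4_Kx4;

// The GEMV/GEMM inner loops reduce to this DOTPROD primitive (the system
// info below reports DOTPROD = 1): vdotq_s32 multiplies 16 int8 pairs and
// accumulates them into 4 int32 lanes.
static inline int32x4_t dot16_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);  // requires __ARM_FEATURE_DOTPROD
}
```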

## Test environment

- Server: Neoverse-N2
- System info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
- Models: two models of different scales

| model | storage size | param size | quant |
|---|---|---|---|
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

## Bench results

Good gains were observed with this PR for both S_PP and S_TG:

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

```sh
./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
```
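In the tables below, B is the number of parallel sequences (set via `-npl`), S_PP is prompt-processing throughput, and S_TG is text-generation throughput, both in tokens per second.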
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 168.99 | 258.42 | 152.9% | 36.34 | 35.91 | 98.8% |
| 4 | 178.88 | 273.85 | 153.1% | 76.84 | 95.93 | 124.8% |
| 8 | 180.94 | 280.88 | 155.2% | 102.88 | 125.94 | 122.4% |
| 16 | 180.77 | 280.69 | 155.3% | 127.70 | 174.44 | 136.6% |
| 32 | 180.65 | 280.71 | 155.4% | 139.46 | 194.32 | 139.3% |
| geomean | | | 154.4% | | | 123.5% |

(2) DeepSeek-V3-Q4_k_M.gguf

```sh
./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
```
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 24.17 | 30.13 | 124.7% | 6.52 | 6.46 | 99.1% |
| 4 | 25.36 | 33.13 | 130.6% | 12.18 | 12.65 | 103.9% |
| 8 | 25.43 | 33.15 | 130.4% | 14.85 | 15.41 | 103.8% |
| 16 | 25.41 | 33.12 | 130.3% | 16.76 | 17.72 | 105.7% |
| 32 | 25.40 | 33.10 | 130.3% | 18.19 | 19.82 | 109.0% |
| geomean | | | 129.2% | | | 104.2% |

## Perplexity

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

| model | perplexity (Final estimate PPL) | commit id |
|---|---|---|
| original | 3.7533 +/- 0.14294 | 77dee9d |
| this PR | 3.7589 +/- 0.14312 | 543e8eb |

(2) DeepSeek-V3-Q4_k_M.gguf

| model | perplexity (Final estimate PPL) | commit id |
|---|---|---|
| original | 1.0396 +/- 0.00654 | 77dee9d |
| this PR | 1.0370 +/- 0.00611 | 543e8eb |
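The evaluation text used for these runs is not stated in the PR. Numbers of this form typically come from the standard `llama-perplexity` tool; a representative invocation (a sketch only, with `eval-text.txt` as a placeholder corpus) would be:

```sh
./bin/llama-perplexity -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -f eval-text.txt -t 64
```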

## References

1. Similar repack patch for q4_K on x86: Block interleaving support for Q4_K quantization for x86 AVX2 architecture #12332
   Note: the x86 patch shares the `block_q8_Kx4` structure with this patch, but the detailed layout is different.
2. Similar repack idea for q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization #5780


Co-authored-by: yuanjia111 <[email protected]>