ggml : block repack support for Q4_K quantization for AArch64 architecture #15719
This PR improves the q4_k_q8_k kernel by adding block repacking support for the AArch64 architecture, based on NEON.
The following structures and functions are implemented:
- block_q4_kx4, based on four q4_k blocks, along with an offline repacking function
- block_q8_Kx4 in ggml_quantize_mat_q8_K_4x8()
- ggml_gemv_q4_K_4x8_q8_K(), a NEON kernel for GGML_OP_MUL_MAT_ID / GGML_OP_MUL_MAT ops
- ggml_gemm_q4_K_4x8_q8_K(), a NEON kernel for GGML_OP_MUL_MAT_ID / GGML_OP_MUL_MAT ops

Test environment
Bench results
Good gains were observed with this PR, for both S_PP and S_TG:
(1) meta-llama-3-8b-instruct.Q4_K_M.gguf
[Benchmark tables not preserved; columns: (original), (this PR), speedup]
(2) DeepSeek-V3-Q4_k_M.gguf
[Benchmark tables not preserved; columns: (original), (this PR), speedup]
Perplexity
(1) meta-llama-3-8b-instruct.Q4_K_M.gguf
(2) DeepSeek-V3-Q4_k_M.gguf
Reference
PS: the x86 patch shares the same structure block_q8_Kx4 with this patch, but the detailed layout is different.
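
For readers unfamiliar with block repacking, here is a minimal scalar sketch of the general idea. It does not reproduce the actual block_q4_kx4 or block_q8_Kx4 layout from this PR. It assumes a simplified scheme in which four row blocks are interleaved in 8-byte chunks, so that a kernel can read the data for four output rows from one contiguous stream. All names (toy_block, toy_block_x4, toy_repack_4x8, toy_gemv_4x8) and sizes are hypothetical, and scales, mins and 4-bit unpacking are left out on purpose.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ROWS        4   /* rows packed together (assumed meaning of the "4" in 4x8) */
#define TOY_CHUNK       8   /* bytes taken from each row per step (assumed meaning of the "8") */
#define TOY_BLOCK_BYTES 32  /* hypothetical payload size of one row block */

/* Hypothetical stand-in for one quantized block belonging to a single row. */
typedef struct {
    uint8_t qs[TOY_BLOCK_BYTES];
} toy_block;

/* Hypothetical stand-in for four row blocks stored interleaved. */
typedef struct {
    uint8_t qs[TOY_ROWS * TOY_BLOCK_BYTES];
} toy_block_x4;

/* Offline repack: chunk 0 of rows 0..3, then chunk 1 of rows 0..3, and so on. */
static void toy_repack_4x8(toy_block_x4 *dst, const toy_block *src) {
    assert(dst != NULL && src != NULL);
    const int chunks = TOY_BLOCK_BYTES / TOY_CHUNK;
    for (int c = 0; c < chunks; ++c) {
        for (int r = 0; r < TOY_ROWS; ++r) {
            memcpy(dst->qs + (c * TOY_ROWS + r) * TOY_CHUNK,
                   src[r].qs + c * TOY_CHUNK,
                   TOY_CHUNK);
        }
    }
}

/* Scalar reference GEMV over the interleaved layout: one pass over the
 * packed bytes produces the dot products of all four rows with x. */
static void toy_gemv_4x8(int32_t *out, const toy_block_x4 *w, const int8_t *x) {
    const int chunks = TOY_BLOCK_BYTES / TOY_CHUNK;
    for (int r = 0; r < TOY_ROWS; ++r) {
        out[r] = 0;
    }
    for (int c = 0; c < chunks; ++c) {
        for (int r = 0; r < TOY_ROWS; ++r) {
            const uint8_t *chunk = w->qs + (c * TOY_ROWS + r) * TOY_CHUNK;
            for (int k = 0; k < TOY_CHUNK; ++k) {
                out[r] += (int32_t) chunk[k] * (int32_t) x[c * TOY_CHUNK + k];
            }
        }
    }
}
```

Because the same activation chunk x[c * 8 .. c * 8 + 7] is reused for four consecutive weight chunks, a SIMD kernel can keep it in a register across the four rows. Two implementations can agree on the struct while ordering the chunks differently, which is one way the "detailed layout" of a shared structure can differ.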