arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… #15277
base: master
Conversation
Btw, it's probably a better idea to implement GEMM improvements through the repack mechanism in ggml. It would give you more flexibility for rearranging the data to better fit the instructions.
r1 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_h), svreinterpret_s64_s8(q8bytes_1_h)));
r2 = svreinterpret_s8_s64(svzip1_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l)));
r3 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l)));
sumi2 = svmmla_s32(svmmla_s32(svmmla_s32(svmmla_s32(svdup_n_s32(0), r0, l0), r1, l1), r2, l2), r3, l3);
Does svmmla_s32 require a check for __ARM_FEATURE_SVE_MATMUL_INT8?
Yes. svmmla_s32() is an intrinsic for an SVE instruction, and it requires the I8MM feature on the CPU.
Could you please explain what the repack mechanism refers to, or point me to the relevant implementation details?
@ggerganov
This PR improves the q4_K_q8_K and q6_K_q8_K GEMM kernels using the arm64 i8mm instructions with SVE.
A similar proposal for NEON support was made in PR #13886.
Since it uses SVE instructions, it improves performance even on machines with a SIMD width of 128 bits or more.
Verifying features
This PR contains the SVE implementation of the vector dot product used to compute the Q4_K quantization.
By running a Q4_K_M-quantized model of Llama-3.1-8B, I confirmed that the output values match.
I also verified that the perplexity matches between the NEON and SVE implementations.
Performance check
Performance was measured on AWS Graviton3 and improves as follows (measured with llama-bench):
original
This PR