ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361

fj-y-saito · 2025-03-13T08:14:18Z

This PR introduces support for SVE(Scalable Vector Extensions) kernels for the q6_K_q8_K vector dot on the Arm architecture. A similar proposal for SVE support is made in PR #11227.

Verifying Features

This PR contains the SVE implementation of the vector dot used to compute the Q4_K quantization.
By running a Q4_K_M quantized model of Llama-3.1-8B, I confirmed that the values match.
The values of the NEON and SVE implementations were compared one after another, and it was confirmed that the values always match.
I also verified that the perplexity matches between the NEON and SVE implementations.

NEON	SVE (this PR)
6.5778 +/- 0.04061	6.5778 +/- 0.04061

performance check

Performance was measured with FX700.
Performance is improved as follows (measured with llama-bench).

original

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp1 |          2.60   0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp2 |          3.07   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp4 |          3.28   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp8 |          3.42   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         pp512 |          3.39   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         tg128 |          2.60   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp1 |          5.24   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp2 |          6.01   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp4 |          6.51   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp8 |          6.80   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         pp512 |          6.74   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         tg128 |          5.17   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp1 |          7.38   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp2 |          8.40   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp4 |          9.19   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp8 |          9.64   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         pp512 |         10.07   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         tg128 |          7.26   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp1 |          9.38   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp2 |         10.74   0.07 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp4 |         11.95   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp8 |         12.63   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         pp512 |         13.35   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         tg128 |          9.21   0.01 |

This PR

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp1 |          4.02   0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp2 |          4.50   0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp4 |          4.72   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp8 |          4.85   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         pp512 |          4.58   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         tg128 |          3.97   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp1 |          7.78   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp2 |          8.69   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp4 |          9.24   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp8 |          9.55   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         pp512 |          9.10   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         tg128 |          7.60   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp1 |         10.60   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp2 |         12.11   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp4 |         13.10   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp8 |         13.66   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         pp512 |         13.56   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         tg128 |         10.38   0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp1 |         13.15   0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp2 |         15.19   0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp4 |         16.70   0.28 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp8 |         17.73   0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         pp512 |         17.95   0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         tg128 |         12.80   0.05 |

ggerganov · 2025-03-13T14:06:22Z

ggml/src/ggml-cpu/ggml-cpu-quants.c

+                        isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_1, q8bytes_1), scale_lane_1);
+                        isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_2, q8bytes_2), scale_lane_2);
+                        isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_3, q8bytes_3), scale_lane_3);
+                        isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_4, q8bytes_4), scale_lane_4);


Maybe something to try is to have 4 separate accumulators here. Don't have a machine that supports SVE to give this a try.

When I implemented this fix and measured the elapsed time of ggml_vec_dot_q6_K_q8_K in perf, I found a performance degradation of about 5%.
So I think it's better to leave it as it is.

I consider the following:
By providing a separate accumulator:

The seven dependencies on the critical path in the loop were reduced by one through the fix.

Three add instructions were added to sum up the sepalated accumulators outside the for loop (reducing the dependencies to two).

Instruction dependencies are reduced to 7->1, while the number of instructions is increased by 3.
If the number of loop rotations is large, the proposed modification is expected to improve performance. However, in this case, since the number of loop count was only 2, performance degradation due to the increase in the number of instructions was dominant.

a-ghorbani · 2025-03-20T20:05:12Z

Hey, is this implementation expected to give a boost on mobile devices as well (which afaik usually have a 128-bit SVE vector), or is this meant for server-grade CPUs (with larger SVE widths)?I tested it on my Pixel 9 (which has SVE and SVE2 at a 128-bit) and couldn't see any performance improvement. so wanted to check if I am missing something or it is expected.

fj-y-saito · 2025-03-24T00:34:20Z

This imprementation will improve performance if the processor meets the following conditions:

SIMD width is 128

Number of NEON pipelines < Number of SVE pipelines

SIMD width is 256 or more

Number of NEON pipelines < Number of SVE pipelines x 2

BMalaca · 2025-11-10T11:59:38Z

@fj-y-saito what compiler did you use for these tests? I'm trying to replicate the tg128 result in a FX700 but your results are 50% faster in the tg128. Also, if you could share your cmake options that'd be great. Thanks!

fj-y-saito · 2025-11-13T22:49:44Z

I used gcc 11.3.0 for the tests.
I didn’t specify any special options for CMake (cmake -D...), just the default build.

At the moment, I'm not sure why the performance differs in this specific test.
If you could share your compiler version and the CMake options you used,
that would help me compare our setups.

Add SVE support for q6_K_q8_K

db5baa6

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 13, 2025

ggerganov approved these changes Mar 13, 2025

View reviewed changes

ggerganov merged commit d9a1452 into ggml-org:master Mar 18, 2025
47 checks passed

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025

ggml : add SVE support for q6_K_q8_K (ggml-org#12361)

e866ff8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361

ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361

Uh oh!

fj-y-saito commented Mar 13, 2025

Uh oh!

ggerganov Mar 13, 2025

Uh oh!

fj-y-saito Mar 18, 2025 •

edited

Loading

Uh oh!

Uh oh!

a-ghorbani commented Mar 20, 2025

Uh oh!

fj-y-saito commented Mar 24, 2025

Uh oh!

BMalaca commented Nov 10, 2025

Uh oh!

fj-y-saito commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361

ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361

Uh oh!

Conversation

fj-y-saito commented Mar 13, 2025

Verifying Features

performance check

Uh oh!

ggerganov Mar 13, 2025

Choose a reason for hiding this comment

Uh oh!

fj-y-saito Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

a-ghorbani commented Mar 20, 2025

Uh oh!

fj-y-saito commented Mar 24, 2025

SIMD width is 128

SIMD width is 256 or more

Uh oh!

BMalaca commented Nov 10, 2025

Uh oh!

fj-y-saito commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fj-y-saito Mar 18, 2025 •

edited

Loading