Much faster prompt processing for k-quants (ARM_NEON) #552

ikawrakow · 2025-06-24T09:38:38Z

It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, and applies the on-the-fly repacking technique to k-quants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) and to IQ4_XS for the ARM_NEON implementation.
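As a rough illustration of why repacking on the fly can pay off for prompt processing, the sketch below unpacks a row of 4-bit blocks once and reuses the result for every activation column in the batch, instead of unpacking again for each column. Everything in it is an assumption made for illustration: the toy block format (one scale and one offset per 32 values), the function names, and the pure-Python loops. It is not the Q4_K layout and not the NEON kernel from this PR.

```python
# Toy sketch only: a made-up 4-bit block format, NOT the real k-quant layout.
NIBBLE_BLOCK = 32  # values per block (assumed for this sketch)


def quantize_block(values):
    """Quantize floats to 4-bit codes with one scale and one offset per block."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant block
    codes = [round((v - lo) / scale) for v in values]
    # Pack two 4-bit codes into each byte.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return scale, lo, packed


def quantize_row(row):
    """Split a row into blocks and quantize each block."""
    return [quantize_block(row[i:i + NIBBLE_BLOCK])
            for i in range(0, len(row), NIBBLE_BLOCK)]


def unpack_block(packed):
    """The 'repack' step: expand packed nibbles to one integer per value."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def block_dot(scale, lo, codes, x):
    """Dot product of one dequantized block with a slice of activations."""
    return sum((scale * q + lo) * xi for q, xi in zip(codes, x))


def matmul_unpack_every_time(qrows, columns):
    """Baseline: unpack each block again for every activation column."""
    out, unpacks = [], 0
    for qrow in qrows:
        out_row = []
        for col in columns:
            acc = 0.0
            for b, (scale, lo, packed) in enumerate(qrow):
                codes = unpack_block(packed)
                unpacks += 1
                acc += block_dot(scale, lo, codes,
                                 col[b * NIBBLE_BLOCK:(b + 1) * NIBBLE_BLOCK])
            out_row.append(acc)
        out.append(out_row)
    return out, unpacks


def matmul_repack_once(qrows, columns):
    """Unpack each row once, then reuse it for the whole batch of columns."""
    out, unpacks = [], 0
    for qrow in qrows:
        repacked = []
        for scale, lo, packed in qrow:
            repacked.append((scale, lo, unpack_block(packed)))
            unpacks += 1
        out_row = []
        for col in columns:
            acc = 0.0
            for b, (scale, lo, codes) in enumerate(repacked):
                acc += block_dot(scale, lo, codes,
                                 col[b * NIBBLE_BLOCK:(b + 1) * NIBBLE_BLOCK])
            out_row.append(acc)
        out.append(out_row)
    return out, unpacks
```

Both functions return the same products; only the number of unpack operations differs, and the gap grows with the batch size. That is presumably why the approach helps prompt processing (large batches) much more than token generation (batch of one).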

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on an M2-Max:

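| type | t/s (main) | t/s (PR) | Speedup |
| :--- | ---: | ---: | ---: |
| Q2_K | 85.74 | 168.07 | 1.960 |
| Q3_K | 45.68 | 170.83 | 3.740 |
| Q4_K | 58.24 | 114.78 | 1.971 |
| Q5_K | 54.88 | 114.92 | 2.094 |
| Q6_K | 47.67 | 123.98 | 2.601 |
| IQ4_XS | 71.19 | 167.84 | 2.358 |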
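The Speedup column is simply the PR throughput divided by the main-branch throughput. A small sketch that recomputes it from the figures in the table (the dictionary and function names are illustrative only):

```python
# PP-512 throughput in tokens/second, copied from the table above:
# (main branch, this PR).
PP512_TPS = {
    "Q2_K": (85.74, 168.07),
    "Q3_K": (45.68, 170.83),
    "Q4_K": (58.24, 114.78),
    "Q5_K": (54.88, 114.92),
    "Q6_K": (47.67, 123.98),
    "IQ4_XS": (71.19, 167.84),
}


def speedup(main_tps, pr_tps):
    """Ratio of PR throughput to main-branch throughput."""
    return pr_tps / main_tps


for quant, (main_tps, pr_tps) in PP512_TPS.items():
    print(f"{quant:7s} {speedup(main_tps, pr_tps):.3f}")
```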

Q2_K, Q3_K and IQ4_XS join the top-tier group in terms of prompt processing speed.

Q4_K and Q5_K get repacked to Q8_1, and this ends up being slower than Q4_K_R4/Q5_K_R4, so it may have been better to simply repack to the corresponding row-interleaved variant. This is left for a future PR.
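For readers unfamiliar with row interleaving, here is a minimal sketch of the general idea, assuming the `_R4` suffix denotes four interleaved rows. The tiny block size, the plain integer weights, and the function names are all assumptions chosen for readability; the real `_R4` types also interleave scales and packed quants, which this sketch ignores.

```python
ROWS_PER_GROUP = 4  # rows interleaved together (assumed meaning of "_R4")
TILE = 4            # values per block; tiny on purpose, real blocks are larger


def interleave_rows(rows):
    """Store block b of each row in a group back to back, then block b+1, etc."""
    assert len(rows) % ROWS_PER_GROUP == 0 and len(rows[0]) % TILE == 0
    packed = []
    for g in range(0, len(rows), ROWS_PER_GROUP):
        for b in range(0, len(rows[0]), TILE):
            for r in range(ROWS_PER_GROUP):
                packed.extend(rows[g + r][b:b + TILE])
    return packed


def matvec_interleaved(packed, n_rows, n_cols, x):
    """Walk the packed buffer sequentially, updating one group of rows at a time."""
    y = [0] * n_rows
    pos = 0
    for g in range(0, n_rows, ROWS_PER_GROUP):
        for b in range(0, n_cols, TILE):
            xb = x[b:b + TILE]  # activation slice fetched once per group of rows
            for r in range(ROWS_PER_GROUP):
                w = packed[pos:pos + TILE]
                y[g + r] += sum(wi * xi for wi, xi in zip(w, xb))
                pos += TILE
    return y
```

The result equals an ordinary matrix-vector product; the point of the layout is that the weights for a group of rows are read in one sequential pass while each activation slice is reused across the group.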

Iwan Kawrakow added 12 commits June 23, 2025 13:50

iq2_xxs

edb5f9c

55.8 -> 167.5 t/s. iq2_xxs is at 93.7 t/s

iq2_xs

8b33186

46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.

iq2_s

c52f589

42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.

iq3_xxs

2696567

51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.

iq3_s

548a5f3

46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s

q2_k

6818e14

85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.

q3_K

52ad57b

45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.

q6_k

78d531c

47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.

q4_k

d1b4b34

58.2 t/s -> 114.8 t/s. q4_k_r4 is at 130.9 t/s. As I had to add a new implementation for q8_1-quantized activations, TG became slightly faster too (25.1 -> 25.9 t/s).

q5_k

915a4a3

54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.

iq4_xs

c3c60c3

71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.

Merge remote-tracking branch 'origin/main' into ik/gemm_neon_kquants

e18b10b

ikawrakow merged commit 64f6c2d into main Jun 24, 2025

ikawrakow mentioned this pull request Jun 24, 2025

Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON #553

Merged
