ikawrakow

It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, and applies the on-the-fly repacking technique to k-quants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) and to IQ4_XS for the ARM_NEON implementation.

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on an M2-Max:

| Type | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K | 85.74 | 168.07 | 1.960 |
| Q3_K | 45.68 | 170.83 | 3.740 |
| Q4_K | 58.24 | 114.78 | 1.971 |
| Q5_K | 54.88 | 114.92 | 2.094 |
| Q6_K | 47.67 | 123.98 | 2.601 |
| IQ4_XS | 71.19 | 167.84 | 2.358 |

Q2_K, Q3_K and IQ4_XS join the top-tier group in terms of prompt processing speed.

Q4_K and Q5_K get repacked to Q8_1, and this ends up being slower than Q4_K_R4/Q5_K_R4, so it may have been better to simply repack to the corresponding row-interleaved variant. This is left for a future PR.
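To illustrate what "row-interleaved" means here, the following is a minimal sketch and not the code from this PR. It assumes a layout in which 4 rows of already-dequantized `int8_t` values are interleaved in chunks of 4 bytes, so that a kernel can load values from 4 rows with contiguous reads. The function name `interleave_4_rows`, the chunk size, and the plain `int8_t` input are all assumptions made for illustration; the real block formats also carry scales and other per-block data that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the PR): interleave 4 rows of int8 values
// in chunks of 4 bytes. `src` holds the 4 rows back to back (row-major),
// each row having `n_cols` values.
//
// Output layout (assumed for illustration):
//   [row0 c0..3][row1 c0..3][row2 c0..3][row3 c0..3][row0 c4..7][row1 c4..7]...
std::vector<int8_t> interleave_4_rows(const std::vector<int8_t>& src,
                                      std::size_t n_cols) {
    constexpr std::size_t kRows  = 4;  // rows packed together
    constexpr std::size_t kChunk = 4;  // bytes taken from each row at a time

    assert(src.size() == kRows * n_cols);
    assert(n_cols % kChunk == 0);

    std::vector<int8_t> dst(src.size());
    std::size_t out = 0;
    for (std::size_t c = 0; c < n_cols; c += kChunk) {      // walk along the columns
        for (std::size_t r = 0; r < kRows; ++r) {           // visit each of the 4 rows
            for (std::size_t k = 0; k < kChunk; ++k) {      // copy one 4-byte chunk
                dst[out++] = src[r * n_cols + c + k];
            }
        }
    }
    return dst;
}
```

With 4 rows of 8 columns filled with the values 0 to 31 in row-major order, the first 8 output bytes are the first chunk of row 0 followed by the first chunk of row 1. Doing this kind of rearrangement once per matrix multiplication is only worthwhile when its cost is amortized over many activation columns, which is the case for prompt processing.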

Iwan Kawrakow added 12 commits June 23, 2025 13:50
55.8 -> 167.5 t/s. iq2_xxs is at 93.7 t/s
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.

As I had to add a new implementation for q8_1-quantized activations, TG became slightly faster too (25.1 -> 25.9 t/s).
54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.