ikawrakow
Owner

It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR corresponds to PRs #531, #533, #534, #546, #549, and applies the on-the-fly repacking technique to i-quants (IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S) for the ARM_NEON implementation.
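As a rough illustration of what repacking can mean, the sketch below interleaves quantized blocks from a group of rows so that the same block index from each row sits contiguously in memory. This is an assumption-laden toy, not the code in this PR: the function name `interleave_rows`, the group size of 4 (suggested only by the `_r4` suffix), and the use of opaque strings for blocks are all hypothetical, and the actual target layout is not described here.

```python
def interleave_rows(rows, group=4):
    """Interleave blocks from `group` consecutive rows (hypothetical layout).

    rows  : list of rows, each a list of opaque quantized blocks
    group : number of rows packed together (assumed, not taken from the PR)
    Returns one flat list per group, ordered block-major:
    block 0 of every row in the group, then block 1 of every row, and so on.
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")
    packed = []
    for start in range(0, len(rows), group):
        chunk = rows[start:start + group]
        n_blocks = len(chunk[0])
        if any(len(row) != n_blocks for row in chunk):
            raise ValueError("all rows in a group must have the same block count")
        # Block-major order: a kernel can then process the same block index
        # for all rows in the group with one contiguous read.
        packed.append([chunk[r][b] for b in range(n_blocks) for r in range(group)])
    return packed


# Four rows with two blocks each; labels stand in for real quantized data.
rows = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"], ["d0", "d1"]]
print(interleave_rows(rows))
```

Doing this on the fly means the conversion is applied at matrix-multiplication time rather than being stored in the model file, so it pays off mainly when the repacked data is reused across many tokens, as in prompt processing.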

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on an M2-Max:

| type | t/s (main) | t/s (PR) | Speedup |
|---|---:|---:|---:|
| IQ2_XXS | 55.79 | 167.55 | 3.003 |
| IQ2_XS | 46.40 | 166.65 | 3.592 |
| IQ2_S | 42.75 | 166.83 | 3.903 |
| IQ3_XXS | 51.84 | 165.56 | 3.194 |
| IQ3_S | 46.02 | 162.03 | 3.521 |
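The speedup column is the ratio of PR throughput to main-branch throughput. A minimal sketch that recomputes it from the figures above follows; the names `results` and `speedup` are made up for illustration. Because the t/s values shown are already rounded to two decimals, a recomputed ratio can differ from the reported one in the last digit.

```python
# Throughput in tokens/second, copied from the table above:
# (main branch, this PR, reported speedup)
results = {
    "IQ2_XXS": (55.79, 167.55, 3.003),
    "IQ2_XS": (46.40, 166.65, 3.592),
    "IQ2_S": (42.75, 166.83, 3.903),
    "IQ3_XXS": (51.84, 165.56, 3.194),
    "IQ3_S": (46.02, 162.03, 3.521),
}


def speedup(main_tps, pr_tps):
    """Ratio of PR throughput to main-branch throughput."""
    return pr_tps / main_tps


for name, (main_tps, pr_tps, reported) in results.items():
    print(f"{name:8s} {speedup(main_tps, pr_tps):.3f} (reported {reported:.3f})")
```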

At this point, i-quants and IQK quants are the top-tier quants for prompt processing speed on ARM_NEON.

Iwan Kawrakow added 5 commits June 23, 2025 13:50
55.8 -> 167.5 t/s. iq2_xxs is at 93.7 t/s
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s