ikawrakow
It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR corresponds to PRs #531, #533, #534 and applies the on-the-fly repacking technique to Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0, IQ4_NL for the ARM_NEON implementation.
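To illustrate the general idea of row-interleaved repacking, here is a minimal scalar sketch. It is not the PR's NEON code. It assumes a simplified Q8_0-style block (a `float` scale instead of fp16, 32 signed 8-bit quants) and interleaves 4 rows in 4-byte groups. All names (`BlockQ8`, `BlockQ8x4`, `repack_rows`, `dot_rows4`) are hypothetical, and the group size of 4 is chosen because a 4-way int8 dot-product instruction such as NEON's `vdotq_laneq_s32` produces one 32-bit sum per 4 bytes, so each lane can then map to one row.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // values per block, as in Q8_0
constexpr int kRows = 4;        // rows interleaved together (assumption for this sketch)

// Simplified Q8_0-style block: one scale plus 32 signed 8-bit quants.
struct BlockQ8 {
    float d;
    int8_t qs[kBlockSize];
};

// One block from each of 4 consecutive rows, interleaved in 4-byte groups:
// [row0 q0..3][row1 q0..3][row2 q0..3][row3 q0..3][row0 q4..7]...
struct BlockQ8x4 {
    float d[kRows];
    int8_t qs[kRows * kBlockSize];
};

// Reference: dot product of one row of blocks with the activation blocks.
float dot_row(const std::vector<BlockQ8>& x, const std::vector<BlockQ8>& y) {
    assert(x.size() == y.size());
    float acc = 0.0f;
    for (size_t ib = 0; ib < x.size(); ++ib) {
        int32_t isum = 0;
        for (int j = 0; j < kBlockSize; ++j)
            isum += int32_t(x[ib].qs[j]) * int32_t(y[ib].qs[j]);
        acc += x[ib].d * y[ib].d * float(isum);
    }
    return acc;
}

// Repack rows [first, first + 4) into the interleaved layout.
std::vector<BlockQ8x4> repack_rows(const std::vector<std::vector<BlockQ8>>& rows,
                                   size_t first) {
    assert(first + kRows <= rows.size());
    const size_t nblocks = rows[first].size();
    std::vector<BlockQ8x4> out(nblocks);
    for (size_t ib = 0; ib < nblocks; ++ib) {
        for (int r = 0; r < kRows; ++r) {
            assert(rows[first + r].size() == nblocks);
            const BlockQ8& b = rows[first + r][ib];
            out[ib].d[r] = b.d;
            for (int g = 0; g < kBlockSize / 4; ++g)
                for (int k = 0; k < 4; ++k)
                    out[ib].qs[16 * g + 4 * r + k] = b.qs[4 * g + k];
        }
    }
    return out;
}

// Dot products of 4 interleaved rows with the same activations in one pass.
std::array<float, kRows> dot_rows4(const std::vector<BlockQ8x4>& x4,
                                   const std::vector<BlockQ8>& y) {
    assert(x4.size() == y.size());
    std::array<float, kRows> acc{};  // zero-initialized
    for (size_t ib = 0; ib < x4.size(); ++ib) {
        int32_t isum[kRows] = {0, 0, 0, 0};
        for (int g = 0; g < kBlockSize / 4; ++g)
            for (int r = 0; r < kRows; ++r)
                for (int k = 0; k < 4; ++k)
                    isum[r] += int32_t(x4[ib].qs[16 * g + 4 * r + k]) *
                               int32_t(y[ib].qs[4 * g + k]);
        for (int r = 0; r < kRows; ++r)
            acc[r] += x4[ib].d[r] * y[ib].d * float(isum[r]);
    }
    return acc;
}
```

Calling `dot_rows4(repack_rows(rows, 0), y)` should give the same four results as calling `dot_row(rows[r], y)` for `r = 0..3`. The difference is that each group of activation bytes is loaded once and reused across all four rows.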

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on an M2-Max:

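| type | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: |
| Q4_0 | 83.58 | 128.41 | 1.536 |
| Q5_0 | 74.20 | 128.57 | 1.733 |
| Q6_0 | 74.25 | 128.79 | 1.735 |
| Q8_0 | 84.45 | 128.63 | 1.523 |
| IQ4_NL | 84.47 | 128.09 | 1.516 |
| Q4_1 | 74.44 | 115.36 | 1.550 |
| Q5_1 | 64.16 | 114.89 | 1.791 |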

Iwan Kawrakow added 8 commits June 20, 2025 10:47
Much slower than the fp16 based trellis. I guess, Apple doesn't
have int8_t SIMD on the M2-Max GPU.
83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s
74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.
84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s
74.4 -> 115.4 t/s. There is no repacked variant
64.2 t/s -> 114.9 t/s. There is no repacked variant.