Faster ARM_NEON GEMM implementation for legacy quants #546

ikawrakow · 2025-06-21T14:29:08Z

It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR corresponds to PRs #531, #533, #534 and applies the on-the-fly repacking technique to Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0, IQ4_NL for the ARM_NEON implementation.
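The description above does not include the kernel code, so the following is only a minimal sketch of the general row-interleaving idea behind repacked layouts such as the `_r4`/`_r8` types, performed at matrix-multiplication time instead of when the model is created. Everything in it is an assumption made for illustration: the names `BlockQ8`, `BlockQ8xR`, `repack_rows`, `dot_repacked` and `dot_reference` are hypothetical, the scale is a `float` rather than fp16, 4 rows are interleaved, and plain scalar loops stand in for the NEON intrinsics a real kernel would use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;  // quantized values per block
constexpr int kRows  = 4;   // rows interleaved together (illustrative choice)

// Hypothetical, simplified 8-bit block: one scale plus 32 quants.
struct BlockQ8 {
    float  d;           // per-block scale (fp16 in real formats)
    int8_t qs[kBlock];  // quantized values
};

// The same block index taken from kRows consecutive rows, interleaved.
struct BlockQ8xR {
    float  d[kRows];
    int8_t qs[kRows * kBlock];
};

// Repack kRows rows so that groups of 4 quants from each row are adjacent:
// [row0 q0..3][row1 q0..3][row2 q0..3][row3 q0..3][row0 q4..7] ...
std::vector<BlockQ8xR> repack_rows(const std::array<const BlockQ8*, kRows>& rows,
                                   std::size_t n_blocks) {
    std::vector<BlockQ8xR> out(n_blocks);
    for (std::size_t ib = 0; ib < n_blocks; ++ib) {
        for (int r = 0; r < kRows; ++r) {
            out[ib].d[r] = rows[r][ib].d;
            for (int g = 0; g < kBlock / 4; ++g)
                for (int k = 0; k < 4; ++k)
                    out[ib].qs[(g * kRows + r) * 4 + k] = rows[r][ib].qs[4 * g + k];
        }
    }
    return out;
}

// Dot products of kRows repacked weight rows with one quantized activation row.
// One pass over the activation block updates all kRows accumulators.
std::array<float, kRows> dot_repacked(const std::vector<BlockQ8xR>& w, const BlockQ8* y) {
    std::array<float, kRows> acc{};
    for (std::size_t ib = 0; ib < w.size(); ++ib) {
        int32_t sum[kRows] = {};
        for (int g = 0; g < kBlock / 4; ++g)
            for (int r = 0; r < kRows; ++r)
                for (int k = 0; k < 4; ++k)
                    sum[r] += int32_t(w[ib].qs[(g * kRows + r) * 4 + k]) *
                              int32_t(y[ib].qs[4 * g + k]);
        for (int r = 0; r < kRows; ++r)
            acc[r] += w[ib].d[r] * y[ib].d * float(sum[r]);
    }
    return acc;
}

// Row-by-row reference on the original (non-interleaved) layout.
float dot_reference(const BlockQ8* x, const BlockQ8* y, std::size_t n_blocks) {
    float acc = 0.0f;
    for (std::size_t ib = 0; ib < n_blocks; ++ib) {
        int32_t sum = 0;
        for (int j = 0; j < kBlock; ++j)
            sum += int32_t(x[ib].qs[j]) * int32_t(y[ib].qs[j]);
        acc += x[ib].d * y[ib].d * float(sum);
    }
    return acc;
}
```

To use the sketch, pass pointers to 4 rows of blocks to `repack_rows`, then call `dot_repacked` with the result and an activation row; each entry of the returned array should match `dot_reference` for the corresponding row. The point of the interleaved layout is that each loaded chunk of activations is reused across several weight rows, which is the kind of access pattern SIMD dot-product instructions favor.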

Here is a PP-512 performance comparison between the main branch and this PR for Llama-3.1-8B-Instruct on an M2-Max:

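| type | t/s (main) | t/s (PR) | Speedup |
| :--- | ---: | ---: | ---: |
| Q4_0 | 83.58 | 128.41 | 1.536 |
| Q5_0 | 74.20 | 128.57 | 1.733 |
| Q6_0 | 74.25 | 128.79 | 1.735 |
| Q8_0 | 84.45 | 128.63 | 1.523 |
| IQ4_NL | 84.47 | 128.09 | 1.516 |
| Q4_1 | 74.44 | 115.36 | 1.550 |
| Q5_1 | 64.16 | 114.89 | 1.791 |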
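The Speedup column is the ratio of the PR throughput to the main-branch throughput. As a small sketch for rechecking it, the snippet below recomputes that ratio from the numbers in the table. The names `Row`, `speedup`, `pp512_results` and `print_speedups` are hypothetical helpers introduced here, not part of the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

// One row of the PP-512 table: quant type and tokens/second on each branch.
struct Row {
    std::string type;
    double main_tps;
    double pr_tps;
};

// Speedup = t/s (PR) / t/s (main).
inline double speedup(const Row& r) { return r.pr_tps / r.main_tps; }

// Values copied verbatim from the table above.
const std::vector<Row>& pp512_results() {
    static const std::vector<Row> rows = {
        {"Q4_0",   83.58, 128.41},
        {"Q5_0",   74.20, 128.57},
        {"Q6_0",   74.25, 128.79},
        {"Q8_0",   84.45, 128.63},
        {"IQ4_NL", 84.47, 128.09},
        {"Q4_1",   74.44, 115.36},
        {"Q5_1",   64.16, 114.89},
    };
    return rows;
}

// Print each type with its recomputed speedup, rounded to 3 decimals.
void print_speedups() {
    for (const Row& r : pp512_results())
        std::printf("%-7s %.3f\n", r.type.c_str(), speedup(r));
}
```

Calling `print_speedups()` prints one line per quant type with the recomputed ratio, which can be compared against the Speedup column.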

Iwan Kawrakow added 8 commits June 20, 2025 10:47

- `a0ba58e` iq2_kt and iq3_kt work with new int trellis. Much slower than the fp16 based trellis. I guess Apple doesn't have int8_t SIMD on the M2-Max GPU.
- `1f31789` q4_0: 83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s.
- `f8efac6` q5_0: 74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
- `a834e4b` q6_0: 74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
- `a78bed0` q8_0: 84.5 t/s -> 128.7 t/s. q8_0_r8 is at 131 t/s.
- `8b10279` iq4_nl: 84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s.
- `ce4fb58` q4_1: 74.4 t/s -> 115.4 t/s. There is no repacked variant.
- `aaa1647` q5_1: 64.2 t/s -> 114.9 t/s. There is no repacked variant.

ikawrakow merged commit 4f97409 into main Jun 21, 2025
