
Conversation

@ikawrakow
Owner

This PR is a follow-up to #531 and applies the same technique to the IQK quants.

Here is a PP-512 performance comparison between the main branch and this PR for Llama-3.1-8B-Instruct on a Ryzen-7950X CPU:

| model | size | test | t/s (main) | t/s (PR) | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 8B IQ2_KS | 2.05 GiB | pp512 | 203.08 ± 0.39 | 372.48 ± 3.69 | 1.834 |
| llama 8B IQ2_K | 2.22 GiB | pp512 | 195.04 ± 2.44 | 365.58 ± 4.25 | 1.874 |
| llama 8B IQ3_K | 3.21 GiB | pp512 | 167.65 ± 0.53 | 354.90 ± 3.44 | 2.117 |
| llama 8B IQ4_KS | 3.98 GiB | pp512 | 198.28 ± 0.57 | 362.81 ± 1.74 | 1.830 |
| llama 8B IQ4_K | 4.21 GiB | pp512 | 177.08 ± 1.71 | 360.58 ± 1.96 | 2.036 |
| llama 8B IQ5_KS | 4.91 GiB | pp512 | 182.40 ± 1.62 | 358.66 ± 3.39 | 1.966 |
| llama 8B IQ5_K | 5.14 GiB | pp512 | 158.74 ± 0.87 | 354.68 ± 0.75 | 2.234 |
| llama 8B IQ6_K | 6.19 GiB | pp512 | 147.07 ± 0.80 | 353.20 ± 3.48 | 2.402 |

To put things into perspective, the fastest mainline llama.cpp quant on this CPU is Q4_0, and I get 170 t/s with today's build (build: 860a9e4ee (5688)).
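
For anyone who wants to reproduce numbers of this kind, a llama-bench invocation along the following lines should work. This is only a sketch: the binary location, model filename, and thread count are assumptions, not taken from this PR.

```bash
# Hypothetical paths and -t value; adjust to your build and machine.
# -p 512 measures prompt processing of 512 tokens (PP-512), -n 0 skips token generation.
./bin/llama-bench -m models/Llama-3.1-8B-Instruct-IQ4_K.gguf -p 512 -n 0 -t 16
```

llama-bench averages several repetitions and reports the standard deviation, which is where the ± values in the table above come from.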

For a bit of history: when PR 6414 was added to llama.cpp, it received 92 👍, 32 🎉, 34 ❤️, and 30 🚀. It only supported Q4_0 and Q8_0, and the speedup compared to the master branch at the time was in the range of 40-50%, for a PP-512 of 135 t/s on the Ryzen-7950X CPU used for the above table. There was also a blog post that was received with great fanfare on HN.

Iwan Kawrakow added 9 commits June 17, 2025 08:24

* iq4_ks: 203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.
* iq4_k: 175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s. PPL is actually lower!
* iq5_ks: 180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s. PPL is actually lower - 7.4160 vs 7.4494 for Llama-3.1-8B-Instruct.
* iq5_k - accuracy loss is too big
* iq5_k - there was a bug with the shifts ...and that's why PPL was so high. It is also high on main. This fixes it.
* iq6_k: 148 t/s -> 350 t/s. There is no iq6_k_r4. PPL is actually lower because we have a bug in the existing implementation!
* iq3_k: 169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.
* iq2_k: 190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.
* iq2_ks: 200 t/s -> 367 t/s. There is no iq2_ks_r4.
@ubergarm
Contributor

ubergarm commented Jun 17, 2025

Thanks, this is huge. I feel like this will make ~70B dense models much better for hybrid inferencing on home rigs. Hope to try some quants soon!

Also, holy cow, the iqN_k quants are basically as fast as the iqN_ks ones!

@Vhallo

Vhallo commented Jun 17, 2025

Impressive work all around!

@Nexesenex
Contributor

Nexesenex commented Jun 17, 2025

Very impressive, @ikawrakow!
All your recent commits motivate me to bring more of IK_Llama into my Kobold.Cpp fork.
Thanks to your amazing work I already get roughly twice the overall CPU PP performance of its mainline counterpart, and I have merged most of your quants, including the latest Trellis ones!
Way to make an enthusiast happy!

ikawrakow merged commit dc96820 into main on Jun 18, 2025
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Jun 18, 2025
* iq4_ks

203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.

* iq4_k

175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.

PPL is actually lower!

* iq5_ks

180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.

PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct

* iq5_k - accuracy loss is too big

* iq5_k - there was a bug with the shifts

...and that's why PPL was so high. It is also high on main.
This fixes it.

* iq6_k

148 t/s -> 350 t/s. There is no iq6_k_r4

PPL is actually lower because we have a bug in the existing
implementation!

* iq3_k

169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.

* iq2_k

190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.

* iq2_ks

200 t/s -> 367 t/s. There is no iq2_ks_r4.

Co-Authored-By: Iwan Kawrakow <[email protected]>