
Conversation

@ikawrakow
Owner

This PR is a follow-up to #531 and applies the same technique to the IQK quants.

Here is a PP-512 performance comparison between the main branch and this PR for Llama-3.1-8B-Instruct on a Ryzen-7950X CPU:

| model | size | test | t/s (main) | t/s (PR) | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 8B IQ2_KS | 2.05 GiB | pp512 | 203.08 ± 0.39 | 372.48 ± 3.69 | 1.834 |
| llama 8B IQ2_K | 2.22 GiB | pp512 | 195.04 ± 2.44 | 365.58 ± 4.25 | 1.874 |
| llama 8B IQ3_K | 3.21 GiB | pp512 | 167.65 ± 0.53 | 354.90 ± 3.44 | 2.117 |
| llama 8B IQ4_KS | 3.98 GiB | pp512 | 198.28 ± 0.57 | 362.81 ± 1.74 | 1.830 |
| llama 8B IQ4_K | 4.21 GiB | pp512 | 177.08 ± 1.71 | 360.58 ± 1.96 | 2.036 |
| llama 8B IQ5_KS | 4.91 GiB | pp512 | 182.40 ± 1.62 | 358.66 ± 3.39 | 1.966 |
| llama 8B IQ5_K | 5.14 GiB | pp512 | 158.74 ± 0.87 | 354.68 ± 0.75 | 2.234 |
| llama 8B IQ6_K | 6.19 GiB | pp512 | 147.07 ± 0.80 | 353.20 ± 3.48 | 2.402 |

To put things into perspective, the fastest mainline llama.cpp quant on this CPU is Q4_0, and I get 170 t/s with today's build (build: 860a9e4ee (5688)).
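
For anyone who wants to reproduce numbers of this kind, a llama-bench invocation along the following lines should work. This is only a sketch: the binary location, model filename, and thread count are assumptions, not taken from this PR.

```bash
# Hypothetical paths and -t value; adjust to your build and machine.
# -p 512 measures prompt processing of 512 tokens (PP-512), -n 0 skips token generation.
./bin/llama-bench -m models/Llama-3.1-8B-Instruct-IQ4_K.gguf -p 512 -n 0 -t 16
```

llama-bench averages several repetitions and reports the standard deviation, which is where the ± values in the table above come from.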

For a bit of history: when PR 6414 was added to llama.cpp, it received 92 👍, 32 🎉, 34 ❤️, and 30 🚀. It only supported Q4_0 and Q8_0, and the speedup compared to the master branch at the time was in the range of 40-50%, for a PP-512 of 135 t/s on the Ryzen-7950X CPU used for the above table. There was also a blog post that was received with great fanfare on HN.

Iwan Kawrakow added 9 commits June 17, 2025 08:24

* iq4_ks: 203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.
* iq4_k: 175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s. PPL is actually lower!
* iq5_ks: 180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s. PPL is actually lower - 7.4160 vs 7.4494 for Llama-3.1-8B-Instruct.
* iq5_k - accuracy loss is too big
* iq5_k - there was a bug with the shifts ...and that's why PPL was so high. It is also high on main. This fixes it.
* iq6_k: 148 t/s -> 350 t/s. There is no iq6_k_r4. PPL is actually lower because we have a bug in the existing implementation!
* iq3_k: 169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.
* iq2_k: 190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.
* iq2_ks: 200 t/s -> 367 t/s. There is no iq2_ks_r4.
@ubergarm
Contributor

ubergarm commented Jun 17, 2025

Thanks, this is huge. I feel like this will make ~70B dense models much better for hybrid inferencing on home rigs. Hope to try some quants soon!

Also, holy cow, the iqN_k quants are basically as fast as the iqN_ks ones!

@Vhallo

Vhallo commented Jun 17, 2025

Impressive work all around!

@Nexesenex
Contributor

Nexesenex commented Jun 17, 2025

Very impressive, @ikawrakow!
All your recent commits motivate me to bring more of IK_Llama into my Kobold.Cpp fork.
Thanks to your amazing work I already get roughly twice the overall CPU PP performance of its mainline counterpart, and I have merged most of your quants, including the latest Trellis ones!
Way to make an enthusiast happy!

ikawrakow merged commit dc96820 into main on Jun 18, 2025
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Jun 18, 2025
* iq4_ks

203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.

* iq4_k

175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.

PPL is actually lower!

* iq5_ks

180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.

PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct

* iq5_k - accuracy loss is too big

* iq5_k - there was a bug with the shifts

...and that's why PPL was so high. It is also high on main.
This fixes it.

* iq6_k

148 t/s -> 350 t/s. There is no iq6_k_r4

PPL is actually lower because we have a bug in the existing
implementation!

* iq3_k

169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.

* iq2_k

190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.

* iq2_ks

200 t/s -> 367 t/s. There is no iq2_ks_r4.

Co-Authored-By: Iwan Kawrakow <[email protected]>