Much faster CPU prompt processing (part 2) #533
Thanks, this is huge. I feel like this will make ~70B dense models much better for hybrid inferencing on home rigs. Hope to try some quants soon! Also holy cow the

Impressive work all around!

Very impressive, @ikawrakow!
* iq4_ks 203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.
* iq4_k 175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s. PPL is actually lower!
* iq5_ks 180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s. PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct
* iq5_k - accuracy loss is too big
* iq5_k - there was a bug with the shifts ...and that's why PPL was so high. It is also high on main. This fixes it.
* iq6_k 148 t/s -> 350 t/s. There is no iq6_k_r4. PPL is actually lower because we have a bug in the existing implementation!
* iq3_k 169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.
* iq2_k 190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.
* iq2_ks 200 t/s -> 367 t/s. There is no iq2_ks_r4.

Co-Authored-By: Iwan Kawrakow <[email protected]>
This PR is a follow-up to #531 and applies the technique to `IQK` quants.

Here is a PP-512 performance comparison for LlaMA-3.1-8B-Instruct on a Ryzen-7950X CPU between the main branch and this PR:
To put things into perspective, the fastest mainline `llama.cpp` quant on this CPU is `Q4_0`, and I get 170 t/s with today's build (build: 860a9e4ee (5688)).

For a bit of history, when PR 6414 was added to `llama.cpp`, it received 92 👍, 32 🎉, 34 ❤️, and 30 🚀. It only supported `Q4_0` and `Q8_0`, and the speedup compared to the master branch at the time was in the range of 40-50%, for a PP-512 of 135 t/s on the Ryzen-7950X CPU used for the above table. There was a blog post received with great fanfare on HN.
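
To make the relative gains easier to read, here is a small sketch that turns the throughput figures quoted in the commit messages into speedup ratios. It assumes each `X t/s -> Y t/s` pair means "before -> after" for that commit, which may not match the final numbers in the PR's comparison table. The 170 t/s baseline is the mainline `Q4_0` figure quoted in the description; the names `commit_figures`, `MAINLINE_Q4_0`, and `summarize` are illustrative only.

```python
# PP-512 throughput in tokens/s as quoted in the commit messages.
# Assumption: each pair is (before, after) for the corresponding commit.
commit_figures = {
    "iq4_ks": (203, 357),
    "iq4_k": (175, 353),
    "iq5_ks": (180, 359),
    "iq6_k": (148, 350),
    "iq3_k": (169, 363),
    "iq2_k": (190, 364),
    "iq2_ks": (200, 367),
}

# Fastest mainline llama.cpp quant (Q4_0) on the same CPU, per the description.
MAINLINE_Q4_0 = 170


def summarize(figures, baseline):
    """Return (name, speedup vs. before, ratio vs. baseline) for each quant."""
    rows = []
    for name, (before, after) in figures.items():
        rows.append((name, after / before, after / baseline))
    return rows


if __name__ == "__main__":
    for name, vs_before, vs_q4_0 in summarize(commit_figures, MAINLINE_Q4_0):
        print(f"{name:7s} {vs_before:.2f}x vs before, {vs_q4_0:.2f}x vs mainline Q4_0")
```

With these inputs, every listed quant type ends up at roughly twice the quoted mainline `Q4_0` throughput, and the per-type speedup ranges from about 1.8x to about 2.4x.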