Latest CPU performance comparison with llama.cpp #164
-
I ran some benchmarks on an AVX2 machine (Xeon E5-2683 v4, 32-core, quad-channel Broadwell) with an IQ4_XS quant of Midnight Miqu 70B v1.5 via batched-bench (arguments: `-pps -fa -t 32 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 -c 32768`; the context only needed to be set for llama.cpp, which would otherwise skip some tests, while ik_llama.cpp defaults to 32768). llama.cpp was build 4404. No runtime repacking was used for ik_llama.cpp.
The table omits PP results: since the prompt is shared (which better matches my use case), they did not vary much between tests, but even there ik_llama.cpp was faster (~5.05 t/s vs ~2.70 t/s). I manually repacked the IQ4_XS and tested the resulting R4 version of the quant more thoroughly on ik_llama.cpp; results below.
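For reference, the run above corresponds roughly to an invocation like the following (the model file name and binary path are placeholders; the flags are the ones listed above):

```bash
# Sketch of the batched-bench run described above; file names are hypothetical.
# -pps shares the prompt across the batch, -fa enables flash attention,
# -npp/-ntg/-npl sweep prompt lengths, generated tokens, and number of parallel sequences.
./llama-batched-bench -m midnight-miqu-70b-v1.5-iq4_xs.gguf \
    -c 32768 -t 32 -fa -pps \
    -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
```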
Performance is good, but I don't understand why odd batch sizes seem to perform better. Also, is converting from IQ4_XS to IQ4_XS_R4 via the quantize command not recommended? I did it just for the test above and it went from: ... And after conversion: ... I only ask because I'm not sure whether the 80 tensors going from q5_K to iq5_k make the conversion lossy.
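In case it helps others reproduce the conversion: a minimal sketch of repacking an existing quant with the quantize tool, assuming the binary is called `llama-quantize` in your build (file names are placeholders):

```bash
# Requantize an existing IQ4_XS GGUF into the row-interleaved IQ4_XS_R4 layout.
# --allow-requantize is needed because the input file is already quantized.
./llama-quantize --allow-requantize \
    midnight-miqu-70b-v1.5-iq4_xs.gguf \
    midnight-miqu-70b-v1.5-iq4_xs_r4.gguf \
    IQ4_XS_R4
```

This only shows the mechanics of the conversion; whether the q5_K → iq5_k change it applies to some tensors is lossy is exactly the question above.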
-
@saood06 Thanks for testing.
Neither do I. I'll have to look into it.
Sorry, the goal was to make the
-
Do you plan to update the README.md with these numbers? The R4 quants are very impressive.
-
Out of curiosity, do you intend to maintain this fork as an alternative to llama.cpp indefinitely, or is it more of a testing ground before upstreaming? I'm wondering whether it's worth recommending that people run this specifically for better performance, or whether it's more of a "bleeding edge" project that people should simply wait for until it's more ready.
-
DeepSeek's design made me curious to test the MHA 35B c4ai-command-r-v01.Q8_0 on my Xeon E5-2683 v4. I ran as much context as I had RAM for. TG is set to 5 rather than 32 because it was slow.
-
There has been quite a bit of development here and in mainline `llama.cpp` since the performance results on the front page were generated, so I decided to make a new CPU performance comparison.

* `llama.cpp` build `14b699ec (4384)` (latest as of December 23, 2024)
* The `llama.cpp` `llama-bench` tool is used for `PP-512` and `TG-128`
* For `ik_llama.cpp` the command-line option `-rtr 1` is used when running `llama-bench`. This causes all model weights to be repacked into row-interleaved format (if available)
* `AVX2/Zen4` performance is on a Ryzen-7950X, `ARM` is on an M2-Max (`fp16` on the M2-Max, `bf16` on the Ryzen-7950X)

AVX2

ARM_NEON

Some observations:

* `llama.cpp`'s low-quality 4-bit quantization `Q4_0` on `ARM_NEON` (which gets repacked to a 4-row interleaved format, formerly known as `Q4_0_4_4`) is competitive. At the other end is `IQ3_S` (7X faster on the M2-Max, 5.2X faster on the Ryzen-7950X).
* Every `ik_llama.cpp` type is faster than the fastest type in `llama.cpp` for prompt processing
* For token generation, every `ik_llama.cpp` type outperforms all `llama.cpp` types except `Q4_0` and `IQ4_NL`.

Prompt processing (prefill) champion

The fastest way to do prompt processing with `ik_llama.cpp` is the new 8-bit, 8-row interleaved `Q8_K_R8` type. Getting 370 t/s for LLaMA-3.1-8B (~7.5 billion parameters excluding token embeddings) corresponds to ~5.5 TFLOPS!
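For anyone wanting to reproduce the comparison, the benchmark invocations look roughly like this (model file name and thread count are placeholders; `-rtr 1` is the `ik_llama.cpp`-specific option mentioned above):

```bash
# Mainline llama.cpp: PP-512 and TG-128 with llama-bench
./llama-bench -m llama-3.1-8b.gguf -p 512 -n 128 -t 16

# ik_llama.cpp: same run, plus runtime repacking into row-interleaved formats
./llama-bench -m llama-3.1-8b.gguf -p 512 -n 128 -t 16 -rtr 1
```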