Latest CPU performance comparison with llama.cpp #164
-
I ran some benchmarks on an AVX2 machine (Xeon E5-2683 v4, 32-core, quad-channel Broadwell) on an IQ4_XS of Midnight Miqu 70B v1.5 via batched bench, with arguments -pps -fa -t 32 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 -c 32768 (the context only needed to be set for llama.cpp, which would otherwise skip some tests; ik_llama.cpp defaulted to 32768), using build 4404 for llama.cpp. No runtime repacking was used for ik_llama.cpp.
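Concretely, the two invocations looked roughly like the following (a sketch; the model path is a placeholder and the binary name/location may differ between the two builds):

```bash
# llama.cpp (build 4404): the context has to be set explicitly, otherwise some tests are skipped
./llama-batched-bench -m midnight-miqu-70b-v1.5.IQ4_XS.gguf \
    -c 32768 -pps -fa -t 32 \
    -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32

# ik_llama.cpp: same arguments minus -c (it defaults to 32768); no runtime repacking used
./llama-batched-bench -m midnight-miqu-70b-v1.5.IQ4_XS.gguf \
    -pps -fa -t 32 \
    -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
```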
The table does not include PP results, as they did not vary much between tests because the prompt is shared (which is more aligned with my use case), but even there ik_llama.cpp was faster (~5.05 t/s vs ~2.70 t/s). I manually repacked the IQ4_XS and tested the resulting R4 version of the quant on ik_llama.cpp more thoroughly; results below.
Performance is good, but I don't understand why odd batch sizes seem to perform better. Also, is converting from IQ4_XS to IQ4_XS_R4 via the quantize command not recommended? I did it just for the test above, and it went from: [...] And after conversion: [...] I only ask because I'm not sure whether the 80 tensors going from q5_K to iq5_k is a lossy conversion.
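For reference, the offline repack was something along these lines (a sketch rather than the exact command; file names are placeholders, and --allow-requantize may be needed because the input file is already quantized):

```bash
# Repack the existing IQ4_XS GGUF into the row-interleaved IQ4_XS_R4 layout
./llama-quantize --allow-requantize \
    midnight-miqu-70b-v1.5.IQ4_XS.gguf \
    midnight-miqu-70b-v1.5.IQ4_XS_R4.gguf \
    IQ4_XS_R4
```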
-
@saood06 Thanks for testing.
Neither do I. I'll have to look into it.
Sorry, the goal was to make the
-
Do you plan to update the README.md with these numbers? The R4 quants are very impressive.
-
Out of curiosity, do you intend to maintain this fork as an alternative to llama.cpp perpetually, or is it more of a testing ground before upstreaming? I'm wondering whether it's worth recommending that people run this specifically for better performance, or whether it's more of a "bleeding edge" kind of project that people should just wait to get later once it's more ready.
-
I was curious, due to DeepSeek's design, to test the MHA 35B c4ai-command-r-v01.Q8_0 on my Xeon E5-2683 v4. I ran with as much context as I had RAM for. TG is set to 5, not 32, as it was slow.
-
There has been quite a bit of development here and in mainline llama.cpp since the performance results on the front page were generated, so I decided to make a new CPU performance comparison.

- llama.cpp build 14b699ec (4384) (latest as of December 23 2024)
- llama.cpp's llama-bench tool, for PP-512 and TG-128
- For ik_llama.cpp the command-line option -rtr 1 is used when running llama-bench. This causes all model weights to be repacked into row-interleaved format (if available)
- AVX2/Zen4 performance is on a Ryzen-7950X, ARM is on M2-Max (fp16 on M2-Max, bf16 on the Ryzen-7950X)

**AVX2**

**ARM_NEON**

- llama.cpp's low-quality 4-bit quantization Q4_0 on ARM_NEON (which gets repacked to a 4-row interleaved format, formerly known as Q4_0_4_4) is competitive.
- IQ3_S: 7X faster on the M2-Max, 5.2X faster on the Ryzen-7950X.
- ik_llama.cpp is faster than the fastest type in llama.cpp for prompt processing.
- Every ik_llama.cpp type outperforms all llama.cpp types except Q4_0 and IQ4_NL.

**Prompt processing (prefill) champion**

The fastest way to do prompt processing with ik_llama.cpp is the new 8-bit, 8-row interleaved Q8_K_R8 type. Getting 370 t/s for LLaMA-3.1-8B (~7.5 billion parameters excluding token embeddings) corresponds to ~5.5 TFLOPS!
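As a rough sanity check on that figure, assuming the usual ~2 FLOPs per parameter per token for a dense transformer forward pass:

$$
2 \,\tfrac{\text{FLOP}}{\text{param}\cdot\text{token}} \times 7.5\times 10^{9}\,\text{params} \times 370\,\tfrac{\text{tokens}}{\text{s}} \approx 5.6\times 10^{12}\,\tfrac{\text{FLOP}}{\text{s}} \approx 5.5\ \text{TFLOPS}
$$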
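For anyone wanting to reproduce the comparison, the per-project invocations are roughly as follows (a sketch based on the notes above; the model file shown is just a placeholder):

```bash
# mainline llama.cpp: PP-512 / TG-128 with llama-bench
./llama-bench -m Llama-3.1-8B.gguf -p 512 -n 128

# ik_llama.cpp: same benchmark, plus -rtr 1 so all weights are repacked
# into the row-interleaved layout at load time (when available)
./llama-bench -m Llama-3.1-8B.gguf -p 512 -n 128 -rtr 1
```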