-
Anyone who has the horsepower to run Qwen3-235B-A22B, please feel free to add your results to this discussion.
-
Just "cooked" my first quant of this one, and it looks pretty good! The only other somewhat comparable benchmark I've seen is from the latest ktransformers v0.3 on a rig with a better GPU and more RAM.

Logs

In the meantime, I ran a quick comparison of the Q8_0 on the remote Threadripper Pro 24-core using a single RTX A6000 48GB VRAM GPU and offloading the rest to CPU for a somewhat similar "hybrid inference" test.

ik_llama.cpp logs

llama.cpp logs
Interestingly, I could hear my fans spin up and down periodically every 15 seconds or so as the CPU ramped up and the GPU dropped down a bit. I noticed this visually more on the Q8_0 test.
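A minimal sketch of one way to watch that alternating CPU/GPU load from the logs, assuming `nvidia-smi` and a plain `/proc/loadavg` poll (neither tool is mentioned in this thread):

```bash
# Sample GPU power/utilization once per second alongside a coarse CPU load reading,
# so the CPU-up/GPU-down phases described above show up in the two log files.
nvidia-smi dmon -s pu -d 1 > gpu_util.log &
while sleep 1; do
    echo "$(date +%T) $(cat /proc/loadavg)" >> cpu_load.log
done
# Stop with Ctrl-C (and `kill %1` for the background dmon) once the run finishes.
```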
-
@ubergarm Can you try the attached
-
OK, after thinking more about this, I can see why mainline has better large-context TG performance on CUDA for Qwen3-235B-A22B (and, as previously noted, for LLaMA-4): these models have quite a large GQA factor, and I'm still using the old CUDA FA implementation that did not take advantage of that. Improved GQA FA performance was added in this mainline PR.
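For a sense of scale, the GQA factor is just the ratio of query heads to KV heads; assuming the published Qwen3-235B-A22B attention config of 64 query heads and 4 KV heads (numbers not taken from this thread), that gives:

```bash
# GQA factor = n_head / n_head_kv; a larger factor lets an FA kernel reuse each
# loaded K/V tile across more query heads.
echo $(( 64 / 4 ))   # -> 16
```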
-
Hello, @artus-dev and @ubergarm asked me to run some sweeps for Qwen3-235B-A22B. My homelab has a substantial server with a VM in it that has the following allocation:
I've run four sweeps as follows:
Both ik_llama.cpp and llama.cpp were compiled with CUDA and OpenBLAS support. The sweeps were run with the following quants:
The llama.cpp tests were conducted with the quant listed above. For the GPU tests, I kept the layer offloads identical between the two. This means there was slightly less GPU VRAM utilization for the llama.cpp test because that model is smaller, but I felt it was the best way to keep the tests as comparable as I could manage:
Logs for the runs are as follows:

- ik_llama.cpp CPU logs
- ik_llama.cpp GPU logs
- llama.cpp CPU logs
- llama.cpp GPU logs
CPU performance PP comparison:

GPU performance PP comparison:
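As a sketch of what "identical layer offloads" means in practice, both binaries can be given the same `-ngl` (and context/thread) settings so only the backend differs; the binary names and values below are assumptions, not the exact commands behind these sweeps:

```bash
# Same offload policy for both builds; only the backend code differs.
COMMON_ARGS="-m Qwen3-235B-A22B-quant.gguf -c 16384 -t 32 -ngl 40"

./ik_llama.cpp/build/bin/llama-sweep-bench $COMMON_ARGS -rtr -fmoe
# Mainline may need llama-bench or a ported sweep tool instead of llama-sweep-bench.
./llama.cpp/build/bin/llama-sweep-bench   $COMMON_ARGS
```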
-
Thank you for these results! I think it would be better to disable BLAS for both; CPU prompt processing with BLAS enabled is likely what is holding back the PP numbers. Prompt processing speed on CUDA will also benefit from larger u-batches. The CUDA TG results are somewhat surprising (sharp performance drop with context length).
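A minimal sketch of such a rebuild, assuming CMake options along the lines of mainline llama.cpp (whether ik_llama.cpp uses exactly the same option names is an assumption here):

```bash
# Rebuild with CUDA on and BLAS off, then rerun the sweeps.
cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build build --config Release -j
```

Larger u-batches for prompt processing are typically set with `-ub` (and `-b`), for example `-ub 2048` (a typical value, not necessarily the one intended here), at the cost of some extra VRAM.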
-
Some more data, this time compiled w/ no BLAS:
Note: the ik_llama.cpp CPU NO BLAS run did hit a CUDA error on the very last iteration.

ik_llama.cpp CPU NO BLAS logs
ik_llama.cpp 2x GPU NO BLAS logs
-
ik_llama.cpp, no cuda, no blas:
ik_llama.cpp CPU NO CUDA NO BLAS logs
-
Thanks! So, CPU PP is much better now and more in line with what I would have expected. Looking at the TG graph, it is clear that I still need to work on improving how the work is divided between the threads. The Qwen3 MoE models have a high GQA factor, so one should be able to achieve ~70-80% of the zero-context performance at 16k tokens. But I see that the Epyc 9355 has 32 cores, so are we using hyper-threading?
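A quick way to check the physical-core vs. hardware-thread count and to pin the run to physical cores only; the `-t 32` below follows from the 32-core EPYC 9355 mentioned above, everything else (binary, model path) is a placeholder:

```bash
# Physical cores vs. hardware threads: compare "Core(s) per socket" x "Socket(s)"
# against "CPU(s)".
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))'

# Use one thread per physical core rather than one per hardware thread, which
# usually helps memory-bandwidth-bound token generation.
./build/bin/llama-sweep-bench -m Qwen3-235B-A22B-quant.gguf -c 16384 -t 32 -rtr -fmoe
```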
-
The Qwen3 models were officially released, and support was added in `ik_llama.cpp` in PR #355, so I was curious to run some performance benchmarks. As much as I would like to try the flagship model, I don't have enough horsepower for that, so I experimented with Qwen3-30B-A3B, the 30B total, 3B active parameter MoE model.

This time I'm using a custom quantization where all experts are quantized with `IQ4_XS`, all attention tensors with `Q5_K`, and the output tensor is `Q6_K`. PPL for this model is only 1.25% above the PPL of the `bf16` model, so it is a pretty decent quality quantization. Benchmarks are run on a Ryzen-7950X system with an RTX-4080 GPU. Compared are the latest `ik_llama.cpp` and `llama.cpp` versions as of this morning (April 29, 2025).

**CPU-only performance**
The `llama.cpp` command line is similar to the one used for `ik_llama.cpp`, except that there is no `-rtr -fmoe`. I'm also including mainline results without Flash Attention (FA); in that case the K-cache is quantized with `Q8_0` and the V-cache is `fp16`.
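As a rough illustration only (the exact command is not reproduced above), a CPU-only sweep along these lines might look as follows, assuming ik_llama.cpp's `llama-sweep-bench` tool; the model path, thread count, and context size are my own placeholders:

```bash
# ik_llama.cpp CPU-only sweep with run-time repacking and fused MoE enabled.
./build/bin/llama-sweep-bench \
    -m Qwen3-30B-A3B-IQ4_XS.gguf \
    -c 16384 -t 16 -fa \
    -rtr -fmoe
# The mainline llama.cpp runs would use the same arguments minus -rtr -fmoe;
# for the no-FA mainline runs, -fa is dropped (Q8_0 K-cache, fp16 V-cache).
```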
The following graph shows TG performance as a function of `N_KV`, the number of tokens in the KV cache. Performance is pretty close for an empty KV cache, with the performance gap increasing with `N_KV`. At 16k tokens `ik_llama.cpp` is 44% faster than mainline without FA, and 3.3 times faster than mainline with FA enabled.

The next graph shows prompt processing (PP) speed as a function of `N_KV`. As usual for CPU-only inference, `ik_llama.cpp` is much faster than mainline for PP: 3.3X for small `N_KV`, increasing to 3.9X at 16k tokens. This is compared to mainline without FA. Compared to `llama.cpp` with FA enabled, `ik_llama.cpp` is 11.2X faster.

llama.cpp CPU-only performance data without FA

llama.cpp CPU-only performance data with FA enabled

ik_llama.cpp CPU-only performance data
**Hybrid inference**

The custom `IQ4_XS` model is 15.4 GiB, so it cannot be fully loaded on my 16 GB RTX-4080 GPU. This gives me the opportunity to try hybrid GPU+CPU inference via tensor overrides on both systems. Everything is offloaded to the GPU except for the last 14 layers of the experts tensors, which leaves enough free VRAM to go up to a context of 32k tokens. In the case of `ik_llama.cpp`, run-time repacking (for the experts left on the CPU) and the fused MoE `(ffn_up*X)*silu(ffn_gate*X)` operation are enabled via `-rtr -fmoe`.
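A sketch of what such a tensor-override invocation can look like (the original command block is not reproduced above; the binary name, layer-range regex, and flag values here are my own assumptions):

```bash
# Offload everything to the GPU except the expert tensors of the last 14 layers,
# which stay on the CPU. For a 48-layer Qwen3-30B-A3B that would be blocks 34-47;
# adjust the regex to the actual layer count.
./build/bin/llama-sweep-bench \
    -m Qwen3-30B-A3B-IQ4_XS.gguf \
    -c 32768 -t 16 -ngl 99 -fa \
    -ot "blk\.(3[4-9]|4[0-7])\.ffn_.*_exps=CPU" \
    -rtr -fmoe
# -rtr -fmoe are ik_llama.cpp only; the mainline llama.cpp run drops them but keeps
# the same -ngl and -ot settings.
```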
The next graph shows TG performance as a function of `N_KV`. Compared to DeepSeek, the performance advantage of `ik_llama.cpp` here is smaller and decreases with increasing `N_KV`. As there is no MLA involved and we are dealing with just a standard attention mechanism, the CUDA FA improvements in this mainline PR that I have not (yet) ported over to `ik_llama.cpp` counteract the performance gains from the fused MoE operations in `ik_llama.cpp`, so we end up with relatively close TG performance.

The next graph shows PP performance as a function of `N_KV`. Here too the performance gap decreases with `N_KV`, from about 60% for small `N_KV` to about 18% at 32k tokens.

llama.cpp hybrid GPU+CPU performance data

ik_llama.cpp hybrid GPU+CPU performance data