Replies: 7 comments 8 replies
-
Thank you for these results. Quite amazing that it works reasonably well on an almost 8-year-old CPU! I'm curious whether you might get better performance by repacking the model (unlikely for TG, very likely for PP). You can repack either on the fly by adding
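A minimal sketch of what on-the-fly repacking can look like in ik_llama.cpp, assuming the -rtr (run-time repack) switch and a placeholder model path (not necessarily the exact option meant above):

```
# Hedged example: -rtr asks ik_llama.cpp to repack the weights into an
# interleaved layout at load time (path and thread count are placeholders).
./llama-sweep-bench -m /models/Qwen3-235B-A22B-128K-Q8_0.gguf -rtr -fa -t 20
```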
This shouldn't take very long, even for the 235B model. Another note: at least on the CPUs that I have available, one gets better performance using
-
Note: when I run llama-server with
-
This grabbed my attention, as I have never seen any significant difference between Attempt 1 and Attempt 2. Hence, I think that the outcome is largely determined by the quality of the quantized model and by some luck. We know that in a random process (as we have here) slight differences in the computed token probabilities can make the model go down a very different path, even if the same seed was used.
-
Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation (2.7 token/s -> 3.2 token/s) by reducing the number of experts used (from Top-8 to Top-6), without a significant drop in quality. Parameter:
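A minimal sketch of such an override, assuming the --override-kv mechanism and the qwen3moe.expert_used_count metadata key (model path is a placeholder):

```
# Hedged example: load the model but use only the top 6 experts per token
# instead of the default 8 (key name assumes the qwen3moe architecture).
./llama-server -m /models/Qwen3-235B-A22B-128K-Q8_0.gguf \
  --override-kv qwen3moe.expert_used_count=int:6
```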
-
You forgot to set -nkvo (--no-kv-offload)?
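For context, a minimal sketch of where -nkvo applies, assuming a partial GPU offload run (model path and layer count are placeholders):

```
# Hedged example: -nkvo (--no-kv-offload) keeps the KV cache in system RAM
# instead of VRAM, which matters when only some layers are offloaded.
./llama-server -m /models/Qwen3-235B-A22B-128K-Q8_0.gguf -ngl 20 -nkvo
```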
-
You cannot compare
-
Well, I use
-
Introduction
The Qwen3 models were officially released on April 29, 2025. The flagship is a mixture-of-experts (MoE) model with 235B total parameters and 22B activated per token; its main features are the following.
Support for qwen3moe was added in PR #355. I tried to run the biggest model, Qwen3-235B-A22B-128K-GGUF, with ik_llama.cpp on my workstation. I wanted better generation quality, and my system has sufficient memory (512 GB RAM in total), so I chose the relatively high-quality Q8_0 quantization.
System Info
Here is my system info (including hardware and software).
Memory Performance
CPU-backend performance
The command line for ik_llama.cpp llama-sweep-bench is:
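A minimal sketch of such an invocation, matching the parameters reported below (the model path is a placeholder and the exact original options may differ):

```
# Hedged example: CPU-only sweep benchmark, 16K context, batch 2048,
# ubatch 512, flash attention on, 20 threads, no GPU layers.
./llama-sweep-bench -m /models/Qwen3-235B-A22B-128K-Q8_0.gguf \
  -c 16384 -b 2048 -ub 512 -fa -t 20 -ngl 0
```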
ik_llama.cpp CPU-only performance data (Qwen3-235B-A22B-128K-Q8_0)
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 0, n_threads = 20, n_threads_batch = 20
ik_llama.cpp CPU-only performance data (Qwen3-30B-A3B-128K-GGUF)
I also experimented with Qwen3-30B-A3B-128K-Q8_0 (unsloth/Qwen3-30B-A3B-128K-GGUF). Here are the results; the performance is much better than I thought.
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 0, n_threads = 20, n_threads_batch = 20
Profiler Data
I also used Intel VTune Profiler 2025.0.1 to capture some interesting data while running llama-server with Qwen3-30B-A3B-128K-Q8_0; I will show that data as well.
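A minimal sketch of driving such a capture from the VTune command line, assuming the hotspots analysis and placeholder paths (the actual collection settings may have differed):

```
# Hedged example: collect a hotspots profile of llama-server under VTune.
vtune -collect hotspots -result-dir ./vtune_qwen3 -- \
  ./llama-server -m /models/Qwen3-30B-A3B-128K-Q8_0.gguf -t 20 -fa
```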