Introduction
I tried to run the Qwen3-30B-A3B-GGUF model with ik_llama.cpp. Because I have an NVIDIA GPU (RTX 4060 Ti) with 8 GB of VRAM in my PC, I compiled ik_llama.cpp with the CUDA backend and ran it with `-ot exps=CPU` to offload the experts (ffn_down_exps, ffn_up_exps, gate_exps) to the CPU.

Build options:
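A typical CUDA-enabled build of ik_llama.cpp looks roughly like the sketch below; these flags are illustrative and not necessarily the exact options used for these tests:

```sh
# Illustrative CUDA build of ik_llama.cpp (flags are an assumption, not the exact options used here)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```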
I tested q8_0 quantization and bf16 models. With the q8_0 model, the prompt processing speed (PP) and token generation speed (TG) are both very good: I got up to 165 token/s PP and 18 token/s TG, which is a good start. But when I ran the bf16 model, the PP speed was much slower, only 30-40 token/s PP and 11-12 token/s TG. That is not even as good as the CPU-only ggml backend (about 51 token/s PP, 11 token/s TG). This performance is obviously not normal for a bf16 model, and it confuses me. I've also found that the GPU spends quite a bit of time on copies every time the token processing phase runs, but quantized models (like q8_0) don't show this problem.

The configurations I tested:

- CPU backend, bf16 model (Qwen3-30B-A3B-BF16)
- CUDA backend, bf16 model (Qwen3-30B-A3B-BF16)
- CUDA backend, q8_0 model (Qwen3-30B-A3B-Q8_0)
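For reference, the three configurations above correspond roughly to invocations like the following; model paths, context size, and thread count are placeholders rather than the exact commands used:

```sh
# Hypothetical invocations for the three test configurations (paths and values are placeholders)

# 1) CPU-only build, bf16 model
./llama-cli -m Qwen3-30B-A3B-BF16.gguf -t 16 -c 8192

# 2) CUDA build, bf16 model, expert tensors kept on CPU
./llama-cli -m Qwen3-30B-A3B-BF16.gguf -ngl 99 -ot exps=CPU -t 16 -c 8192

# 3) CUDA build, q8_0 model, expert tensors kept on CPU
./llama-cli -m Qwen3-30B-A3B-Q8_0.gguf -ngl 99 -ot exps=CPU -t 16 -c 8192
```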
System Info

Here is my system info (hardware and software):
Benchmark

Here are the results of my initial llama-sweep-bench testing for PP speed and TG speed. The command line for ik_llama.cpp llama-sweep-bench is:
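A representative llama-sweep-bench invocation for this kind of setup might look like the sketch below; the model path, context size, and thread count are assumptions, not the exact values used:

```sh
# Hypothetical ik_llama.cpp llama-sweep-bench run (values are placeholders)
./llama-sweep-bench -m Qwen3-30B-A3B-Q8_0.gguf \
    -c 8192 -ngl 99 -ot exps=CPU -t 16 -fa
```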
ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-Q8_0)

ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-BF16)