qwen3 metrics on ancient hardware (2x Xeon vs 2x P100) #459
-
Your regex is incorrect, so everything goes to the GPU. Try shortening it.
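As an illustration (not the exact regex suggested here), a character-class range keeps the override short instead of enumerating every layer; the layer range and target device below are placeholders:

# long form: enumerate each expert layer explicitly
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12).ffn_.*_exps.=CUDA0"
# equivalent short form: match layers 0-12 with a range
-ot "blk\.([0-9]|1[0-2])\.ffn_.*_exps\.=CUDA0"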
-
The regex works, I can see the override being applied, but thanks for the hint about shortening it. Since both mainline and ik_llama were ignoring the --tensor-split I set, I got around it by explicitly overriding every tensor, distributing them equally between the 2x 16GB GPUs. This let me fill both cards, but performance in both repos was pretty bad, like 3 pp / 5 tg, and this didn't change with -nkvo, so I'm not sure what's going on. Tried both ubergarm/unsloth quants, -fmoe/-fa on/off; the offload split was 10 expert layers on each GPU. I found this enlightening: https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
-
The attention tensors are on the GPU, so you don't really want to use -nkvo. What is the quantization type you are using? A full log, including the command line, is always very useful. If the log output is too long, you can put it in a gzipped text file and attach it to the issue.
-
When I do "exps.=CPU", only 6GB total are offloaded to the GPUs. Is that normal?
RAM is 4x 2400 DDR4. Build flags:
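For what it's worth, with only the routed experts pushed to CPU, the GPUs are left holding just the attention, norm, router, and output tensors, so seeing only a handful of GB on the GPUs is plausible for this model. A hedged sketch of that layout, reusing the flags from the benchmark commands later in the thread (model path is a placeholder):

# everything offloaded except the routed experts, which stay in system RAM
CUDA_VISIBLE_DEVICES=0,1 bin/llama-bench -m moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf \
  -ngl 94 -ot "blk\..*\.ffn_.*_exps\.=CPU" -fa 1 -fmoe 1 -rtr 1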
-
This tensor override thing makes no sense. I'm testing the Q2_K quant; it's using 40% of VRAM, and if I assign even one more tensor/layer the CUDA malloc explodes.
-
If you compile with pipeline parallel copies of 1, I think it's the same as putting -ngl 94. You can also try 93 and put some ffn*experts on the GPUs in order (0, 1, 2, 3, etc.). The way it looks now, you're randomly throwing layers all over the place. Those "blk.20.ffn_norm.weight" tensors don't really do anything to improve speed when on GPU. I had the best luck with numa distribute. Maybe you should do a benchmark of your RAM bandwidth with mlc and see what you get. Then you'd know if it's "good" or not.
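A hedged sketch of both suggestions, assuming ik_llama.cpp exposes the same GGML_SCHED_MAX_COPIES cmake option as upstream llama.cpp and that Intel's Memory Latency Checker is installed as mlc:

# build with a single pipeline-parallel copy to shrink the per-GPU compute buffers
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j

# measure RAM bandwidth per NUMA node (local vs remote)
sudo ./mlc --bandwidth_matrix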
-
@Fuckingnameless There is some more discussion on this elsewhere. Also, as @Ph0rk0z says, you might want to try compiling with pipeline parallel copies set to 1. Take your time, be systematic about your changes and regex, and you'll get it dialed in. If your 128GB RAM is in two NUMA nodes, consider changing the BIOS to try to get it into a single NUMA node. Otherwise, if you are forced to use multiple NUMA nodes, like @Ph0rk0z mentions, you can try stuff like numactl --interleave=all and disabling numa_balancing. Have fun!
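A rough sketch of those NUMA knobs (binary path and the remaining arguments are placeholders):

# turn off automatic NUMA balancing; the llama-bench warning further down flags this too
echo 0 | sudo tee /proc/sys/kernel/numa_balancing

# single-node style: pin threads and memory to the node that owns the GPUs
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-bench -m model.gguf ...

# multi-node style: interleave allocations and spread the threads
numactl --interleave=all ./build/bin/llama-bench -m model.gguf ... --numa distribute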
-
Like I said, I have to explicitly set these norm layers, otherwise it doesn't offload to GPU 2.
Yeah, I need to do some benchmarks. With 4 active experts, TG goes up 60%. NUMA is not working right for me; I need to fiddle with snoop modes is my guess.
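The benchmark commands further down use ik_llama.cpp's -ser flag, which appears to be how the 4-active-expert runs were done here; assuming that, a sketch:

# benchmark with reduced active experts via -ser (as in the commands below)
CUDA_VISIBLE_DEVICES=0,1 bin/llama-bench -m moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf \
  -ngl 94 -fa 1 -fmoe 1 -rtr 1 -ser 4,1 -p 64,128 -n 32,64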
-
I'll check the --interleave=all. I can confirm numa_balancing=0 helps even when doing --cpunodebind=0. I was actually using 128GB with 4x32GB RAM sticks on a single node yesterday.
I thought that was the default. I also read somewhere that running 2 copies, aka data parallel, could be interesting on dual-socket systems?
-
@Fuckingnameless
Yeah, best performance today tends to come from setting all RAM into a single NUMA node and then not bothering with numactl etc. It keeps things a bit simpler that way too, so that might be your best BIOS config for now.
No, that is not the default; you have to set it yourself at build time. "Data parallel" is not implemented in any llama.cpp in the sense of loading the entire model weights into RAM multiple times, once per NUMA node. It does exist somewhat in ktransformers when you compile it with its NUMA option. Things like vLLM and SGLang do have "proper" tensor-parallel and data-parallel, but only across multi-GPU nodes, not CPU NUMA nodes afaict. I have a whole discussion on the NUMA stuff elsewhere, with a link to that experimental mirror branch and more discussion there.
-
Exact same results as taking a single layer off. Technically you manually decide what's on GPU anyway so NGL becomes irrelevant.
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12).ffn.*=CUDAx" \ or exp marked layers -ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.exps.=CUDAx" If you do it sequentially and just fill as many layers before OOM, you'll have a better time. Put the -ot CPU line last to catch whatever isn't on gpu. CUDA0, CUDA1, on and on. -ot line for each. |
-
For some reason it's not respecting what I set. I just checked again, and whatever exps are not redirected to CPU via -ot go into CUDA1. I updated the OP with benchmarks.
-
Try some different regex for the CPU override. In the benchmark command line above, it's missing the wildcard.
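For example, if the CPU line ends in ffn_. the dot only matches one more character; adding the wildcard so the pattern covers the full tensor name would look something like this (untested sketch):

# before: ffn_. stops after a single character
-ot "blk.([3][1-9]|[4-9][0-9]).ffn_.=CPU"
# after: ffn_.* covers ffn_gate_exps, ffn_up_exps, ffn_down_exps, etc.
-ot "blk.([3][1-9]|[4-9][0-9]).ffn_.*=CPU"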
-
$ CUDA_VISIBLE_DEVICES=0,1 bin/llama-bench -t 31 -p 64,128,256 -n 32,64,128 -m moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "blk.([0-9]|[1][0-3]).ffn_.=CUDA1","output.=CUDA1","blk.([0-3][0-9]|4[0-6]).ffn_norm.=CUDA1" -ot "blk.(4[7-9]|[5-9][0-9]).ffn_norm.=CUDA0" -ot "blk.([3][1-9]|[4-9][0-9]).ffn_.=CPU" -fa 1 -fmoe 1 -rtr 1 --numa distribute
Norm layers split 1/1, output layers on the last GPU. P100, 2 nodes, 2 CPUs.
4 experts, ubergarm's quant.
-
Edit: a discussion makes a lot more sense. Thanks @ikawrakow
-
Trying to figure out why I was seeing a performance drop with NUMA CPU inference on Debian: I tried the xanmod 6.12/6.14 kernels, upgraded to debian-testing, and tried CUDA 12-8/12-9, one change at a time. The best I could get was 32 t/s on Qwen3 30B. Booted back into vanilla Linux Mint; I'm now a distrohopper.
-
235B Q2 not so bad?
-
So I set a snoop mode in the BIOS called Home Dir w/ OSB+ (it does a kind of speculative snoop broadcast), and it gives a big boost with NUMA enabled.
All tests with HT off.
P100, NUMA off, numa_balancing=0
CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 ~/Projects/ik_llama.cpp/build/bin/llama-bench -t 16 -p 64,128,256 -n 32,64,128 -m /media/gguf/moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "([3][2-9]|[4-9][0-9]).ffn_.exps.=CPU" -ot "([4][7-9]|[5-9][0-9]).(attn|ffn).(q|k|v|norm|inp|output).=CUDA1","([11|12|13|14|15]).ffn_.*_exps.=CUDA1" -fa 1 -fmoe 1 -rtr 1 -sm layer --numa isolate -amb 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
4 experts
--numa distribute, GPUs on node0, numa_balancing=1
CUDA_VISIBLE_DEVICES=0,1 ~/Projects/ik_llama.cpp/build/bin/llama-bench -t 31 -p 64,128,256 -n 32,64,128 -m /media/gguf/moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "([3][2-9]|[4-9][0-9]).ffn_.exps.=CPU" -ot "([4][7-9]|[5-9][0-9]).(attn|ffn).(q|k|v|norm|inp|output).=CUDA1","([11|12|13|14|15]).ffn_.*_exps.=CUDA1" -fa 1 -fmoe 1 -rtr 1 -sm layer --numa distribute -amb 512 -ser 4,1
ubergarm's quant
build: b3036a8 (3701)
and for the giggles:
CPU Only xeon 2697A v4 x2, numa_balancing=1, 4 experts
CUDA_VISIBLE_DEVICES= ~/Projects/ik_llama.cpp/build/bin/llama-bench -t 31 -p 32,64,128 -n 32,64,128,256 -m /media/gguf/moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 0 -nkvo 0 -fa 1 -fmoe 1 -rtr 1 -sm layer --numa distribute -amb 512 -ser 4,1
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
~~### What happened?~~

~~when I try to load the 235B IQ3k/Q4 on 32GB VRAM + 128GB RAM it throws this error~~

~~![Image](https://github.com/user-attachments/assets/35f4f79c-44a0-4c89-b901-d591d6d00c77)~~

~~I tried many regex combinations redirecting tensors to CUDA1 etc. but it always tries to allocate 100GB+ on CUDA0 as buffer~~

~~![Image](https://github.com/user-attachments/assets/94857d2d-7fe3-4a78-8e54-888df09e19d2)~~

~~Edit: fixed by disabling cublas~~