Help with optimizing inference speed for Qwen3-30B-2507 on a Ryzen 7 8745HS #666
-
I have a Ryzen 7950X. Running this model CPU-only, I get in the range of 500 t/s for prompt processing (PP) and 30 t/s for token generation (TG). My guess is that your CPU should be able to do at least 200 t/s PP. TG is memory bound, so it is more difficult to estimate. Vulkan is supported, but the important optimisations have not been ported to Vulkan yet, so performance should be very similar to llama.cpp. There are no Docker images, and no near-term plans to provide one.
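
A rough back-of-envelope sketch of that memory bound, assuming ~3 B active parameters per token for this MoE and the ~0.58 bytes/weight implied by a 16.45 GiB file for 30.53 B parameters: each generated token streams on the order of 1.8 GB of weights from RAM, so with the 60 to 80 GB/s of usable bandwidth a dual-channel DDR5 system typically delivers,

```math
\text{TG} \;\approx\; \frac{\text{usable memory bandwidth}}{\text{active weight bytes per token}} \;\approx\; \frac{60\ \text{to}\ 80\ \text{GB/s}}{\approx 1.8\ \text{GB}} \;\approx\; 33\ \text{to}\ 44\ \text{t/s}
```

which is consistent with the ~30 t/s observed on the 7950X; an 8745HS with similar DDR5 speeds should land in the same ballpark for TG regardless of compute.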
-
Can confirm similar results on my MiniPC with the same CPU (8945HS, 8C/16T, up to 5.2 GHz) and 64 GB DDR5. Additionally, I've been experimenting with GPU offloading (RTX 3060, 12 GB): some amount of improvement on generation, but prompt processing jumped almost 2x - great performance.

```
/ik_llama.cpp/build/bin/llama-bench -m /models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -fa -fmoe -ub 768 -ot 'blk.(1[8-9]|[2-4][0-9]).ffn_.*._exps=CPU' -rtr 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
============ Repacked 90 tensors
```

| model | size | params | backend | ngl | n_ubatch | rtr | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --: | ------------: | ---------------: |
| qwen3moe ?B Q4_K - Medium | 16.45 GiB | 30.53 B | CUDA | 99 | 768 | 1 | pp512 | 445.85 ± 2.40 |
| qwen3moe ?B Q4_K - Medium | 16.45 GiB | 30.53 B | CUDA | 99 | 768 | 1 | tg128 | 36.28 ± 0.22 |

Without re-packing, my results drop quite a bit:

```
/ik_llama.cpp/build/bin/llama-bench -m /models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -fa -fmoe -ot 'blk.(1[8-9]|[2-4][0-9]).ffn_.*._exps=CPU' -rtr 0 -ub 768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
```

| model | size | params | backend | ngl | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | ------------: | ---------------: |
| qwen3moe ?B Q4_K - Medium | 16.45 GiB | 30.53 B | CUDA | 99 | 768 | pp512 | 251.61 ± 0.80 |
| qwen3moe ?B Q4_K - Medium | 16.45 GiB | 30.53 B | CUDA | 99 | 768 | tg128 | 36.23 ± 0.18 |
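
To unpack what's being tuned there (my reading of the flags, based on the numbers above): the `-ot 'blk.(1[8-9]|[2-4][0-9]).ffn_.*._exps=CPU'` override is a regex over tensor names that keeps the MoE expert weights of layers 18-49 in system RAM, while `-ngl 99` sends everything else to the GPU. `-rtr 1` repacks the CPU-resident tensors at load time into the interleaved layouts ik_llama.cpp's CPU kernels prefer, which is where the pp512 jump from ~252 to ~446 t/s comes from; tg128 barely moves because generation is still limited by streaming the expert weights from RAM. If more VRAM is available, the split can be loosened by shrinking the layer range that stays on the CPU, for example (the range below is only an illustration to adjust for your card):

```bash
# Hypothetical variation of the command above: keep only the expert tensors of
# layers 30-49 in system RAM so that more experts fit in the RTX 3060's 12 GB.
# Widen the range again (towards the original 18-49) if the GPU runs out of memory.
/ik_llama.cpp/build/bin/llama-bench -m /models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  -ngl 99 -fa -fmoe -ub 768 -rtr 1 \
  -ot 'blk.(3[0-9]|4[0-9]).ffn_.*._exps=CPU'
```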
-
Hello, here's my current setup:
I'm aiming to optimize both prompt processing speed and generation speed.
So far, my tests with llama-optimus have shown better CPU performance than GPU, though at the cost of higher latency. I've offloaded 16 GB to the iGPU, and I have 64 GB of RAM available.
The ik_llama.cpp project seems promising. Do you think I can achieve better performance with my setup? Currently, I'm getting:
I haven't found an official Docker image for ik_llama.cpp; is one planned? Also, is the project compatible with Vulkan or ROCm?
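
For comparison with the CPU-only numbers quoted in the first reply, a baseline on the 8745HS could be measured with something like the sketch below (model path and thread count are placeholders; `-ngl 0` keeps everything on the CPU even in a GPU-enabled build):

```bash
# Sketch of a CPU-only baseline run, reusing the ik_llama.cpp flags shown earlier in the thread.
# -t 8 assumes the 8 physical cores of the 8745HS; -p 512 -n 128 match the pp512/tg128 tests above.
/ik_llama.cpp/build/bin/llama-bench -m /models/Qwen3-30B-A3B-2507-Q4_K_M.gguf \
  -ngl 0 -t 8 -fa -fmoe -rtr 1 -p 512 -n 128
```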