Help w/ Faster prefill on CPU-MoE? #718
-
I haven't tested Qwen3-Coder. Personally I would rather go with DeepSeek (considering the very poor performance of previous Qwen3 releases ... never mind). Regarding DeepSeek specifically, I found that 3 x RTX 3090 can handle 160k context perfectly, so with the highest-precision quant for DeepSeek (6.2bpw, from @Thireus) I am getting up to 6 tps. #477 (reply in thread)
-
Another thought regarding LLM performance. It seems handy to have different LLMs for different tasks. If the task is simple, one would prefer speed over precision, so it makes more sense to use a small quant and get faster performance. If the task is more complex, one would rather wait longer for the best available quality (lower perplexity). If so, it would make more sense to have not one but many machines, with different quants or different hardware setups (i.e. more GPUs if longer context is preferred). And in that case, why not get the cheapest hardware possible and build more machines with LLMs preloaded? For example, Sapphire Rapids Xeon engineering samples with 56C are going for $140 a pop right now lol. ) [EDIT]: that is, if an NVIDIA Blackwell with 96GB VRAM goes for ... 9k EUR ... what's the point? One could instead get, for example, the 56C Xeon, a Gigabyte MB, 512GB DDR5 RAM, some RTX 3090, etc. -- that would cost about 5k EUR. Alternatively, a Lenovo ThinkStation P620 with an additional PSU (for a second or third GPU) and DDR4 3200, which is possibly around 3.5k EUR. Lol, so one could have two machines able to run 120k context or better for the price of one GPU? I can't see how that makes sense.
-
I cannot run Qwen3-Coder-480B-A35B myself and there haven't been any discussions about this model here, so I have never seen logs. Can you post the full output from starting the server? To give suggestions, I need to see KV and compute buffer sizes, where tensors are stored, etc. Thanks!
-
Problem:
Our first long turn (prefill) is slow on CPU-MoE: both GPUs sit at ~1–10% SM while the prompt is digested and only rise once tokens start. Subsequent turns are fast (the cache helps), but the first one is the bottleneck.
Goal:
Higher GPU utilization during the first-turn prefill at 128k context, without OOMs.
⸻
What we’re asking:
Best offload policy for CPU-MoE prefill (is -op 26,1,27,1,29,1 the right starting point?).
Practical -ub / -amb ranges on 2×24 GB for 128k ctx (8-bit KV), and how to balance with --n-gpu-layers.
Good -ot FFN pinning patterns for Qwen3-Coder-480B to keep both GPUs busy without prefill OOM (see the sketch after this list for what we mean by "wider" pinning).
Preferred NUMA mode on EPYC for large prefill: --numa distribute vs --numa isolate.
Any build-time flags (e.g., GGML_CUDA_MIN_BATCH_OFFLOAD) that help CPU-MoE prefill.
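To make the -ot question concrete, this is the kind of "wider" pinning we have been experimenting with (a sketch only: the block indices are just examples extending the repro command below by one block per GPU, and roughly where 640/640 starts to OOM for us):

```bash
# Drop-in replacement for the two -ot lines in the repro command below
# (widened by one FFN block per GPU; indices are examples, not a recommendation)
  -ot 'blk.(2|3|4).ffn_.=CUDA0' \
  -ot 'blk.(5|6|7).ffn_.=CUDA1' \
```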
⸻
Minimal repro (single command)
This is the smallest command that’s stable for us on 2×4090 and shows the issue.
```bash
# host$ = Pop!_OS terminal (single-GPU server)
MODEL_FIRST="$(ls -1v $HOME/models/Qwen3-Coder-480B-A35B-Instruct/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-*.gguf | head -n1)"

# Notes:
#   -ub/-amb 512/512: 640/640 sometimes OOMs with wider -ot
#   -op 26,1,27,1,29,1: offload policy to push PP work to CUDA
CUDA_VISIBLE_DEVICES=1,0 $HOME/ik_llama.cpp/build/bin/llama-server \
  --model "$MODEL_FIRST" \
  --alias openai/local \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 131072 \
  -fa -fmoe --cpu-moe \
  --split-mode layer --n-gpu-layers 63 \
  -ctk q8_0 -ctv q8_0 \
  -b 2048 -ub 512 -amb 512 \
  --threads 20 --threads-batch 20 \
  --prompt-cache "$HOME/.cache/ik-llama/openai_local_8080.promptcache" --prompt-cache-all \
  --slot-save-path "$HOME/llama_slots/openai_local_8080" \
  --keep -1 \
  --slot-prompt-similarity 0.35 \
  -op 26,1,27,1,29,1 \
  -ot 'blk.(3|4).ffn_.=CUDA0' \
  -ot 'blk.(5|6).ffn_.=CUDA1' \
  --metrics
```
Behavior:
First long turn (128k possible; KV at q8_0) shows low SM% during prefill (often 1–10%), then ~20–30% as tokens start; we watch this as in the sketch below.
Follow-up turns with the same prefix are near-instant (prompt/slot cache doing its job).
Widening -ot (e.g., add blocks 2/7) helps a bit until VRAM pressure forces us back to 512/512 or narrower -ot.
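For reference, this is roughly how we capture those SM% numbers during the prefill window (a minimal sketch; dmon column layout can vary between driver versions):

```bash
# Per-GPU SM/memory utilization, sampled once per second while the prompt digests
nvidia-smi dmon -s u -d 1

# Alternative that is easier to log alongside CSVs
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```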
⸻
Context / what we’ve tried
Model: Qwen3-Coder-480B-A35B-Instruct (GGUF IQ5_K, 8 shards).
Approach: Experts on CPU for stability/VRAM headroom (--cpu-moe, --override-tensor exps=CPU), dense layers to GPU (--split-mode layer, --n-gpu-layers ~56–63), 8-bit KV (-ctk/-ctv q8_0).
Compute buffers: -ub/-amb 384/384 → 512/512 (stable). 640/640 sometimes OOMs when -ot is widened.
Threads: --threads 20 --threads-batch 20 (EPYC sweet spot across runs).
Prompt/slot caching: --prompt-cache … --prompt-cache-all --slot-save-path … --keep -1 + client cache_prompt:true → follow-ups are fast.
NUMA: tried distribute and isolate; not a clear win yet.
Build: CUDA on; experimenting with GGML_CUDA_MIN_BATCH_OFFLOAD=16, GGML_SCHED_MAX_COPIES=1, GGML_CUDA_FA_ALL_QUANTS=ON, GGML_IQK_FA_ALL_QUANTS=ON (configure sketch after this list).
Throughput: ~11.4–12.0 tok/s gen at 128k ctx (IQ5_K). Prefill is the bottleneck.
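For completeness, here is roughly how we configure the build (a sketch; we are assuming all of these are honored as CMake cache variables in the ik_llama.cpp tree, which is how we have been setting them -- GGML_CUDA_MIN_BATCH_OFFLOAD in particular may be plumbed differently):

```bash
# Build sketch -- assumption: every flag below is a CMake cache variable
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_IQK_FA_ALL_QUANTS=ON \
  -DGGML_SCHED_MAX_COPIES=1 \
  -DGGML_CUDA_MIN_BATCH_OFFLOAD=16
cmake --build build --config Release -j "$(nproc)"
```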
⸻
Hardware
2× RTX 4090 (24 GB each) on an AMD EPYC host running Pop!_OS.
⸻
Thanks! Any recommended -op policies, -ub/-amb ranges, -ot patterns, and NUMA/build tips for CPU-MoE prefill on 2×4090 would be hugely appreciated. Happy to run micro-sweeps and share CSVs if that helps.
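To make the micro-sweep offer concrete, this is the kind of harness we would run (a sketch only: it reuses the repro flags above, assumes a long prompt in prompt.txt plus jq and bc installed, and times a single cold prefill per -ub/-amb pair via the server's /completion endpoint with cache_prompt disabled; prompt.txt, prefill_sweep.csv and the 384/512/640 grid are our own placeholders):

```bash
#!/usr/bin/env bash
# Sketch: sweep -ub/-amb, time one cold long-prompt completion per setting,
# append the result to a CSV, tear the server down, repeat.
set -euo pipefail

SERVER="$HOME/ik_llama.cpp/build/bin/llama-server"
MODEL_FIRST="$(ls -1v "$HOME"/models/Qwen3-Coder-480B-A35B-Instruct/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-*.gguf | head -n1)"
PROMPT="$(cat prompt.txt)"
OUT="prefill_sweep.csv"
echo "ub,amb,prefill_seconds" > "$OUT"

for cfg in "384 384" "512 512" "640 640"; do
  read -r UB AMB <<< "$cfg"

  CUDA_VISIBLE_DEVICES=1,0 "$SERVER" \
    --model "$MODEL_FIRST" \
    --host 127.0.0.1 --port 8080 \
    --ctx-size 131072 -fa -fmoe --cpu-moe \
    --split-mode layer --n-gpu-layers 63 \
    -ctk q8_0 -ctv q8_0 \
    -b 2048 -ub "$UB" -amb "$AMB" \
    --threads 20 --threads-batch 20 \
    -op 26,1,27,1,29,1 \
    -ot 'blk.(3|4).ffn_.=CUDA0' \
    -ot 'blk.(5|6).ffn_.=CUDA1' &
  PID=$!

  # Wait for the model to load; record and skip the config if the server dies (e.g. OOM).
  until curl -sf http://127.0.0.1:8080/health > /dev/null; do
    kill -0 "$PID" 2>/dev/null || { echo "$UB,$AMB,server_died" >> "$OUT"; continue 2; }
    sleep 5
  done

  # One cold prefill: long prompt, a single generated token, no prompt-cache reuse.
  T0=$(date +%s.%N)
  jq -n --arg p "$PROMPT" '{prompt: $p, n_predict: 1, cache_prompt: false}' |
    curl -sf -H 'Content-Type: application/json' -d @- \
      http://127.0.0.1:8080/completion > /dev/null
  T1=$(date +%s.%N)

  echo "$UB,$AMB,$(echo "$T1 - $T0" | bc)" >> "$OUT"
  kill "$PID" 2>/dev/null || true
  wait "$PID" 2>/dev/null || true
done
```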