Maverick slow prompt with gpu #350

justinjja · 2025-04-27T04:14:22Z

justinjja
Apr 27, 2025

Any idea what the deal is with prompt speeds on Maverick?

1 3090 and a 56 core ddr4 epyc - Q4.5 - ~3500 tokens:
Prompt 6.24 T/s
Generation 31.7 T/s

Same but with the GPU disabled:
Prompt 95 T/s
Generation 5.6 T/s

Is it possible to leave prompt processing on the CPU and still use the GPU for generation?

saood06 · 2025-04-27T04:22:52Z

saood06
Apr 27, 2025
Collaborator

Do you mind providing the exact commands used to get those numbers (and any details about the quant used)?

0 replies

ikawrakow · 2025-04-27T06:45:38Z

ikawrakow
Apr 27, 2025
Maintainer

Please tell us your command line parameters.

I cannot run Maverick, but here is how I run Scout on a 32-core Ryzen-5975WX with a 16 GB RTX-4080:

./bin/llama-sweep-bench -m $model -t 32 -ngl 100 -ot "blk\.[0-9]\.ffn_up=CUDA0,blk\.[0-9]\.ffn_gate=CUDA0,exps=CPU" -rtr -fa -fmoe -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048

where $model roughly corresponds in size to Unsloth's UD-Q2_K-XL (~40 GiB). And here is what I get in terms of performance as measured by llama-sweep-bench

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	512	0	5.798	353.23	24.503	20.90
2048	512	2048	5.779	354.36	25.474	20.10
2048	512	4096	5.868	349.04	26.436	19.37
2048	512	6144	5.958	343.76	27.480	18.63
2048	512	8192	6.041	339.04	28.457	17.99
2048	512	10240	6.121	334.60	29.508	17.35
2048	512	12288	6.206	329.99	30.540	16.76
2048	512	14336	6.297	325.25	31.513	16.25

The above command puts all attention tensors, shared experts, and the first 10 layers of ffn_up_exps and ffn_down_exps tensors on the GPU, all remaining experts stay on the CPU. With 16k context, this requires about 14 GiB of VRAM. You can use something similar, adapting to the 24 GiB of VRAM you have, and the different size of the Maverick model.

0 replies

justinjja · 2025-04-27T16:51:53Z

justinjja
Apr 27, 2025
Author

Nice, thank you!
My command must have been bad.

Your command 5x'ed my prompt speed.
And upgrading my pcie from Gen3x4 to Gen4x16 got me another 4x on top of that.

I'm running unsloths 4.5 Bit dynamic gguf.

On my original test I'm now able to get:
128 prompt and 34 gen

New command:
CUDA_VISIBLE_DEVICES=0 ./llama-server -m mav.gguf -t 32 --n-gpu-layers 100 -ot "blk.[0-1].ffn_up=CUDA0,blk.[0-1].ffn_gate=CUDA0,exps=CPU" -fa -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048 --host 0.0.0.0 --port 8000

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Maverick slow prompt with gpu #350

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Maverick slow prompt with gpu #350

Uh oh!

justinjja Apr 27, 2025

Replies: 3 comments

Uh oh!

Uh oh!

saood06 Apr 27, 2025 Collaborator

Uh oh!

ikawrakow Apr 27, 2025 Maintainer

Uh oh!

justinjja Apr 27, 2025 Author

justinjja
Apr 27, 2025

saood06
Apr 27, 2025
Collaborator

ikawrakow
Apr 27, 2025
Maintainer

justinjja
Apr 27, 2025
Author