What's the best tensor allocation strategy for fast prompt processing in MoE models? #15280
If you offload all non-expert layers to the GPU (using `-ngl 99` together with `--cpu-moe`, or an `--override-tensor` pattern that keeps the expert tensors on the CPU), prompt processing gets much faster. A few more things that help:

- Always have the fastest GPU with the best PCIe bandwidth as GPU0 for better PP. If necessary, change the device order with the `--device` option (or `CUDA_VISIBLE_DEVICES`).
- You can reduce the KV cache size by quantizing it (`--cache-type-k` / `--cache-type-v`) to free up a bit more VRAM and fit more MoE tensors. How much is up to you (I've seen some go down even to `q4_0`).
- Changing the `--ubatch-size` can also make a big difference for PP.
- If you are bandwidth limited, or using a small ubatch size, some recommend using `--no-op-offload` so the CPU-side expert weights aren't streamed to the GPU for every micro-batch.

A command putting these together is sketched below.
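Putting those pieces together on a single 3090 might look roughly like this. All of the flags are existing llama.cpp options, but the GGUF filename, `-ngl` value, ubatch size and KV-cache types are illustrative assumptions; tune them to what actually fits in 24 GB of VRAM.

```bash
# Rough single-GPU sketch: dense/attention layers on the GPU, expert tensors on
# the CPU, quantized KV cache, and a larger ubatch for faster prompt processing.
# The model filename and the numeric values are placeholders.
./llama-server \
  -m ./Llama-4-Scout-17B-16E-Instruct-Q5_K_M.gguf \
  -ngl 99 --cpu-moe \
  -fa --cache-type-k q8_0 --cache-type-v q8_0 \
  -ub 2048 -b 2048 \
  --main-gpu 0
```

`--cpu-moe` is effectively shorthand for an `--override-tensor` rule that pins the `ffn_*_exps` tensors to the CPU (there is also `--n-cpu-moe N` to keep only some of them there), `-fa` enables flash attention so the V cache can be quantized, and on a multi-GPU box `--device` / `CUDA_VISIBLE_DEVICES` lets you make sure the card with the best PCIe link ends up as GPU0.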
I could get very fast prompt processing with dense models by offloading some number of layers, say at least the first 1/4, to the GPU. For MoE models, however, I'm getting pitiful prompt processing speeds. Often the token eval speed even beats the prompt eval speed!
Hardware: a single RTX 3090, trying to get faster prompt processing on MoE models in the Llama 4 Scout range. It's slow even on a two-GPU setup: ~3 tok/s prompt eval and ~4 tok/s eval with a Q5_K_M quant.
What is the best strategy for improving the prompt eval speed?
How should I allocate tensors? My CPU is not particularly strong, and I have 72 GB RAM.