What's the best tensor allocation strategy for fast prompt processing in MoE models? #15280
If you offload all non-expert layers to the GPU (using `-ngl 99` together with `--cpu-moe`, or an `--override-tensor` pattern that keeps the expert tensors on the CPU), prompt processing gets much faster. A few more things that help:

- Always have the fastest GPU with the best PCIe bandwidth as GPU0 for better PP. If necessary, change the device order with the `--device` option (or `CUDA_VISIBLE_DEVICES`).
- You can reduce the KV cache size by quantizing it (`--cache-type-k` / `--cache-type-v`) to free up a bit more VRAM and fit more MoE tensors. How much is up to you (I've seen some go down even to `q4_0`).
- Changing the `--ubatch-size` can also make a big difference for PP.
- If you are bandwidth limited, or using a small ubatch size, some recommend using `--no-op-offload` so the CPU-side expert weights aren't streamed to the GPU for every micro-batch.

A command putting these together is sketched below.
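Putting those pieces together on a single 3090 might look roughly like this. All of the flags are existing llama.cpp options, but the GGUF filename, `-ngl` value, ubatch size and KV-cache types are illustrative assumptions; tune them to what actually fits in 24 GB of VRAM.

```bash
# Rough single-GPU sketch: dense/attention layers on the GPU, expert tensors on
# the CPU, quantized KV cache, and a larger ubatch for faster prompt processing.
# The model filename and the numeric values are placeholders.
./llama-server \
  -m ./Llama-4-Scout-17B-16E-Instruct-Q5_K_M.gguf \
  -ngl 99 --cpu-moe \
  -fa --cache-type-k q8_0 --cache-type-v q8_0 \
  -ub 2048 -b 2048 \
  --main-gpu 0
```

`--cpu-moe` is effectively shorthand for an `--override-tensor` rule that pins the `ffn_*_exps` tensors to the CPU (there is also `--n-cpu-moe N` to keep only some of them there), `-fa` enables flash attention so the V cache can be quantized, and on a multi-GPU box `--device` / `CUDA_VISIBLE_DEVICES` lets you make sure the card with the best PCIe link ends up as GPU0.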
I could get very fast prompt processing with dense models by offloading some number of layers, say at least the first 1/4, to the GPU. For MoE models, however, I'm getting pitiful prompt processing speeds. Often the token eval speed even beats the prompt eval speed!
Hardware: a single RTX 3090, trying to get faster prompt processing on MoE models in the Llama 4 Scout range. It's slow even on a two-GPU setup: ~3 tok/s prompt eval and ~4 tok/s eval with a Q5_K_M quant.
What is the best strategy for improving the prompt eval speed?
How should I allocate tensors? My CPU is not particularly strong, and I have 72 GB RAM.