Is there any full scheme/guide of data and computation flow during pp/tg? #16449
I have a multi-GPU CUDA setup with devices of different compute capabilities, and on top of that, I'm also offloading some tensors to the CPU. I'm struggling to understand which device (GPU or CPU) is actually performing the computation, and which tensors are simply being stored and transferred to another device when needed. I need to understand this in detail to optimize my offloading configuration. Let me explain things more clearly using GLM 4.6 as an example. Each layer involves the following tensor types:
First Case: I'm using the
What happens to the experts in this case during evaluation? Are they always copied to CUDA0 when needed for computation? If so, does this copying happen per token during text generation (TG) and per batch during prefill (PP)?

Second Case: Same setup as above, but instead of using
or
In short, I'm offloading different combinations of expert tensors to the CPU. I understand that I need to offload more than 25 layers in this case, but I'm unclear on the exact impact on TG and PP when offloading these tensors. Will it perform better than the first case?

Third Case: I'm using:
And a large number of tensor overrides that spread the experts across the GPUs (a rough, hypothetical sketch of such a layout is included at the end of this post). Essentially, all non-expert tensors are kept on CUDA0, and the experts are spread out. In this setup, are the other GPUs (besides CUDA0) actually used for computation? Or are they simply acting as data storage, i.e. are the experts copied from those GPUs to CUDA0 during evaluation?

These are just some configurations I've tried, but I'm looking for a deeper understanding of all tensor types and their behavior. For example:
Thanks in advance for helping clarify this. Understanding these details will help me make more informed decisions about optimization.
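For reference, the kind of override layout I mean in the third case looks roughly like this. This is only an illustrative sketch, not my exact command: the model filename and layer ranges are made up, `-ot` / `--override-tensor` takes `<tensor name regex>=<buffer type>` pairs, and the `ffn_*_exps` patterns assume the usual GGUF naming for MoE expert tensors:

```sh
# Offload all layers, keep CUDA0 as the main device for the non-expert tensors,
# then pin the expert tensors of selected layer ranges to other devices
# (more specific layer ranges listed before the catch-all).
./llama-server -m GLM-4.6-Q4_K_M.gguf -ngl 999 --main-gpu 0 \
  -ot "blk\.([0-9]|1[0-9])\.ffn_.*_exps\.=CUDA1" \
  -ot "blk\.(2[0-9]|3[0-9])\.ffn_.*_exps\.=CUDA2" \
  -ot "blk\..*\.ffn_.*_exps\.=CPU"   # remaining experts stay in system RAM
```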
Replies: 1 comment
Generally, prompt processing is offloaded to the GPU, copying the weights if necessary, but generation is done on the device that holds the weights. If you want a detailed list of operations and the devices they are run on, run llama.cpp with the environment variable `GGML_SCHED_DEBUG=2` and enable debug output with `-v`.
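For example, something along these lines (the model path and prompt are just placeholders):

```sh
# GGML_SCHED_DEBUG=2 makes the ggml scheduler print the graph splits and the
# backend (CUDA0, CUDA1, CPU, ...) each operation is assigned to; -v turns on
# the verbose log output so those prints are visible.
GGML_SCHED_DEBUG=2 ./llama-cli -m GLM-4.6-Q4_K_M.gguf -ngl 999 -v -p "test" -n 16
```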