Is there any full scheme/guide of data and computation flow during pp/tg? #16449
I have a multi-GPU CUDA setup with devices of different compute capabilities, and on top of that, I'm also offloading some tensors to the CPU. I'm struggling to understand which device (GPU or CPU) is actually performing the computation, and which tensors are simply being stored and transferred to another device when needed. I need to understand this in detail to optimize my offloading configuration. Let me explain things more clearly using GLM 4.6 as an example. Each layer involves the following tensor types:
First Case: I'm using the
What happens to the experts in this case during evaluation? Are they always copied to CUDA0 when needed for computation? If so, does this copying happen per token during text generation (TG) and per batch during prefill (PP)?

Second Case: Same setup as above, but instead of using
or
In short, I'm offloading different combinations of expert tensors to the CPU. I understand that I need to offload more than 25 layers in this case, but I'm unclear on the exact impact on TG and PP when offloading these tensors. Will it perform better than the first case?

Third Case: I'm using:
And a large number of tensor overrides that spread the experts across the GPUs (a rough, hypothetical sketch of such a layout is included at the end of this post). Essentially, all non-expert tensors are kept on CUDA0, and the experts are spread out. In this setup, are the other GPUs (besides CUDA0) actually used for computation? Or are they simply acting as data storage, i.e. are the experts copied from those GPUs to CUDA0 during evaluation?

These are just some configurations I've tried, but I'm looking for a deeper understanding of all tensor types and their behavior. For example:
Thanks in advance for helping clarify this. Understanding these details will help me make more informed decisions about optimization.
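For reference, the kind of override layout I mean in the third case looks roughly like this. This is only an illustrative sketch, not my exact command: the model filename and layer ranges are made up, `-ot` / `--override-tensor` takes `<tensor name regex>=<buffer type>` pairs, and the `ffn_*_exps` patterns assume the usual GGUF naming for MoE expert tensors:

```sh
# Offload all layers, keep CUDA0 as the main device for the non-expert tensors,
# then pin the expert tensors of selected layer ranges to other devices
# (more specific layer ranges listed before the catch-all).
./llama-server -m GLM-4.6-Q4_K_M.gguf -ngl 999 --main-gpu 0 \
  -ot "blk\.([0-9]|1[0-9])\.ffn_.*_exps\.=CUDA1" \
  -ot "blk\.(2[0-9]|3[0-9])\.ffn_.*_exps\.=CUDA2" \
  -ot "blk\..*\.ffn_.*_exps\.=CPU"   # remaining experts stay in system RAM
```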
Replies: 1 comment
Generally, prompt processing is offloaded to the GPU, copying the weights if necessary, but generation is done on the device that holds the weights. If you want a detailed list of operations and the devices they are run on, run llama.cpp with the environment variable `GGML_SCHED_DEBUG=2` and enable debug output with `-v`.
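For example, something along these lines (the model path and prompt are just placeholders):

```sh
# GGML_SCHED_DEBUG=2 makes the ggml scheduler print the graph splits and the
# backend (CUDA0, CUDA1, CPU, ...) each operation is assigned to; -v turns on
# the verbose log output so those prints are visible.
GGML_SCHED_DEBUG=2 ./llama-cli -m GLM-4.6-Q4_K_M.gguf -ngl 999 -v -p "test" -n 16
```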