Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
 
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32 for op offloading, and there is no way to disable offload_op other than manually setting -ub to 16 or less. It would be great if the user could disable offload_op directly, without reducing -ub.
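For context, the gate in question is the CUDA backend's offload_op callback, which only offers an op for offloading once its batch dimension reaches a hardcoded threshold. A simplified sketch of that check (illustrative only; the exact upstream function name and signature differ between versions):

```cpp
#include "ggml.h"

// Simplified sketch of the CUDA backend's offload_op gate (not the exact
// upstream code): ops are only offered for offloading once the batch
// dimension reaches a hardcoded threshold of 32.
static bool example_cuda_offload_op(const struct ggml_tensor * op) {
    const int min_batch_size = 32;
    // MUL_MAT_ID (MoE expert matmul) carries its batch in ne[2], other ops in ne[1]
    const int64_t batch_size = op->op == GGML_OP_MUL_MAT_ID ? op->ne[2] : op->ne[1];
    return batch_size >= min_batch_size;
}
```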
Motivation
With the introduction of --override-tensor, it has become practical to offload the experts of large MoE models to host DRAM while keeping the dense tensors on a GPU with relatively little VRAM. However, in the current implementation, prompt processing performance is poor in some of these configurations because offload_op is applied.
For example, when running Llama 4 400B with -ot exps=CPU on the master branch, prompt processing is extremely slow at the default -ub 512. Setting -ub 16 bypasses offload_op in the CUDA backend, but the resulting performance is still well below -ub 512 with offload_op disabled in the source code.
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
| model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | ot | mmap | test | t/s | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 93.68 ± 0.70 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.68 ± 0.07 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 23.14 ± 0.01 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.61 ± 0.16 | 
With offload_op changed to always return false in the CUDA backend, prompt processing is roughly 10x faster.
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | test | t/s | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 233.66 ± 1.31 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.91 ± 0.10 | 
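For reference, the measurement above used a local hack along the lines of the following sketch (illustrative only; the real CUDA backend callback has a different name and signature):

```cpp
// Local experiment only: never request op offloading, so weights kept on the
// CPU via -ot stay there even with large micro-batches.
static bool example_cuda_offload_op(const struct ggml_tensor * op) {
    (void) op; // batch size no longer matters
    return false;
}
```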
Possible Implementation
In ggml-backend.cpp, add an option and the corresponding check to the following offload_op call:
// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
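A minimal sketch of how that check could be gated, assuming a new boolean field on the scheduler (the field name and the way it is exposed to the user are hypothetical, not existing llama.cpp options):

```cpp
// Hypothetical sched->op_offload_enabled flag: when the user disables it,
// the scheduler never asks higher-priority backends to take over
// host-resident ops, regardless of the micro-batch size.

// check if a backend with higher prio wants to offload the op
if (sched->op_offload_enabled &&
    src_backend_id == sched->n_backends - 1 &&
    ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```

The flag could default to true to preserve current behavior, and be set before graph scheduling from a new command-line option or an environment variable (exact plumbing to be decided).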