Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
 
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32 for op offloading, and there is no way to disable offload_op other than manually setting -ub to 16 or less. It would be great if the user could disable offload_op directly, without reducing -ub.
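For context, the gate in question is the CUDA backend's offload_op callback, which only offers an op for offloading once its batch dimension reaches a hardcoded threshold. A simplified sketch of that check (illustrative only; the exact upstream function name and signature differ between versions):

```cpp
#include "ggml.h"

// Simplified sketch of the CUDA backend's offload_op gate (not the exact
// upstream code): ops are only offered for offloading once the batch
// dimension reaches a hardcoded threshold of 32.
static bool example_cuda_offload_op(const struct ggml_tensor * op) {
    const int min_batch_size = 32;
    // MUL_MAT_ID (MoE expert matmul) carries its batch in ne[2], other ops in ne[1]
    const int64_t batch_size = op->op == GGML_OP_MUL_MAT_ID ? op->ne[2] : op->ne[1];
    return batch_size >= min_batch_size;
}
```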
Motivation
With the introduction of --override-tensor, it has become practical to offload the experts of large MoE models to host DRAM while keeping the dense tensors on a GPU with relatively little VRAM. However, in the current implementation, prompt processing performance is poor in some of these configurations because offload_op is applied.
For example, when running Llama 4 400B with -ot exps=CPU on the master branch, prompt processing is extremely slow at the default -ub 512. Setting -ub 16 bypasses offload_op in the CUDA backend, but the resulting performance is still well below -ub 512 with offload_op disabled in the source code.
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
| model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | ot | mmap | test | t/s | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 93.68 ± 0.70 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.68 ± 0.07 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 23.14 ± 0.01 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.61 ± 0.16 | 
With offload_op changed to always return false in the CUDA backend, prompt processing is roughly 10x faster.
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | test | t/s | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 233.66 ± 1.31 | 
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.91 ± 0.10 | 
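For reference, the measurement above used a local hack along the lines of the following sketch (illustrative only; the real CUDA backend callback has a different name and signature):

```cpp
// Local experiment only: never request op offloading, so weights kept on the
// CPU via -ot stay there even with large micro-batches.
static bool example_cuda_offload_op(const struct ggml_tensor * op) {
    (void) op; // batch size no longer matters
    return false;
}
```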
Possible Implementation
In ggml-backend.cpp, add an option and the corresponding check to the following offload_op call:
// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
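A minimal sketch of how that check could be gated, assuming a new boolean field on the scheduler (the field name and the way it is exposed to the user are hypothetical, not existing llama.cpp options):

```cpp
// Hypothetical sched->op_offload_enabled flag: when the user disables it,
// the scheduler never asks higher-priority backends to take over
// host-resident ops, regardless of the micro-batch size.

// check if a backend with higher prio wants to offload the op
if (sched->op_offload_enabled &&
    src_backend_id == sched->n_backends - 1 &&
    ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```

The flag could default to true to preserve current behavior, and be set before graph scheduling from a new command-line option or an environment variable (exact plumbing to be decided).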