Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
version: 6906 (0de0a01)
built with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
Operating systems
Linux
GGML backends
CUDA
Hardware
Fedora 42
AMD Ryzen 9 9900X
192GB RAM
RTX 5090 x2
RTX 4090 x2
RTX A6000
A40
Models
DeepSeek-V3-0324
DeepSeek-R1-0528
DeepSeek-V3.1
DeepSeek-V3.1-Terminus
Problem description & steps to reproduce
I build llama.cpp with:
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_BLAS=OFF \
-DGGML_RPC=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
-DGGML_MAX_CONTEXTS=2048
When running DeepSeek V3 0324 / R1 0528 / V3.1 models with offloading, on commit 5d195f1, with:
LLAMA_SET_ROWS=1 ./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 256
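For clarity on what these overrides do (a toy sketch of the idea only, not the actual --override-tensor code in llama.cpp, and first-match-wins ordering is my assumption based on how the buffer sizes come out): each -ot argument is a regex=device pair matched against tensor names, so the per-block ffn patterns pin the expert tensors of the listed blocks to specific GPUs, and the final exps=CPU catches the expert tensors of all remaining blocks.

```cpp
// Toy illustration of the -ot "regex=device" overrides above.
// NOT the llama.cpp implementation; first-match-wins order is an assumption.
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Patterns in the same order as on the command line (subset shown).
    const std::vector<std::pair<std::regex, std::string>> overrides = {
        {std::regex("blk\\.(0|1|2|3|4|5|6|7)\\.ffn."), "CUDA0"},
        {std::regex("blk\\.(8|9|10|11)\\.ffn."),       "CUDA1"},
        {std::regex("exps"),                           "CPU"},
    };

    for (const std::string name : {"blk.3.ffn_gate_exps.weight",   // early block -> CUDA0
                                   "blk.45.ffn_up_exps.weight"}) { // late block  -> CPU
        std::string device = "default split";
        for (const auto & [re, dev] : overrides) {
            if (std::regex_search(name, re)) { device = dev; break; }
        }
        std::cout << name << " -> " << device << "\n";
    }
    return 0;
}
```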
When loading, it looks like this:
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
As you can see, the CPU model buffer is listed just after the GPU (CUDA) buffers and just before CUDA_Host.
This nets me these speeds:
prompt eval time = 17797.43 ms / 4373 tokens ( 4.07 ms per token, 245.71 tokens per second)
eval time = 42683.82 ms / 453 tokens ( 94.22 ms per token, 10.61 tokens per second)
total time = 60481.25 ms / 4826 tokens
A variant of this that I got while testing, and which also works fine, is:
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
load_tensors: CPU model buffer size = 497.11 MiB
Meanwhile, after commit 5d195f1 (I'm still not sure whether the exact next commit is the one causing the issue), it looks like this:
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
This nets me these speeds:
prompt eval time = 49380.49 ms / 4373 tokens ( 11.29 ms per token, 88.56 tokens per second)
eval time = 50832.32 ms / 542 tokens ( 93.79 ms per token, 10.66 tokens per second)
I deleted the ccache and did not use it for any of these builds, to avoid introducing extra issues.
For reference, ik_llama.cpp handles this as described in ikawrakow/ik_llama.cpp#405, with this explanation:
When part of the tensors are stored in RAM but there are faster back-ends available (GPU), the scheduler needs to decide if to offload the data for a given op to a faster back-end or to compute the op on the CPU. This is currently done via a simple heuristics where only matrix multiplications (GGML_MUL_MAT and GGML_MUL_MAT_ID) are offloaded if the batch size is larger than some threshold (currently 32). When fmoe is enabled, the fused (ffn_up*X)*unary(ffn_gate*X) op is never uploaded. In contrast, in mainline llama.cpp matrix multiplications are always offloaded when the batch size is >= 32. The result of this is that when the batch size becomes large enough, llama.cpp will outperform ik_llama.cpp in prompt processing speed. As "large enough" depends on many factors (size of tensors that need to be uploaded, speed of the PCI-E bus to the GPU, relative speed of the GPU vs the CPU), it is hard to devise a better offload policy that automatically takes the best decision.
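As a minimal sketch of the heuristic described in that quote (my own simplification with assumed names, not the actual ggml scheduler code): ops whose weights sit in host RAM are only shipped to a faster GPU backend when they are matrix multiplications and the batch is large enough.

```cpp
// Simplified sketch of the offload policy quoted above.
// NOT the ggml/llama.cpp scheduler; names and structure are assumptions,
// only the ">= 32 batch" threshold comes from the quoted explanation.
#include <cstdint>
#include <string>

struct op_desc {
    std::string type;                // e.g. "MUL_MAT", "MUL_MAT_ID", "ADD", ...
    int64_t     n_tokens;            // batch (ubatch) size seen by this op
    bool        weights_in_host_ram; // true for tensors kept on CPU via -ot "exps=CPU"
};

constexpr int64_t OFFLOAD_MIN_BATCH = 32;

// true  -> copy the weights to a faster (GPU) backend and compute there
// false -> compute on the CPU, where the weights already live
bool should_offload(const op_desc & op) {
    if (!op.weights_in_host_ram) {
        return true; // already resident on a GPU backend
    }
    const bool is_matmul = op.type == "MUL_MAT" || op.type == "MUL_MAT_ID";
    return is_matmul && op.n_tokens >= OFFLOAD_MIN_BATCH;
}
```

With -ub 2560 the batch during prompt processing is far above that threshold, so I would expect the expert matmuls to be offloaded in both builds.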
So it seems that, for some reason, some matrix multiplications are now done on the CPU instead of on the main CUDA device (CUDA0)?
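To make that suspicion concrete, here is a toy sketch (purely an assumption about how an ordering change could matter, not ggml code): if op placement were done by taking the first backend in a list that can handle the op, then moving the CPU entry ahead of CUDA0 in that list would flip where those matmuls run.

```cpp
// Toy "first match wins" placement (assumption for illustration, not ggml code):
// the only point is that list order decides the outcome when several
// backends could handle the same op.
#include <iostream>
#include <string>
#include <vector>

static std::string place_op(const std::vector<std::string> & backends) {
    // pretend every backend listed can run the op; the first one wins
    return backends.empty() ? "none" : backends.front();
}

int main() {
    const std::vector<std::string> before = {"CUDA0", "CUDA1", "CPU"}; // old listing order
    const std::vector<std::string> after  = {"CPU", "CUDA0", "CUDA1"}; // new listing order

    std::cout << "before: op runs on " << place_op(before) << "\n"; // CUDA0
    std::cout << "after:  op runs on " << place_op(after)  << "\n"; // CPU
    return 0;
}
```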
First Bad Commit
I'm not sure exactly where it started, but commit 5d195f1 works fine.
Relevant log output
./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 2560 --cache-ram 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
build: 6906 (0de0a0157) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:02:00.0) - 23686 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 4090) (0000:17:00.0) - 23675 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA RTX A6000) (0000:0d:00.0) - 48268 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA A40) (0000:06:00.0) - 48268 MiB free
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-V3-0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = Deepseek-V3-0324
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 256x20B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = DeepSeek V3 0324
llama_model_loader: - kv 11: general.base_model.0.version str = V3-0324
llama_model_loader: - kv 12: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 14: general.tags arr[str,4] = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv 15: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 16: deepseek2.block_count u32 = 61
llama_model_loader: - kv 17: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 18: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 19: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 20: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 21: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 22: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 23: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 24: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 25: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 27: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 28: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 29: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 30: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 31: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 32: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 33: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 34: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 35: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 36: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 37: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 38: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 39: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 40: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 41: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 42: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 43: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 44: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 45: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 46: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 47: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 48: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 49: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 50: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 51: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 52: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 53: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 54: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 55: general.quantization_version u32 = 2
llama_model_loader: - kv 56: general.file_type u32 = 12
llama_model_loader: - kv 57: quantize.imatrix.file str = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv 58: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv 59: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 60: quantize.imatrix.chunks_count i32 = 60
llama_model_loader: - kv 61: split.no u16 = 0
llama_model_loader: - kv 62: split.tensors.count i32 = 1086
llama_model_loader: - kv 63: split.count u16 = 0
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q3_K: 173 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q5_K: 29 tensors
llama_model_loader: - type q6_K: 16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 275.91 GiB (3.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 1 ('<|end▁of▁sentence|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 128
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = Deepseek-V3-0324
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 192
print_info: n_embd_head_v_mla = 128
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 2 '<|▁pad▁|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2560
llama_context: n_ubatch = 2560
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (32768) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: CUDA0 KV buffer size = 680.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 476.00 MiB
llama_kv_cache: CUDA2 KV buffer size = 476.00 MiB
llama_kv_cache: CUDA3 KV buffer size = 680.00 MiB
llama_kv_cache: CUDA4 KV buffer size = 952.00 MiB
llama_kv_cache: CUDA5 KV buffer size = 884.00 MiB
llama_kv_cache: size = 4148.00 MiB ( 32768 cells, 61 layers, 1/1 seqs), K (f16): 2196.00 MiB, V (f16): 1952.00 MiB
llama_context: CUDA0 compute buffer size = 3628.50 MiB
llama_context: CUDA1 compute buffer size = 2052.63 MiB
llama_context: CUDA2 compute buffer size = 1995.05 MiB
llama_context: CUDA3 compute buffer size = 1995.05 MiB
llama_context: CUDA4 compute buffer size = 4848.52 MiB
llama_context: CUDA5 compute buffer size = 4848.53 MiB
llama_context: CUDA_Host compute buffer size = 390.07 MiB
llama_context: graph nodes = 4843
llama_context: graph splits = 206 (with bs=2560), 154 (with bs=1)
common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '
' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content'] + '<|Assistant|>'}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{%- endif %}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- set ns.is_output_first = true %}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '' + '
' + tool['function']['arguments'] + '
' + '' + '<|tool▁call▁end|>'}}{%- else %}{{message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '' + '
' + tool['function']['arguments'] + '
' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'
' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '' + '
' + tool['function']['arguments'] + '
' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none)%}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'
<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /v1/models 127.0.0.1 200
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
common_sampler_types_from_names: unable to match sampler by name 'tfs_z'
common_sampler_types_from_names: unable to match sampler by name 'typical_p'
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 4373
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 2560, batch.n_tokens = 2560, progress = 0.585410
slot update_slots: id 0 | task 0 | n_tokens = 2560, memory_seq_rm [2560, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4373, batch.n_tokens = 1813, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 4373, batch.n_tokens = 1813
slot print_timing: id 0 | task 0 |
prompt eval time = 49380.49 ms / 4373 tokens ( 11.29 ms per token, 88.56 tokens per second)
eval time = 50832.32 ms / 542 tokens ( 93.79 ms per token, 10.66 tokens per second)
total time = 100212.80 ms / 4915 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 4914, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /completion 127.0.0.1 200
srv log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 5090) | 32109 = 1426 + ( 29671 = 25363 + 680 + 3628) + 1010 |
llama_memory_breakdown_print: | - CUDA1 (RTX 4090) | 24080 = 806 + ( 22369 = 19841 + 476 + 2052) + 905 |
llama_memory_breakdown_print: | - CUDA2 (RTX 4090) | 24077 = 851 + ( 22313 = 19842 + 476 + 1995) + 913 |
llama_memory_breakdown_print: | - CUDA3 (RTX 5090) | 32109 = 4034 + ( 27032 = 24357 + 680 + 1995) + 1042 |
llama_memory_breakdown_print: | - CUDA4 (RTX A6000) | 48539 = 2498 + ( 40290 = 34490 + 952 + 4848) + 5749 |
llama_memory_breakdown_print: | - CUDA5 (A40) | 48539 = 1418 + ( 41372 = 35639 + 884 + 4848) + 5748 |
llama_memory_breakdown_print: | - Host | 123387 = 122997 + 0 + 390 |