
Misc. bug: Performance Downgrade happened from b6188, for llama-bin-win-vulkan-x64 distribution. #15618

@MaoJianwei

Description

Name and Version

PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-server.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:57:02.824][info][9232] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:6372
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6123 (79c1160b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6123-bin-win-vulkan-x64>
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-server.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:57:27.364][info][12584] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:19080
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6301 (da54f9f1)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6301-bin-win-vulkan-x64>

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

.\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M    --no-mmap --jinja --verbose-prompt    --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
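
The eval timings below come from a single chat-completion request, which the server log shows as a POST to /v1/chat/completions. The actual prompt is not captured in the logs, so the request body here is only an illustrative placeholder for the shape of the call:

curl.exe http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":256}'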

Problem description & steps to reproduce

Inference performance drops from 35.35 to 28.08 tokens per second.
This happens when the only change is the version, from b6123 to b6301; the model, hardware, and command line stay the same.
I run llama-server directly after unzipping the llama-bxxxx-bin-win-vulkan-x64.zip package downloaded from the GitHub releases page.

I have tried this many times and the result is consistent; it is a significant regression.
Could you please fix it?
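
If it helps with reproducing, the gap should also be measurable with the llama-bench.exe bundled in each release zip; a sketch (untested, flags chosen to mirror the server command above):

PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-bench.exe -m ..\Qwen3-1.7B-gguf-Q4_K_M -ngl 100 -fa 1 -ctk q8_0 -ctv q8_0 -n 256 -r 5
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-bench.exe -m ..\Qwen3-1.7B-gguf-Q4_K_M -ngl 100 -fa 1 -ctk q8_0 -ctv q8_0 -n 256 -r 5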

Thanks!
Mao

PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M    --no-mmap --jinja --verbose-prompt    --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
load_backend: loaded RPC backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:53:34.152][info][16000] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:2836
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 6123 (79c1160b) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
srv    load_model: loading model '..\Qwen3-1.7B-gguf-Q4_K_M'
llama_model_load_from_file_impl: using device Vulkan0 (Quadro P620) - 1962 MiB free
[2025-08-27 22:53:34.254][info][6524] [graphics-hook.cpp:82#init_pipe] [OBS] Failed to open pipe

[2025-08-27 22:53:34.255][info][6524] [graphics-hook.cpp:474#hlogv] [OBS]graphics-hook.dll loaded against process: llama-server.exe
[2025-08-27 22:53:34.255][info][6524] [graphics-hook.cpp:474#hlogv] [OBS](half life scientist) everything..  seems to be in order
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from ..\Qwen3-1.7B-gguf-Q4_K_M (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 1.7B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-1.7...

main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot      release: id  0 | task 0 | stop processing: n_past = 275, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     382.86 ms /    36 tokens (   10.63 ms per token,    94.03 tokens per second)
       eval time =    6789.64 ms /   240 tokens (   28.29 ms per token,    35.35 tokens per second)
      total time =    7172.50 ms /   276 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.36 200
srv    operator(): operator(): cleaning up before exit...
PS F:\llm\llama-b6123-bin-win-vulkan-x64>
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M    --no-mmap --jinja --verbose-prompt    --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
load_backend: loaded RPC backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:51:58.509][info][20772] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:14444
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 6301 (da54f9f1) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
srv    load_model: loading model '..\Qwen3-1.7B-gguf-Q4_K_M'
llama_model_load_from_file_impl: using device Vulkan0 (Quadro P620) - 1962 MiB free
[2025-08-27 22:51:58.617][info][8968] [graphics-hook.cpp:82#init_pipe] [OBS] Failed to open pipe

[2025-08-27 22:51:58.619][info][8968] [graphics-hook.cpp:474#hlogv] [OBS]graphics-hook.dll loaded against process: llama-server.exe
[2025-08-27 22:51:58.619][info][8968] [graphics-hook.cpp:474#hlogv] [OBS](half life scientist) everything..  seems to be in order
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from ..\Qwen3-1.7B-gguf-Q4_K_M (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 1.7B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-1.7...


main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot      release: id  0 | task 0 | stop processing: n_past = 293, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     317.83 ms /    36 tokens (    8.83 ms per token,   113.27 tokens per second)
       eval time =    9186.42 ms /   258 tokens (   35.61 ms per token,    28.08 tokens per second)
      total time =    9504.25 ms /   294 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.36 200
srv    operator(): operator(): cleaning up before exit...
PS F:\llm\llama-b6301-bin-win-vulkan-x64>

First Bad Commit

b6188
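
Note that b6188 is a release tag rather than an individual commit. If useful, the range between the release tags could be narrowed with git bisect and a local Vulkan build; a sketch, assuming a CMake build with -DGGML_VULKAN=ON:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start b6188 b6123          # first bad release tag, last known good release tag
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
# benchmark build\bin\llama-server.exe (or llama-bench.exe) at this revision, then mark it:
git bisect good    # or: git bisect bad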

Relevant log output

Metadata
    Labels

    Vulkan (Issues specific to the Vulkan backend), bug (Something isn't working)
