
Misc. bug: Performance Downgrade happened from b6188, for llama-bin-win-vulkan-x64 distribution. #15618

@MaoJianwei

Description

Name and Version

PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-server.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:57:02.824][info][9232] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:6372
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6123 (79c1160b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6123-bin-win-vulkan-x64>
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-server.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:57:27.364][info][12584] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:19080
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6301 (da54f9f1)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6301-bin-win-vulkan-x64>

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

.\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M    --no-mmap --jinja --verbose-prompt    --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
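
The eval timings below come from a single chat-completion request, which the server log shows as a POST to /v1/chat/completions. The actual prompt is not captured in the logs, so the request body here is only an illustrative placeholder for the shape of the call:

curl.exe http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":256}'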

Problem description & steps to reproduce

Inference performance drops from 35.35 to 28.08 tokens per second.
This happens when the only change is the version, from b6123 to b6301; the model, hardware, and command line stay the same.
I run llama-server directly after unzipping the llama-bxxxx-bin-win-vulkan-x64.zip package downloaded from the GitHub releases page.

I have tried this many times and the result is consistent; it is a significant regression.
Could you please fix it?
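
If it helps with reproducing, the gap should also be measurable with the llama-bench.exe bundled in each release zip; a sketch (untested, flags chosen to mirror the server command above):

PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-bench.exe -m ..\Qwen3-1.7B-gguf-Q4_K_M -ngl 100 -fa 1 -ctk q8_0 -ctv q8_0 -n 256 -r 5
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-bench.exe -m ..\Qwen3-1.7B-gguf-Q4_K_M -ngl 100 -fa 1 -ctk q8_0 -ctv q8_0 -n 256 -r 5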

Thanks!
Mao

PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M    --no-mmap --jinja --verbose-prompt    --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
load_backend: loaded RPC backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:53:34.152][info][16000] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:2836
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 6123 (79c1160b) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
srv    load_model: loading model '..\Qwen3-1.7B-gguf-Q4_K_M'
llama_model_load_from_file_impl: using device Vulkan0 (Quadro P620) - 1962 MiB free
[2025-08-27 22:53:34.254][info][6524] [graphics-hook.cpp:82#init_pipe] [OBS] Failed to open pipe

[2025-08-27 22:53:34.255][info][6524] [graphics-hook.cpp:474#hlogv] [OBS]graphics-hook.dll loaded against process: llama-server.exe
[2025-08-27 22:53:34.255][info][6524] [graphics-hook.cpp:474#hlogv] [OBS](half life scientist) everything..  seems to be in order
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from ..\Qwen3-1.7B-gguf-Q4_K_M (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 1.7B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-1.7...

main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot      release: id  0 | task 0 | stop processing: n_past = 275, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     382.86 ms /    36 tokens (   10.63 ms per token,    94.03 tokens per second)
       eval time =    6789.64 ms /   240 tokens (   28.29 ms per token,    35.35 tokens per second)
      total time =    7172.50 ms /   276 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.36 200
srv    operator(): operator(): cleaning up before exit...
PS F:\llm\llama-b6123-bin-win-vulkan-x64>
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M    --no-mmap --jinja --verbose-prompt    --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
load_backend: loaded RPC backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:51:58.509][info][20772] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:14444
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 6301 (da54f9f1) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
srv    load_model: loading model '..\Qwen3-1.7B-gguf-Q4_K_M'
llama_model_load_from_file_impl: using device Vulkan0 (Quadro P620) - 1962 MiB free
[2025-08-27 22:51:58.617][info][8968] [graphics-hook.cpp:82#init_pipe] [OBS] Failed to open pipe

[2025-08-27 22:51:58.619][info][8968] [graphics-hook.cpp:474#hlogv] [OBS]graphics-hook.dll loaded against process: llama-server.exe
[2025-08-27 22:51:58.619][info][8968] [graphics-hook.cpp:474#hlogv] [OBS](half life scientist) everything..  seems to be in order
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from ..\Qwen3-1.7B-gguf-Q4_K_M (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 1.7B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-1.7...


main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot      release: id  0 | task 0 | stop processing: n_past = 293, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     317.83 ms /    36 tokens (    8.83 ms per token,   113.27 tokens per second)
       eval time =    9186.42 ms /   258 tokens (   35.61 ms per token,    28.08 tokens per second)
      total time =    9504.25 ms /   294 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.36 200
srv    operator(): operator(): cleaning up before exit...
PS F:\llm\llama-b6301-bin-win-vulkan-x64>

First Bad Commit

b6188
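
Note that b6188 is a release tag rather than an individual commit. If useful, the range between the release tags could be narrowed with git bisect and a local Vulkan build; a sketch, assuming a CMake build with -DGGML_VULKAN=ON:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start b6188 b6123          # first bad release tag, last known good release tag
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
# benchmark build\bin\llama-server.exe (or llama-bench.exe) at this revision, then mark it:
git bisect good    # or: git bisect bad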

Relevant log output

Metadata
    Labels

    Vulkan (Issues specific to the Vulkan backend), bug (Something isn't working)
