Closed
Labels: Vulkan (Issues specific to the Vulkan backend), bug (Something isn't working)
Description
Name and Version
PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-server.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:57:02.824][info][9232] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:6372
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6123 (79c1160b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6123-bin-win-vulkan-x64>
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-server.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:57:27.364][info][12584] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:19080
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6301 (da54f9f1)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6301-bin-win-vulkan-x64>
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
.\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
Problem description & steps to reproduce
Inference performance drops from 35.35 to 28.08 tokens per second.
This happens when I simply switch the version from b6123 to b6301; nothing else changes.
I run llama-server after unzipping the llama-bxxxx-bin-win-vulkan-x64.zip package downloaded from the GitHub releases page.
I have tried many times and the result is consistent.
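For anyone trying to reproduce this, the timing lines quoted below come from a single chat request against the server. A request along these lines is enough to show the gap (the prompt and max_tokens here are only examples, not the exact request I used; the tokens-per-second number is read from the server's "eval time" log line):

# send one chat request to the running llama-server (port 8080, per the logs below)
$body = @{
    model      = "Qwen3-1.7B"                                          # example model name
    messages   = @(@{ role = "user"; content = "Write a short poem about GPUs." })
    max_tokens = 256
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" `
    -Method Post -ContentType "application/json" -Body $body

# then check the server console for the "eval time = ... tokens per second" line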
Could you please fix it?
Thanks!
Mao
PS F:\llm\llama-b6123-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
load_backend: loaded RPC backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:53:34.152][info][16000] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:2836
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6123-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 6123 (79c1160b) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
srv load_model: loading model '..\Qwen3-1.7B-gguf-Q4_K_M'
llama_model_load_from_file_impl: using device Vulkan0 (Quadro P620) - 1962 MiB free
[2025-08-27 22:53:34.254][info][6524] [graphics-hook.cpp:82#init_pipe] [OBS] Failed to open pipe
[2025-08-27 22:53:34.255][info][6524] [graphics-hook.cpp:474#hlogv] [OBS]graphics-hook.dll loaded against process: llama-server.exe
[2025-08-27 22:53:34.255][info][6524] [graphics-hook.cpp:474#hlogv] [OBS](half life scientist) everything.. seems to be in order
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from ..\Qwen3-1.7B-gguf-Q4_K_M (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 1.7B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 1.7B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 1.7B Base
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-1.7...
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot release: id 0 | task 0 | stop processing: n_past = 275, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 382.86 ms / 36 tokens ( 10.63 ms per token, 94.03 tokens per second)
eval time = 6789.64 ms / 240 tokens ( 28.29 ms per token, 35.35 tokens per second)
total time = 7172.50 ms / 276 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.36 200
srv operator(): operator(): cleaning up before exit...
PS F:\llm\llama-b6123-bin-win-vulkan-x64>
PS F:\llm\llama-b6301-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen3-1.7B-gguf-Q4_K_M --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -ngl 100 --metrics
load_backend: loaded RPC backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-rpc.dll
[2025-08-27 22:51:58.509][info][20772] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-server.exe, pid:14444
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6301-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 6301 (da54f9f1) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
srv load_model: loading model '..\Qwen3-1.7B-gguf-Q4_K_M'
llama_model_load_from_file_impl: using device Vulkan0 (Quadro P620) - 1962 MiB free
[2025-08-27 22:51:58.617][info][8968] [graphics-hook.cpp:82#init_pipe] [OBS] Failed to open pipe
[2025-08-27 22:51:58.619][info][8968] [graphics-hook.cpp:474#hlogv] [OBS]graphics-hook.dll loaded against process: llama-server.exe
[2025-08-27 22:51:58.619][info][8968] [graphics-hook.cpp:474#hlogv] [OBS](half life scientist) everything.. seems to be in order
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from ..\Qwen3-1.7B-gguf-Q4_K_M (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 1.7B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 1.7B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 1.7B Base
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-1.7...
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot release: id 0 | task 0 | stop processing: n_past = 293, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 317.83 ms / 36 tokens ( 8.83 ms per token, 113.27 tokens per second)
eval time = 9186.42 ms / 258 tokens ( 35.61 ms per token, 28.08 tokens per second)
total time = 9504.25 ms / 294 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.36 200
srv operator(): operator(): cleaning up before exit...
PS F:\llm\llama-b6301-bin-win-vulkan-x64>
First Bad Commit
b6188
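b6188 is the first release build where I see the slowdown; b6123 is still fast. If it helps, the exact commit could probably be narrowed down by bisecting between those release tags and re-running the benchmark above at each step (a rough sketch, assuming the usual bNNNN tags on the llama.cpp repository and a local Vulkan build):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start
git bisect bad  b6188    # first release showing the slowdown
git bisect good b6123    # last known-fast release tested here
# rebuild the Vulkan backend at each step, re-run the benchmark,
# then mark the commit with: git bisect good   or   git bisect bad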
Relevant log output
Full server logs for both builds (b6123 and b6301) are included in the problem description above.