vulkan: Skip syncing for prealloc_y when it is reused #15544

jeffbolznv · 2025-08-24T17:43:55Z

Primarily helps bigger GPUs. I think this helps some non-MoE models as well. Probably only helps the coopmat2 path which converts Y to fp16.

5090 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      6293.79 ± 64.66 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      8554.05 ± 90.98 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     7622.73 ± 169.20 |

5090 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      6642.64 ± 69.59 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |     8963.25 ± 168.48 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     7637.93 ± 178.80 |

4070 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2250.45 ± 21.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      2721.79 ± 18.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      2991.03 ± 29.67 |

4070 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2277.99 ± 21.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      2738.75 ± 14.72 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      3001.78 ± 28.78 |

0cc4m · 2025-08-29T06:29:32Z

Sorry about the delay, I'm pretty busy again this week. I'll take care of the open PRs this weekend.

vulkan: Skip syncing for prealloc_y when it is reused

be61fa7

jeffbolznv requested a review from 0cc4m as a code owner August 24, 2025 17:43

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Aug 24, 2025

jeffbolznv mentioned this pull request Aug 24, 2025

Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants #14903

Merged

0cc4m approved these changes Aug 30, 2025

View reviewed changes

0cc4m merged commit 696fccf into ggml-org:master Aug 30, 2025
48 checks passed

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request Aug 30, 2025

vulkan: Skip syncing for prealloc_y when it is reused (ggml-org#15544)

ad5f187

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025

vulkan: Skip syncing for prealloc_y when it is reused (ggml-org#15544)

a5d75e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

vulkan: Skip syncing for prealloc_y when it is reused #15544

vulkan: Skip syncing for prealloc_y when it is reused #15544

Uh oh!

jeffbolznv commented Aug 24, 2025

Uh oh!

0cc4m commented Aug 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vulkan: Skip syncing for prealloc_y when it is reused #15544

vulkan: Skip syncing for prealloc_y when it is reused #15544

Uh oh!

Conversation

jeffbolznv commented Aug 24, 2025

Uh oh!

0cc4m commented Aug 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants