vulkan: Implement topk_moe fused shader, ported from CUDA #16641

jeffbolznv · 2025-10-17T20:14:12Z

This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.
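For readers unfamiliar with the operation, here is a rough CPU sketch of what a fused top-k MoE routing step computes and how it can be organised without shared memory. It assumes the fused op is a softmax over the expert logits followed by selecting the k largest probabilities. The function names, the strided lane-to-expert layout, and the tie-breaking rule are illustrative assumptions, not taken from the shader in this PR.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one row of expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_moe_reference(logits, k):
    # Plain reference: softmax, then the k largest probabilities
    # (ties broken by lower expert index; this rule is an assumption).
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    return [(i, probs[i]) for i in order]

def topk_moe_lanes(logits, k, subgroup_size):
    # Hypothetical lane-based layout: lane L keeps experts
    # L, L + subgroup_size, ... in its own "registers" (a dict here),
    # so no shared scratch buffer is needed. Requires k <= len(logits).
    n = len(logits)
    probs = softmax(logits)  # computed whole here for brevity
    regs = [{i: probs[i] for i in range(lane, n, subgroup_size)}
            for lane in range(subgroup_size)]
    picked = []
    for _ in range(k):
        # Step 1: every lane finds the best value it still holds.
        candidates = []
        for lane_regs in regs:
            if lane_regs:
                i = min(lane_regs, key=lambda j: (-lane_regs[j], j))
                candidates.append((lane_regs[i], i))
        # Step 2: stand-in for a cross-lane (subgroup) max reduction.
        best_p, best_i = min(candidates, key=lambda c: (-c[0], c[1]))
        picked.append((best_i, best_p))
        # Step 3: the owning lane drops the winner before the next round.
        del regs[best_i % subgroup_size][best_i]
    return picked

logits = [0.1, 2.0, -1.0, 3.0, 0.5, 2.5, 0.0, 1.0]
for size in (4, 8, 32, 64):
    assert topk_moe_lanes(logits, 3, size) == topk_moe_reference(logits, 3)
```

Because each lane's share of the experts depends only on `subgroup_size`, the same loop works whether the subgroup is narrower or wider than the number of experts. Lanes with nothing to hold simply sit out of the reduction.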

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       260.12 ± 24.04 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        328.18 ± 9.83 |

build: 66b0dbcb2 (6791)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        285.47 ± 7.80 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       339.16 ± 16.07 |

build: e0f7fa913 (6792)

0cc4m

LGTM

vulkan: Implement topk_moe fused shader, ported from CUDA

e0f7fa9

This is similar to the CUDA shader from ggml-org#16130, but doesn't use shared memory and handles different subgroup sizes.

jeffbolznv requested review from 0cc4m, ggerganov and slaren as code owners October 17, 2025 20:14

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 17, 2025

jeffbolznv mentioned this pull request Oct 17, 2025

CUDA: add a fused top-K MoE kernel #16130

Merged

am17an added a commit to am17an/llama.cpp that referenced this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe

9891b4f

Uses the technique used in the vulkan PR ggml-org#16641. Neat trick!

am17an added a commit to am17an/llama.cpp that referenced this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe

06cd6bd

Uses the technique used in the vulkan PR ggml-org#16641. Neat trick!

am17an mentioned this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe #16647

Merged

JohannesGaessler pushed a commit that referenced this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe (#16647)

38355c6

Uses the technique used in the vulkan PR #16641. Neat trick!

0cc4m approved these changes Oct 18, 2025

0cc4m merged commit e56abd2 into ggml-org:master Oct 18, 2025
69 of 70 checks passed
