
Conversation

jeffbolznv (Collaborator):

I'll leave some inline comments motivating some of the changes.

before:

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |     7110.83 ± 689.95 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        193.79 ± 1.00 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |      3862.00 ± 19.50 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        135.66 ± 0.80 |
```

after:

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |     9851.65 ± 131.68 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        193.69 ± 1.00 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |      4488.07 ± 63.72 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        136.61 ± 0.88 |
```

CC @giuseppe

jeffbolznv requested a review from 0cc4m as a code owner (October 18, 2025 02:55). The github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels (Oct 18, 2025).

```diff
-    if (k < SPLIT_H * D_STATE && (k + (w >> 1)) < SPLIT_H * D_STATE) {
-        stateC[k] += stateC[k + (w >> 1)];
+    [[unroll]]
+    for (uint w = D_STATE / 2; w >= SUBGROUP_SIZE; w >>= 1) {
```

jeffbolznv (Collaborator, Author) commented:

Shifted the values of `w` down by a factor of 2, rather than using `w >> 1` everywhere; the latter seemed to be causing some weird code to be generated.
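
For readers following along, here is a minimal C++ sketch of why the rewrite is equivalent. This is not the PR's shader code; the old loop header is reconstructed from the new one and the comment above, so treat it as an assumption. Hoisting the shift into the loop header leaves the addressing expressions as plain uses of `w`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the shader's compile-time constants.
constexpr unsigned D_STATE = 128;
constexpr unsigned SUBGROUP_SIZE = 32;

int main() {
    std::vector<unsigned> old_offsets, new_offsets;

    // Old form (assumed): w starts at D_STATE and every use shifts it down.
    for (unsigned w = D_STATE; w > SUBGROUP_SIZE; w >>= 1) {
        old_offsets.push_back(w >> 1);
    }

    // New form: w is shifted down once, in the loop header.
    for (unsigned w = D_STATE / 2; w >= SUBGROUP_SIZE; w >>= 1) {
        new_offsets.push_back(w);
    }

    // Both visit the same reduction offsets (64, 32 for these constants),
    // but the new form has no extra shift in the addressing expressions.
    assert(old_offsets == new_offsets);
    return 0;
}
```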

```glsl
        barrier();
    }

    [[unroll]] for (uint j = 0; j <= SPLIT_H / (D_STATE / SUBGROUP_SIZE); j++) {
```

jeffbolznv (Collaborator, Author) commented:

The `<=` bound ran one too many iterations most of the time, leading to extra subgroup ops or barriers. But when `D_STATE / SUBGROUP_SIZE` is greater than `SPLIT_H`, we still need at least one iteration.

giuseppe (Contributor) replied:

Thanks for fixing this one!
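
For intuition, here is a small C++ sketch of the iteration-count issue. The constants are hypothetical, and clamping with `max(..., 1)` is just one way to express the tight bound; the PR's actual expression may differ.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical compile-time constants.
constexpr unsigned SPLIT_H = 4;
constexpr unsigned D_STATE = 128;
constexpr unsigned SUBGROUP_SIZE = 32;

int main() {
    constexpr unsigned quotient = SPLIT_H / (D_STATE / SUBGROUP_SIZE);

    // `j <= quotient` executes quotient + 1 iterations: one too many when
    // quotient >= 1, but exactly the single required iteration when the
    // integer division truncates to zero (D_STATE / SUBGROUP_SIZE > SPLIT_H).
    unsigned iters_le = quotient + 1;

    // Tight bound: clamp to at least one iteration.
    unsigned iters_clamped = std::max(quotient, 1u);

    std::printf("<= bound: %u iterations, clamped bound: %u iterations\n",
                iters_le, iters_clamped);
    return 0;
}
```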

```diff
-    if (idx + offset < SPLIT_H * D_STATE) {
-        stateC[idx] += stateC[idx + offset];
+    if (idx < SPLIT_H * D_STATE ||
+        max_idx < SPLIT_H * D_STATE) {
```

jeffbolznv (Collaborator, Author) commented:

This `max_idx` comparison should fold away at compile time and avoid the need for the branch most of the time.

giuseppe (Contributor) replied:

Just curious, is this needed to get the subgroup to run the same code?
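
A rough C++ illustration of the constant-folding argument (names and values are hypothetical stand-ins; in the shader these would be compile-time constants):

```cpp
// Hypothetical compile-time constants mirroring the shader's.
constexpr unsigned SPLIT_H = 2;
constexpr unsigned D_STATE = 128;

// Assumed: the largest idx any invocation can compute, known at compile time.
constexpr unsigned max_idx = 255;

void reduce_step(float* stateC, unsigned idx, unsigned offset) {
    // When max_idx < SPLIT_H * D_STATE holds as a compile-time fact, the
    // whole condition folds to `true` and the branch disappears; otherwise
    // only the per-invocation `idx` check remains.
    if (idx < SPLIT_H * D_STATE ||
        max_idx < SPLIT_H * D_STATE) {
        stateC[idx] += stateC[idx + offset];
    }
}
```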

giuseppe (Contributor) left a review:

LGTM

I had started working on the subgroupAdd optimization, but I got stuck trying to get it to work on the Intel Arc GPU (I'm not yet sure why it behaved differently from an NVIDIA GPU). I even suspected there might be an issue with the driver. Your version works well there too, so the issue was definitely in my version. So thanks for beating me to it :-) I will take the chance to compare the two versions and understand what I was doing wrong.

0cc4m (Collaborator) commented on Oct 18, 2025:

@giuseppe The usual subgroup issue with Intel would be about subgroup size (which can be between 8 and 32 on Intel) and forcing full subgroups. You probably need to do the latter to get the subgroup functions to work, and if you expect a specific size in the shader, force it. For performance on Intel, 16 has been best in my experience.
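
For anyone hitting the same issue, here is a minimal host-side sketch of what 0cc4m describes, using the Vulkan 1.3 core subgroup-size-control structures (`EXT`-suffixed equivalents from VK_EXT_subgroup_size_control exist on older API versions). The shader module and the rest of pipeline creation are assumed to exist elsewhere.

```cpp
#include <vulkan/vulkan.h>

// Sketch: build a compute-stage create info that forces full subgroups of a
// fixed size. `module` is assumed to be a valid VkShaderModule created
// elsewhere; requiredSubgroupSize must lie within the device's reported
// min/max subgroup size.
VkPipelineShaderStageCreateInfo make_compute_stage(VkShaderModule module) {
    // Pin the subgroup size (16 is reported fastest on Intel above).
    static VkPipelineShaderStageRequiredSubgroupSizeCreateInfo required_size{};
    required_size.sType =
        VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO;
    required_size.requiredSubgroupSize = 16;

    VkPipelineShaderStageCreateInfo stage{};
    stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext = &required_size; // static storage keeps this pointer valid
    // No partial subgroups; the workgroup size must then be a multiple of
    // the required subgroup size.
    stage.flags = VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT;
    stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = module;
    stage.pName = "main";
    return stage;
}
```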
