
Conversation

@giuseppe giuseppe commented Oct 7, 2025

Implement SSM scan and SSM conv for Vulkan.

Tested on an NVIDIA L4:
master (4e0388a)

# bin/llama-bench -r 5 -m ~/models/granite-4.0-h-tiny-UD-Q8_K_XL.gguf 

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     | 999 |           pp512 |       1419.59 ± 6.62 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     | 999 |           tg128 |         33.57 ± 0.19 |

and db8b8bc:

# bin/llama-bench -r 5 -m ~/models/granite-4.0-h-tiny-UD-Q8_K_XL.gguf

| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2569.27 ± 19.37 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         80.77 ± 0.12 |

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 7, 2025
@giuseppe giuseppe marked this pull request as ready for review October 7, 2025 14:05
@giuseppe giuseppe requested a review from 0cc4m as a code owner October 7, 2025 14:05
@jeffbolznv jeffbolznv (Collaborator) left a comment

Thanks for this contribution!

warp_sdata[warp_offset + lane] = val;
barrier();

if (lane < 16) warp_sdata[warp_offset + lane] += warp_sdata[warp_offset + lane + 16];
Collaborator:

This seems like it's assuming a subgroup size of 32 (also at line 37).

Collaborator:

Do I understand correctly that this doesn't actually rely on a subgroup size of 32, but rather splits the workgroup into groups of 32 and reduces those (and it looks like some reduction across groups of 32 has already happened)?

Contributor Author:

Sorry, I missed this one. Yeah, I don't think it would work with a size != 32. I need to think this one through more.

Do you have any suggestions on what I could do here?

Collaborator:

I think this may work because you're not relying on SubgroupInvocationId or SubgroupID; you've just split the workgroup into groups of 32. Maybe we can just test it on AMD (with wave64) and Intel and verify that it works.

Contributor Author:

It works on Intel, but I am worried about all these settings that we made configurable. I haven't really tried how it behaves with different values of the constants we defined. Or is the assumption that these values should not be tweaked from vulkan-shaders-gen.cpp without also changing the implementation in the shader?

Collaborator:

You could change this to a while loop that will handle any power-of-two value of WARP_SIZE. We do want to allow the spec constants to be changeable, but it's fine to have limitations like "must be a power of two".
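
A minimal sketch of that loop, assuming WARP_SIZE is a power-of-two spec constant and that warp_sdata, warp_offset, and lane mean what they do in the snippet above:

// Sketch only: generic shared-memory tree reduction over a group of
// WARP_SIZE threads; works for any power-of-two WARP_SIZE. barrier()
// must stay in uniform control flow, hence it sits outside the if.
uint offset = WARP_SIZE / 2;
while (offset > 0) {
    if (lane < offset) {
        warp_sdata[warp_offset + lane] += warp_sdata[warp_offset + lane + offset];
    }
    barrier();
    offset >>= 1;
}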

@netrunnereve netrunnereve (Collaborator) commented Oct 8, 2025

Hmm, wave64 AMD and wave8 llvmpipe are failing one test here, possibly due to this. All other tests are passing.

[SSM_SCAN] NMSE = 31335529439335960.000000000 > 0.000000100   SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4): FAIL

I think Intel also has a subgroup size of 32, so it wouldn't be a good test for this.

return warp_sdata[warp_offset];
}

void main() {
Collaborator:

Do all threads always load/store in bounds? In the host code there was some rounding up going on, which suggests maybe some threads don't correspond to in-bounds locations.

Collaborator:

Ping on this one. I don't really understand what this shader does and which locations it should be accessing.

Contributor Author:

I've tried to follow what the CUDA shader does. I'll spend more time on it to see if there is anything I can improve about memory access, and make sure all the assumptions in the code are checked.
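
For reference, the kind of in-bounds guard under discussion might look like this (a sketch only; n_elems, src, and dst are hypothetical names — with barrier() in the shader, out-of-range threads must still reach the barriers, so the accesses are guarded rather than returning early):

// Sketch: guard the loads/stores instead of early-returning, because
// every thread must still execute barrier() in uniform control flow.
const uint i = gl_GlobalInvocationID.x;
float val = (i < n_elems) ? src[i] : 0.0;
// ... reduction with barrier() runs for all threads ...
if (i < n_elems) {
    dst[i] = val;
}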

@giuseppe giuseppe commented Oct 8, 2025

I've addressed the comments and pushed a new version. The results are even better now:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA L4 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2569.27 ± 19.37 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         80.77 ± 0.12 |


string_to_spv("ssm_scan_f32_d16", "ssm_scan.comp", {{"A_TYPE", "float"}});
string_to_spv("ssm_scan_f32_d128", "ssm_scan.comp", {{"A_TYPE", "float"}});
string_to_spv("ssm_scan_f32_d256", "ssm_scan.comp", {{"A_TYPE", "float"}});
Collaborator:

These three are all identical now, you only need one.
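
Applied to the snippet above, the deduplication would leave a single call (a sketch; the final shader name is a guess):

// One compiled variant suffices since all three calls were identical.
string_to_spv("ssm_scan_f32", "ssm_scan.comp", {{"A_TYPE", "float"}});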


@giuseppe giuseppe commented Oct 9, 2025

I've completely replaced the reduce-sum code with subgroupAdd(), as sum.comp does. It didn't work earlier because I needed to use gl_SubgroupInvocationID == 0 instead of % WARP_SIZE == 0.

In the last version I've also renamed WARP_SIZE to SUBGROUP_SIZE.
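
For context, the subgroupAdd() pattern looks roughly like this (a sketch, assuming the GL_KHR_shader_subgroup extensions; shared_sums is a hypothetical name):

#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

// Sketch: hardware reduction across the whole subgroup; only the first
// invocation of each subgroup writes its partial sum out.
float sum = subgroupAdd(val);
if (gl_SubgroupInvocationID == 0) {
    shared_sums[gl_SubgroupID] = sum;
}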

@0cc4m 0cc4m commented Oct 9, 2025

Be aware that not all devices support subgroup commands. If there's a performance advantage to using them, you can do that, but it would still need a fallback to using a shared memory reduction. If there isn't a performance advantage, just use the shared memory reduction for compatibility.
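
One common way to keep both paths is to compile two shader variants and choose at pipeline-creation time (a sketch; USE_SUBGROUP_ADD and reduce_shared are hypothetical names):

#ifdef USE_SUBGROUP_ADD
    sum = subgroupAdd(val);   // fast path where subgroup arithmetic is supported
#else
    sum = reduce_shared(val); // shared-memory reduction fallback
#endif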

@giuseppe giuseppe commented Oct 9, 2025

> Be aware that not all devices support subgroup commands. If there's a performance advantage to using them, you can do that, but it would still need a fallback to using a shared memory reduction. If there isn't a performance advantage, just use the shared memory reduction for compatibility.

With the subgroup code I get:

ggml_vulkan: 0 = NVIDIA L4 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2692.95 ± 49.85 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         81.13 ± 0.08 |

If I revert to the reduction loop, I get:

ggml_vulkan: 0 = NVIDIA L4 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2562.04 ± 31.72 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         81.14 ± 0.09 |

@giuseppe giuseppe commented Oct 9, 2025

I've reverted to the version with a for loop. We can look at the subgroup optimization later.

@giuseppe giuseppe force-pushed the ssm-vulkan branch 3 times, most recently from 742ebd7 to c467631 on October 9, 2025 at 12:04
Add State Space Model scan operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <[email protected]>
Add State Space Model conv operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <[email protected]>