vulkan: support softmax/FA batch and broadcast #607

firecoperana · 2025-07-13T17:51:25Z

vulkan: support softmax/FA batch and broadcast
ggml-org/llama.cpp#14449
Fix gibberish output when FA is enabled for some models

The new PR adding FA for DeepSeek MLA is missing this change, which caused gibberish output in some models.
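For readers unfamiliar with the term, "broadcast" here means that a smaller mask tensor is reused across the head and batch dimensions of a larger attention tensor. The snippet below is a minimal CPU-side sketch of modulo-based broadcast indexing, not the shader code from this PR. The helper name `broadcast_mask_offset` and the dimension layout are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from ggml): given head index i2 and batch index i3
// of the attention tensor, return which mask slice to read when the mask only
// has mask_ne2 x mask_ne3 slices. Assumes the attention dims are exact
// multiples of the mask dims, so the modulo wraps cleanly.
inline std::size_t broadcast_mask_offset(std::size_t i2, std::size_t i3,
                                         std::size_t mask_ne2, std::size_t mask_ne3) {
    assert(mask_ne2 > 0 && mask_ne3 > 0);
    const std::size_t m2 = i2 % mask_ne2;  // wrap head index onto the mask
    const std::size_t m3 = i3 % mask_ne3;  // wrap batch index onto the mask
    return m3 * mask_ne2 + m2;             // flat slice index within the mask
}
```

With a mask of shape 1 x 1 in those two dimensions, every head and batch entry maps to slice 0, which is the common "one mask shared by all heads" case.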

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

# Conflicts:
#   ggml/src/ggml-vulkan.cpp
#   ggml/src/vulkan-shaders/flash_attn.comp
#   ggml/src/vulkan-shaders/flash_attn_cm1.comp
#   ggml/src/vulkan-shaders/flash_attn_cm2.comp

ubergarm · 2025-07-13T19:09:26Z

Great, this fixes the gibberish issue we were seeing over on #598 when I run with KHR_coopmat and -fa enabled:

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

However, on the AMD GPU rig it no longer outputs the same-looking gibberish, but now kind of chokes/freezes up around the same point where it used to produce gibberish. Then it very slowly outputs 3333:

$ ./build/bin/llama-server --version
version: 3796 (69ab6921)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

... For example, in French, numbers from  to 10 are all irregular except for 11-16 which333^C
Response cancelled.

Also, I get similar behavior, where it starts out okay and then goes to 33333, on my NVIDIA GPU when running with NV_coopmat2:

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

...Maybe the user is learning French or needs it for a specific purpose. They might be preparing for a trip, studying, or33333333333333333333333333333333333333333333333333333333333333333333333333333333333^C
Response cancelled.

So this PR does seem to fix the NVIDIA KHR_coopmat path with -fa enabled, but not the NVIDIA NV_coopmat2 path or the AMD KHR_coopmat path (libvulkan.so, found version "1.4.313").

* vulkan: Handle updated FA dim2/3 definition
  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
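The packing mentioned in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the bit layout (low 16 bits for `n_head_log2`, bit 16 for the mask flag) and all names here are hypothetical. The motivation is that 128 bytes is the minimum push constant size the Vulkan specification guarantees, so merging two fields into one 32-bit word saves 4 bytes in the block.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, for illustration only:
//   bits 0..15 : n_head_log2
//   bit  16    : 1 if a mask tensor is present, 0 otherwise
constexpr uint32_t kNHeadLog2Bits  = 0xFFFFu;
constexpr uint32_t kMaskFlagShift  = 16;

// Pack the mask flag and n_head_log2 into a single 32-bit word.
inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= kNHeadLog2Bits);  // value must fit in the low 16 bits
    return (static_cast<uint32_t>(has_mask) << kMaskFlagShift) | n_head_log2;
}

// Recover the mask flag (what a shader would do with a shift and a mask).
inline bool unpack_has_mask(uint32_t packed) {
    return ((packed >> kMaskFlagShift) & 1u) != 0;
}

// Recover n_head_log2 from the low bits.
inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & kNHeadLog2Bits;
}
```

For example, under this assumed layout `pack_mask_n_head_log2(true, 8)` yields `0x10008`, and both fields can be recovered with the two unpack helpers.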

firecoperana · 2025-07-13T23:46:43Z

Can you try again?

ubergarm · 2025-07-14T01:38:51Z

Hey thanks a lot for working on this stuff! I just tried again with dba868a with the three cases:

NVIDIA 3090TI FE

KHR_coopmat is still working okay it seems
NV_coopmat2 still glitches out similarly.

AMD RX 7900 XTX

KHR_coopmat still glitches out

Yeah, so it seems unchanged, with two cases still suddenly outputting just 3s (e.g. "so cardinal numbers33^C") about 225 tokens into the reply. I have some time tomorrow to test anything else, thanks!

ikawrakow · 2025-07-15T06:04:07Z

@firecoperana

Is this necessary after #608?

firecoperana · 2025-07-15T12:30:20Z

Already included in main.

vulkan: support softmax/FA batch and broadcast (#14449)

69ab692

firecoperana requested a review from ikawrakow July 13, 2025 17:51

firecoperana self-assigned this Jul 13, 2025

firecoperana mentioned this pull request Jul 13, 2025

Vulkan: iquants and flash attention split_k_reduce improvement #598

Closed

4 tasks

vulkan: Handle updated FA dim2/3 definition (#14518)

dba868a

firecoperana closed this Jul 15, 2025

firecoperana deleted the fcp/vulkan_fa_fix_dsv branch July 16, 2025 14:32
