firecoperana
Collaborator

vulkan: support softmax/FA batch and broadcast
ggml-org/llama.cpp#14449
Fix gibberish output when FA is enabled for some models

The new FA for deepseek MLA PR is missing this, which caused gibberish output in some models.

# Conflicts:
#	ggml/src/ggml-vulkan.cpp
#	ggml/src/vulkan-shaders/flash_attn.comp
#	ggml/src/vulkan-shaders/flash_attn_cm1.comp
#	ggml/src/vulkan-shaders/flash_attn_cm2.comp
@ubergarm
Contributor

ubergarm commented Jul 13, 2025

Great, this fixes the gibberish issue we were seeing over on #598 when I run with KHR_coopmat and -fa enabled:

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

However, on the AMD GPU rig it no longer outputs that same-looking gibberish, but now kinda chokes/freezes up around the same point where it used to throw gibberish. Then it very slowly outputs 3333:

$ ./build/bin/llama-server --version
version: 3796 (69ab6921)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

... For example, in French, numbers from  to 10 are all irregular except for 11-16 which333^C
Response cancelled.

Also, I get similar behavior on my NVIDIA GPU when running with NV_coopmat2: it starts out okay, then goes to 33333.

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

...Maybe the user is learning French or needs it for a specific purpose. They might be preparing for a trip, studying, or33333333333333333333333333333333333333333333333333333333333333333333333333333333333^C
Response cancelled.

So this PR does seem to fix the NVIDIA KHR_coopmat path with -fa enabled, but not the NVIDIA NV_coopmat2 path or the AMD KHR_coopmat path (libvulkan.so, found version "1.4.313").

* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
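
The packing described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the bit layout (flag in bit 16, n_head_log2 in the low 16 bits), the helper names, and the placeholder struct are all assumptions made for the example. The 128-byte figure matches the minimum push constant size that Vulkan guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit layout (an assumption for this sketch, not taken from the PR):
// the low 16 bits carry n_head_log2, and bit 16 flags whether a mask is present.
constexpr uint32_t N_HEAD_LOG2_BITS = 0xFFFFu;
constexpr uint32_t MASK_ENABLE_BIT  = 1u << 16;

// Pack the mask flag and n_head_log2 into one 32-bit dword.
inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_BITS);  // value must fit in the low 16 bits
    return (has_mask ? MASK_ENABLE_BIT : 0u) | n_head_log2;
}

// Recover the two fields, as a shader would do with the same bit operations.
inline bool unpack_has_mask(uint32_t packed) {
    return (packed & MASK_ENABLE_BIT) != 0;
}

inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}

// Placeholder push-constant block: 31 unrelated dwords stand in for the real
// fields. With the packed dword it is exactly 128 bytes; keeping the flag and
// n_head_log2 as two separate dwords would make it 132 bytes.
struct fa_push_constants_sketch {
    uint32_t other_fields[31];
    uint32_t mask_n_head_log2;
};

static_assert(sizeof(fa_push_constants_sketch) <= 128,
              "push constant block must stay within 128 bytes");
```

The static_assert is there so that adding another field to the block fails at compile time instead of at pipeline creation.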
@firecoperana
Collaborator Author

Can you try again?

@ubergarm
Contributor

Hey thanks a lot for working on this stuff! I just tried again with dba868a with the three cases:

NVIDIA 3090TI FE

  • KHR_coopmat is still working okay it seems
  • NV_coopmat2 still glitches out similarly.

AMD RX 7900 XTX

  • KHR_coopmat still glitches out

Yeah, so it seems unchanged, with two cases still suddenly outputting just 3s ("so cardinal numbers33^C") about 225 tokens into the reply. I have some time tomorrow to test anything else, thanks!

@ikawrakow
Owner

@firecoperana

Is this necessary after #608?

@firecoperana
Collaborator Author

Already included in the main.

@firecoperana firecoperana deleted the fcp/vulkan_fa_fix_dsv branch July 16, 2025 14:32