Misc. bug: Flash attention on Vulkan

### Name and Version

$ ./build/bin/llama-cli --version
version: 4941 (ba932dfb)
built with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

### Command line

```shell
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-14B-Instruct-Q4_K_L.gguf -fa 0,1
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf -fa 0,1
```

### Problem description & steps to reproduce

It seems that some flash attention (FA) operations are not yet handled by the Vulkan backend and fall back to the CPU. I can't find an open issue about this, only several closed ones, so maybe the problem is limited to certain model architectures, or to my GPU?

Using Mesa RADV 25.0.1 on Fedora Linux 41 with an AMD Radeon RX 6700 XT (gfx1030 (0x1030), VMM: no, Wave Size: 32).

| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | ROCm       |  99 |  0 |         pp512 |        408.32 ± 0.13 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | ROCm       |  99 |  0 |         tg128 |         27.97 ± 0.01 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | ROCm       |  99 |  1 |         pp512 |        352.63 ± 3.31 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | ROCm       |  99 |  1 |         tg128 |         26.74 ± 0.01 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | Vulkan     |  99 |  0 |         pp512 |        256.50 ± 0.14 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | Vulkan     |  99 |  0 |         tg128 |         34.37 ± 0.09 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | Vulkan     |  99 |  1 |         pp512 |        100.24 ± 1.31 |
| qwen2 14B Q4_K - Medium        |   8.90 GiB |    14.77 B | Vulkan     |  99 |  1 |         tg128 |         22.07 ± 0.03 |

build: ba932dfb (4941)


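The performance loss seems more pronounced with smaller models. With FA enabled on Vulkan, the 1.5B model keeps only about 25% of its fa=0 pp512 throughput and about 31% of its tg128 throughput, versus about 39% and 64% for the 14B model: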

| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | ROCm       |  99 |  0 |         pp512 |       4682.96 ± 2.85 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | ROCm       |  99 |  0 |         tg128 |        113.17 ± 0.03 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | ROCm       |  99 |  1 |         pp512 |       3458.94 ± 3.00 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | ROCm       |  99 |  1 |         tg128 |        100.59 ± 0.22 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | Vulkan     |  99 |  0 |         pp512 |       2778.95 ± 4.85 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | Vulkan     |  99 |  0 |         tg128 |        140.45 ± 0.39 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | Vulkan     |  99 |  1 |         pp512 |        685.72 ± 5.62 |
| qwen2 1.5B Q8_0                |   1.53 GiB |     1.54 B | Vulkan     |  99 |  1 |         tg128 |         42.96 ± 0.83 |

build: ba932dfb (4941)
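To make the fa=0 vs fa=1 comparison easier to repeat on other models or drivers, the ratios above can be computed straight from llama-bench's table. This is a minimal sketch, assuming llama-bench's default markdown output with the columns in the order shown above; `fa_ratio` is a hypothetical helper, not something shipped with llama.cpp.

```shell
# Hypothetical helper (not part of llama.cpp). Reads llama-bench's default
# markdown table on stdin and, for every model/backend/test combination that
# was run with both fa=0 and fa=1, prints fa=1 throughput as a % of fa=0.
fa_ratio() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    {
      fa = trim($7)                              # column 6 of the table: fa
      if (fa != "0" && fa != "1") next           # skip header, separator, other lines
      key = trim($2) " / " trim($5) " / " trim($8)   # model / backend / test
      split(trim($9), v, " ")                    # "685.72 ± 5.62" -> v[1] = "685.72"
      tps[key, fa] = v[1] + 0
      if (!(key in seen)) { seen[key] = 1; order[++n] = key }  # keep input order
    }
    END {
      for (i = 1; i <= n; i++) {
        k = order[i]
        if (((k, "0") in tps) && ((k, "1") in tps))
          printf "%s: fa=1 at %.1f%% of fa=0\n", k, 100 * tps[k, "1"] / tps[k, "0"]
      }
    }'
}

# Example using two rows from the 1.5B table above.
fa_ratio <<'EOF'
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | pp512 | 2778.95 ± 4.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | pp512 | 685.72 ± 5.62 |
EOF
```

The example prints `qwen2 1.5B Q8_0 / Vulkan / pp512: fa=1 at 24.7% of fa=0`. The same function can be fed live output, e.g. `./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf -fa 0,1 | fa_ratio`; the trailing `build:` line is skipped because it has no `fa` column.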

### First Bad Commit

_No response_

### Relevant log output

```shell

```
