Closed
Labels
bug-unconfirmed, medium severity: used to report medium severity bugs in llama.cpp (e.g. malfunctioning features but still usable)
Description
What happened?
Enabling flash attention reduces performance on Vulkan by far more than expected. Even allowing for variation across hardware, the logs below show generation speed falling from 63.33 to 32.69 tokens per second (1 - 32.69/63.33 ≈ 48%), and a drop of that size looks like a bug.
Hardware: AMD RX 6800 XT
Name and Version
version: 3772 (23e0d70)
built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
llama-b3772-bin-win-vulkan-x64> ./llama-cli.exe -m '..\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' -p "to be or" -n 600 -c 4096 -ngl 99
Performance without flash attention:
llama_perf_sampler_print: sampling time = 48.42 ms / 604 runs ( 0.08 ms per token, 12474.70 tokens per second)
llama_perf_context_print: load time = 13033.53 ms
llama_perf_context_print: prompt eval time = 183.59 ms / 4 tokens ( 45.90 ms per token, 21.79 tokens per second)
llama_perf_context_print: eval time = 9458.98 ms / 599 runs ( 15.79 ms per token, 63.33 tokens per second)
llama_perf_context_print: total time = 9765.68 ms / 603 tokens
llama-b3772-bin-win-vulkan-x64> ./llama-cli.exe -m '..\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' -p "to be or" -n 600 -c 4096 -ngl 99 --flash-attn
Performance with flash attention:
llama_perf_sampler_print: sampling time = 48.48 ms / 604 runs ( 0.08 ms per token, 12458.75 tokens per second)
llama_perf_context_print: load time = 2709.09 ms
llama_perf_context_print: prompt eval time = 194.77 ms / 4 tokens ( 48.69 ms per token, 20.54 tokens per second)
llama_perf_context_print: eval time = 18321.90 ms / 599 runs ( 30.59 ms per token, 32.69 tokens per second)
llama_perf_context_print: total time = 18617.86 ms / 603 tokens
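For what it's worth, the same comparison can be reproduced in a single run with the llama-bench tool shipped in the same release archive; this is a sketch assuming the binary name and flags from its --help output (-fa 0,1 sweeps flash attention off and on):
llama-b3772-bin-win-vulkan-x64> ./llama-bench.exe -m '..\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' -ngl 99 -fa 0,1
This prints one result row per configuration, which makes the prompt-processing and generation throughput easy to compare side by side.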