Commit a04b329

vulkan: scalar flash attention implementation (llama/13324)

* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
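For readers unfamiliar with the technique named above, here is a minimal CPU sketch of flash attention for a single query row, written in plain C++ with fp32 accumulation. It is not the shader from this commit. It shows only the general single-pass "online softmax" formulation, and the function name `flash_attn_row`, its parameters, and the row-major layout are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not taken from the commit): attention output for one
// query row, computed in a single streaming pass over the KV rows in fp32.
// q holds D values; k and v hold n_kv * D values in row-major order.
std::vector<float> flash_attn_row(const std::vector<float>& q,
                                  const std::vector<float>& k,
                                  const std::vector<float>& v,
                                  std::size_t n_kv, std::size_t D, float scale) {
    assert(q.size() == D && k.size() == n_kv * D && v.size() == n_kv * D);

    float m = -std::numeric_limits<float>::infinity(); // running max of the scores
    float l = 0.0f;                                    // running softmax denominator
    std::vector<float> o(D, 0.0f);                     // unnormalized output accumulator

    for (std::size_t j = 0; j < n_kv; ++j) {
        // Scaled dot product of q with the j-th K row.
        float s = 0.0f;
        for (std::size_t d = 0; d < D; ++d) {
            s += q[d] * k[j * D + d];
        }
        s *= scale;

        // Online softmax update: rescale earlier contributions to the new max.
        const float m_new = std::max(m, s);
        const float alpha = std::exp(m - m_new); // 0 on the first iteration
        const float p     = std::exp(s - m_new);

        l = l * alpha + p;
        for (std::size_t d = 0; d < D; ++d) {
            o[d] = o[d] * alpha + p * v[j * D + d];
        }
        m = m_new;
    }

    // Final normalization; skipped when there were no KV rows.
    if (l > 0.0f) {
        for (std::size_t d = 0; d < D; ++d) {
            o[d] /= l;
        }
    }
    return o;
}
```

Because the running maximum and denominator are updated as each KV row arrives, the full score row never has to be materialized. That property is what makes the approach attractive when registers or shared memory are scarce.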

1 parent 45d8b23, commit a04b329

3 files changed, +645 -93 lines changed

ggml/src/ggml-vulkan
- ggml-vulkan.cpp
- vulkan-shaders
  - flash_attn.comp
  - vulkan-shaders-gen.cpp

