Commit dc1d2ad
vulkan: scalar flash attention implementation (ggml-org#13324)
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA

Parent: 7c28a74
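For context on what these bullets refer to: the scalar path evaluates flash attention one query row at a time using the streaming ("online softmax") recurrence, keeping the running max, running sum, and output accumulator in fp32 and never materializing a full PV matrix. Below is a minimal CPU sketch of that recurrence under those assumptions; the function and variable names are illustrative only and are not taken from the ggml-vulkan shader.

```cpp
// Minimal CPU sketch of the streaming ("online softmax") flash-attention loop
// for a single query row, accumulated in fp32. Illustrative only; not the
// actual ggml-vulkan shader code.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// One query row `q` against K/V rows of dimension `d`. The output O is
// accumulated incrementally, so no full PV matrix is kept around.
std::vector<float> scalar_flash_attn_row(const std::vector<float> &q,
                                         const std::vector<std::vector<float>> &K,
                                         const std::vector<std::vector<float>> &V,
                                         float scale) {
    const size_t d = q.size();
    std::vector<float> O(d, 0.0f);   // running output accumulator
    float m = -INFINITY;             // running max of the scaled scores
    float l = 0.0f;                  // running sum of exp(score - m)

    for (size_t j = 0; j < K.size(); ++j) {
        // scaled dot product q . k_j
        float s = 0.0f;
        for (size_t i = 0; i < d; ++i) s += q[i] * K[j][i];
        s *= scale;

        // online softmax update: rescale previous accumulators if the max grows
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new);   // 0 when m was -inf
        const float p     = std::exp(s - m_new);

        for (size_t i = 0; i < d; ++i) O[i] = O[i] * corr + p * V[j][i];
        l = l * corr + p;
        m = m_new;
    }

    for (size_t i = 0; i < d; ++i) O[i] /= l;      // final normalization
    return O;
}

int main() {
    std::vector<float> q = {1.0f, 0.0f};
    std::vector<std::vector<float>> K = {{1.0f, 0.0f}, {0.0f, 1.0f}};
    std::vector<std::vector<float>> V = {{1.0f, 2.0f}, {3.0f, 4.0f}};
    auto O = scalar_flash_attn_row(q, K, V, 1.0f / std::sqrt(2.0f));
    std::printf("O = [%f, %f]\n", O[0], O[1]);
    return 0;
}
```

Keeping the max/sum/output accumulators in fp32 avoids the overflow and precision problems that fp16 accumulation can hit on long KV sequences, which is consistent with the "always use fp32 for scalar flash attention" bullet above.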
File tree (4 files changed, +646 -94 lines):
- .github/workflows
- ggml/src/ggml-vulkan
  - vulkan-shaders