Commit c262bed
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models
The vector flash decoding kernel was not being picked for models with head dimension 256; Gemma models fall into this category.
Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
* Update ggml/src/ggml-cuda/fattn.cu
Co-authored-by: Johannes Gäßler <[email protected]>
---------
Co-authored-by: Johannes Gäßler <[email protected]>

1 parent: 5dd5d1a
1 file changed: ggml/src/ggml-cuda/fattn.cu, +1 −1

[Diff placeholder: original line 302 replaced by a new line 302; context lines 299–305 unchanged. The diff body itself was not preserved.]
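For illustration, here is a minimal sketch of the kind of one-line predicate change the commit message describes: dropping a head-size cap from the heuristic that decides whether the vector flash decoding kernel may be used. All names here (choose_fattn_kernel, fattn_kernel, the ≤128 cap) are assumptions for illustration, not the actual fattn.cu code.

```cpp
// Hypothetical sketch of a flash-attention kernel-selection predicate.
// Names and the exact conditions are assumptions; this is NOT the real
// ggml/src/ggml-cuda/fattn.cu diff, only an illustration of the idea.
#include <cstdint>

enum class fattn_kernel { tile, vector };

// Q_head_dim: embedding size per attention head (256 for Gemma models).
// Q_rows:     number of query rows; 1 during token-by-token decoding.
static fattn_kernel choose_fattn_kernel(int64_t Q_head_dim, int64_t Q_rows,
                                        int warp_size) {
    const bool batch_of_one = Q_rows == 1;                    // generation phase
    const bool dim_aligned  = Q_head_dim % (2 * warp_size) == 0;

    // Before (hypothetical): an extra cap such as `Q_head_dim <= 128`
    // excluded large heads, so head dimension 256 (Gemma) fell through
    // to a slower kernel. After: the cap is dropped and alignment plus
    // batch size alone decide.
    if (batch_of_one && dim_aligned) {
        return fattn_kernel::vector;
    }
    return fattn_kernel::tile;
}

int main() {
    // Gemma-style head dim 256 with warp size 32 now selects the vector kernel.
    return choose_fattn_kernel(/*Q_head_dim=*/256, /*Q_rows=*/1,
                               /*warp_size=*/32) == fattn_kernel::vector ? 0 : 1;
}
```

Under this reading, the one-line change relaxes only the dispatch heuristic; the kernels themselves are untouched, which matches the +1/−1 diff stats.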