Conversation

@JohannesGaessler (Collaborator) commented Oct 9, 2025

Changes:

  • Generalized the tile CUDA FlashAttention kernel to support essentially arbitrary head sizes (in particular 40 for Stable Diffusion and 576/512 for DeepSeek) as well as arbitrary context sizes (for optimal performance the context should still be padded to a multiple of 256; long term this can be lowered to 128). The tile kernel is now used as a fallback whenever the other kernels cannot be used. I intend to also add support for non-padded ne11 to the mma kernel.
  • Added the same GQA optimizations from the mma kernel to the tile kernel, which reduces I/O for the mask and increases arithmetic intensity for small batch sizes. To keep the number of kernel specializations low I'm using the same strategy of putting support for optional features (ALiBi, no mask, non-padded KV cache) into the version without GQA. The GQA optimizations require additional integer divisions, which are as of yet still done without fastdiv in the FA kernels; because of this there are some combinations of GPUs, models, and batch sizes with a 1-2% performance regression. I intend to add fastdiv once I have removed the WMMA kernel and expect the regression to be fixed then (a sketch of the fastdiv idea follows this list). Also note that the granularity in terms of tokens is now reduced by a factor equal to the GQA ratio, so even in those cases there is slightly less wasted compute.
  • Added support for multiple parallel warps per Q column to improve performance for small batch sizes. With this additional optimization the tile kernel now seems to be a better choice for batch size 1 than the vector kernel, particularly on AMD hardware (a sketch of how the per-warp partial results are combined also follows this list).
  • Fixed a bug in common.cuh where, if one were to compile code only for CC 6.1 and then run it on a device with CC >= 7.0, the compile-time macro FAST_FP16_AVAILABLE and the runtime check fast_fp16_available could be inconsistent (illustrated in the last sketch below).
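
For reference, the fastdiv technique mentioned above replaces a hardware integer division by a divisor that is fixed at kernel launch time with a multiply-high, an add, and a shift, using a magic multiplier precomputed on the host. The following is a minimal sketch of that idea, assuming the dividends stay below 2^31 (which holds for the tensor indices involved here); it is not the exact helper that will be added to the FA kernels.

```cuda
#include <cstdint>

// Host side: precompute the magic multiplier and shift for a fixed divisor d > 0.
struct fastdiv_vals {
    uint32_t mult;  // magic multiplier
    uint32_t shift; // ceil(log2(d))
};

static fastdiv_vals fastdiv_init(const uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        ++L;
    }
    const uint32_t mult = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return {mult, L};
}

// Device side: quotient and remainder without a hardware integer division.
__device__ __forceinline__ uint32_t fastdiv(const uint32_t n, const fastdiv_vals fd) {
    const uint32_t hi = __umulhi(n, fd.mult); // high 32 bits of n * mult
    return (hi + n) >> fd.shift;              // == n / d for n < 2^31
}

__device__ __forceinline__ uint32_t fastmod(const uint32_t n, const fastdiv_vals fd, const uint32_t d) {
    return n - fastdiv(n, fd) * d;
}
```

With this, each division by the GQA ratio inside the kernel costs one __umulhi, one add, and one shift, with the magic pair computed once per launch and passed as a kernel argument.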
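Regarding the parallel warps per Q column: when several warps process disjoint slices of the KV data for the same Q column, each warp ends up with a partial running maximum, a partial sum of exponentials, and a partial unnormalized output, and these have to be combined at the end. Below is a generic sketch of the standard online-softmax combination for one accumulator element; it is an illustration of the technique, not the actual code in fattn-tile.cu.

```cuda
// Partial FlashAttention state held by one warp for one Q column:
//   m = running maximum of the attention logits seen so far,
//   s = sum of exp(logit - m) over the warp's KV slice,
//   o = one element of the unnormalized output accumulator (relative to m).
struct fa_partial {
    float m;
    float s;
    float o;
};

// Merge two partial states that cover disjoint KV ranges into one.
__device__ __forceinline__ fa_partial fa_merge(const fa_partial a, const fa_partial b) {
    fa_partial c;
    c.m = fmaxf(a.m, b.m);
    const float scale_a = expf(a.m - c.m); // rescale both sides to the common maximum
    const float scale_b = expf(b.m - c.m);
    c.s = scale_a*a.s + scale_b*b.s;
    c.o = scale_a*a.o + scale_b*b.o;
    return c;
}
```

The final attention output for the column is o / s of the fully merged state, so the extra warps only add a small reduction over their partial states (e.g. via shared memory) at the end of the kernel.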
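As for the common.cuh fix: the device-side macro is decided at compile time from __CUDA_ARCH__, while the host-side helper is decided at run time from the compute capability of the physical device. If only CC 6.1 code is compiled but the binary runs on a CC >= 7.0 device, the two can disagree unless the host-side check is clamped to the highest architecture that was actually compiled. The sketch below only illustrates this idea; highest_compiled_arch() and the exact conditions are placeholders, not the real code in common.cuh.

```cuda
// Device side: fixed at compile time for each compiled architecture.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
#define FAST_FP16_AVAILABLE_SKETCH
#endif

// Hypothetical stand-in: the highest architecture this binary was compiled for that
// the device can run (in the real code this comes from the build configuration).
static int highest_compiled_arch(const int /*device_cc*/) {
    return 610; // e.g. only CC 6.1 was built
}

// Buggy idea: checks the physical device, so it returns true on a CC 7.0 device even
// though the code that actually runs there was compiled with __CUDA_ARCH__ == 610.
static bool fast_fp16_available_buggy(const int device_cc) {
    return device_cc >= 530 && device_cc != 610;
}

// Fixed idea: check the architecture the running code was actually compiled for,
// so that the host-side answer matches the device-side macro.
static bool fast_fp16_available_fixed(const int device_cc) {
    const int cc = highest_compiled_arch(device_cc);
    return cc >= 530 && cc != 610;
}
```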
Performance
| GPU | Model | n_gqa | Microbatch size | Test | t/s master | t/s a2672e3 | Speedup |
|---|---|---|---|---|---|---|---|
| MI50 | gemma 2B Q4_0 | 8 | 1 | pp16384 | 167.96 | 198.25 | 1.18 |
| MI50 | gemma 2B Q4_0 | 8 | 2 | pp16384 | 176.09 | 352.99 | 2.00 |
| MI50 | gemma 2B Q4_0 | 8 | 4 | pp16384 | 192.18 | 369.27 | 1.92 |
| MI50 | gemma 2B Q4_0 | 8 | 8 | pp16384 | 202.19 | 497.16 | 2.46 |
| MI50 | gemma 2B Q4_0 | 8 | 16 | pp16384 | 723.23 | 736.30 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 32 | pp16384 | 919.54 | 924.45 | 1.01 |
| MI50 | gemma 2B Q4_0 | 8 | 64 | pp16384 | 1059.55 | 1077.96 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 128 | pp16384 | 1686.29 | 1687.00 | 1.00 |
| MI50 | gemma 2B Q4_0 | 8 | 256 | pp16384 | 2129.54 | 2161.97 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 512 | pp16384 | 2301.56 | 2358.12 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 1024 | pp16384 | 2495.75 | 2555.49 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 2048 | pp16384 | 2553.87 | 2623.85 | 1.03 |
| MI50 | gemma3 1B Q4_0 | 4 | 1 | pp16384 | 164.69 | 202.31 | 1.23 |
| MI50 | gemma3 1B Q4_0 | 4 | 2 | pp16384 | 210.73 | 397.85 | 1.89 |
| MI50 | gemma3 1B Q4_0 | 4 | 4 | pp16384 | 321.87 | 528.05 | 1.64 |
| MI50 | gemma3 1B Q4_0 | 4 | 8 | pp16384 | 488.20 | 749.39 | 1.53 |
| MI50 | gemma3 1B Q4_0 | 4 | 16 | pp16384 | 1067.05 | 1060.52 | 0.99 |
| MI50 | gemma3 1B Q4_0 | 4 | 32 | pp16384 | 1466.17 | 1470.02 | 1.00 |
| MI50 | gemma3 1B Q4_0 | 4 | 64 | pp16384 | 1786.39 | 1787.31 | 1.00 |
| MI50 | gemma3 1B Q4_0 | 4 | 128 | pp16384 | 2902.76 | 2920.32 | 1.01 |
| MI50 | gemma3 1B Q4_0 | 4 | 256 | pp16384 | 4668.56 | 4721.55 | 1.01 |
| MI50 | gemma3 1B Q4_0 | 4 | 512 | pp16384 | 5672.12 | 5747.29 | 1.01 |
| MI50 | gemma3 1B Q4_0 | 4 | 1024 | pp16384 | 6558.17 | 6697.35 | 1.02 |
| MI50 | gemma3 1B Q4_0 | 4 | 2048 | pp16384 | 6883.96 | 7051.50 | 1.02 |
| MI50 | llama 1B Q4_0 | 4 | 1 | pp16384 | 242.57 | 278.66 | 1.15 |
| MI50 | llama 1B Q4_0 | 4 | 2 | pp16384 | 353.90 | 543.58 | 1.54 |
| MI50 | llama 1B Q4_0 | 4 | 4 | pp16384 | 371.55 | 606.71 | 1.63 |
| MI50 | llama 1B Q4_0 | 4 | 8 | pp16384 | 432.43 | 911.49 | 2.11 |
| MI50 | llama 1B Q4_0 | 4 | 16 | pp16384 | 1059.64 | 1088.76 | 1.03 |
| MI50 | llama 1B Q4_0 | 4 | 32 | pp16384 | 1331.31 | 1424.42 | 1.07 |
| MI50 | llama 1B Q4_0 | 4 | 64 | pp16384 | 1583.11 | 1632.63 | 1.03 |
| MI50 | llama 1B Q4_0 | 4 | 128 | pp16384 | 2286.05 | 2402.55 | 1.05 |
| MI50 | llama 1B Q4_0 | 4 | 256 | pp16384 | 2826.65 | 3022.98 | 1.07 |
| MI50 | llama 1B Q4_0 | 4 | 512 | pp16384 | 3185.15 | 3460.37 | 1.09 |
| MI50 | llama 1B Q4_0 | 4 | 1024 | pp16384 | 3357.02 | 3739.09 | 1.11 |
| MI50 | llama 1B Q4_0 | 4 | 2048 | pp16384 | 3421.31 | 3845.73 | 1.12 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 1 | pp16384 | 162.20 | 184.57 | 1.14 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 2 | pp16384 | 253.97 | 368.12 | 1.45 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 4 | pp16384 | 269.24 | 539.88 | 2.01 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 8 | pp16384 | 276.55 | 828.88 | 3.00 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 16 | pp16384 | 1021.37 | 1081.49 | 1.06 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 32 | pp16384 | 1292.01 | 1299.16 | 1.01 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 64 | pp16384 | 1539.11 | 1587.87 | 1.03 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 128 | pp16384 | 2178.27 | 2304.22 | 1.06 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 256 | pp16384 | 2817.67 | 3097.36 | 1.10 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 512 | pp16384 | 2975.17 | 3316.98 | 1.11 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 1024 | pp16384 | 3140.11 | 3577.30 | 1.14 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 2048 | pp16384 | 3197.84 | 3675.02 | 1.15 |
| RX 6800 | gemma 2B Q4_0 | 8 | 1 | pp16384 | 130.83 | 153.47 | 1.17 |
| RX 6800 | gemma 2B Q4_0 | 8 | 2 | pp16384 | 144.47 | 282.76 | 1.96 |
| RX 6800 | gemma 2B Q4_0 | 8 | 4 | pp16384 | 184.49 | 444.55 | 2.41 |
| RX 6800 | gemma 2B Q4_0 | 8 | 8 | pp16384 | 204.98 | 569.48 | 2.78 |
| RX 6800 | gemma 2B Q4_0 | 8 | 16 | pp16384 | 627.12 | 704.11 | 1.12 |
| RX 6800 | gemma 2B Q4_0 | 8 | 32 | pp16384 | 982.44 | 1030.58 | 1.05 |
| RX 6800 | gemma 2B Q4_0 | 8 | 64 | pp16384 | 1277.53 | 1322.01 | 1.03 |
| RX 6800 | gemma 2B Q4_0 | 8 | 128 | pp16384 | 1572.44 | 1617.47 | 1.03 |
| RX 6800 | gemma 2B Q4_0 | 8 | 256 | pp16384 | 1798.28 | 1861.54 | 1.04 |
| RX 6800 | gemma 2B Q4_0 | 8 | 512 | pp16384 | 1965.40 | 2042.51 | 1.04 |
| RX 6800 | gemma 2B Q4_0 | 8 | 1024 | pp16384 | 2044.63 | 2132.01 | 1.04 |
| RX 6800 | gemma 2B Q4_0 | 8 | 2048 | pp16384 | 2057.27 | 2142.36 | 1.04 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 1 | pp16384 | 145.73 | 167.24 | 1.15 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 2 | pp16384 | 228.12 | 325.23 | 1.43 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 4 | pp16384 | 389.37 | 573.70 | 1.47 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 8 | pp16384 | 572.42 | 852.88 | 1.49 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 16 | pp16384 | 1026.18 | 1044.69 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 32 | pp16384 | 1676.56 | 1679.82 | 1.00 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 64 | pp16384 | 2335.22 | 2421.80 | 1.04 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 128 | pp16384 | 3635.14 | 3724.20 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 256 | pp16384 | 4423.00 | 4520.34 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 512 | pp16384 | 5197.50 | 5324.55 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 1024 | pp16384 | 5534.11 | 5707.96 | 1.03 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 2048 | pp16384 | 5416.10 | 5608.01 | 1.04 |
| RX 6800 | llama 1B Q4_0 | 4 | 1 | pp16384 | 161.23 | 203.56 | 1.26 |
| RX 6800 | llama 1B Q4_0 | 4 | 2 | pp16384 | 274.30 | 392.64 | 1.43 |
| RX 6800 | llama 1B Q4_0 | 4 | 4 | pp16384 | 388.18 | 661.61 | 1.70 |
| RX 6800 | llama 1B Q4_0 | 4 | 8 | pp16384 | 474.39 | 879.80 | 1.85 |
| RX 6800 | llama 1B Q4_0 | 4 | 16 | pp16384 | 964.20 | 993.17 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 4 | 32 | pp16384 | 1366.04 | 1293.80 | 0.95 |
| RX 6800 | llama 1B Q4_0 | 4 | 64 | pp16384 | 1758.54 | 1783.32 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 4 | 128 | pp16384 | 2111.00 | 2169.63 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 4 | 256 | pp16384 | 2438.34 | 2474.46 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 4 | 512 | pp16384 | 2552.97 | 2635.47 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 4 | 1024 | pp16384 | 2633.76 | 2806.73 | 1.07 |
| RX 6800 | llama 1B Q4_0 | 4 | 2048 | pp16384 | 2644.12 | 2831.54 | 1.07 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 1 | pp16384 | 121.95 | 133.86 | 1.10 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 2 | pp16384 | 209.27 | 261.31 | 1.25 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 4 | pp16384 | 293.08 | 497.67 | 1.70 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 8 | pp16384 | 380.72 | 770.64 | 2.02 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 16 | pp16384 | 835.10 | 935.36 | 1.12 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 32 | pp16384 | 1285.99 | 1345.56 | 1.05 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 64 | pp16384 | 1711.28 | 1662.70 | 0.97 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 128 | pp16384 | 2160.39 | 2110.09 | 0.98 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 256 | pp16384 | 2395.22 | 2350.87 | 0.98 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 512 | pp16384 | 2588.48 | 2531.94 | 0.98 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 1024 | pp16384 | 2710.27 | 2680.50 | 0.99 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 2048 | pp16384 | 2746.50 | 2704.52 | 0.98 |
| P40 | gemma 2B Q4_0 | 8 | 1 | pp16384 | 136.46 | 145.68 | 1.07 |
| P40 | gemma 2B Q4_0 | 8 | 2 | pp16384 | 257.59 | 283.79 | 1.10 |
| P40 | gemma 2B Q4_0 | 8 | 4 | pp16384 | 321.20 | 356.73 | 1.11 |
| P40 | gemma 2B Q4_0 | 8 | 8 | pp16384 | 453.33 | 521.46 | 1.15 |
| P40 | gemma 2B Q4_0 | 8 | 16 | pp16384 | 827.93 | 923.70 | 1.12 |
| P40 | gemma 2B Q4_0 | 8 | 32 | pp16384 | 1182.04 | 1187.99 | 1.01 |
| P40 | gemma 2B Q4_0 | 8 | 64 | pp16384 | 1435.17 | 1434.64 | 1.00 |
| P40 | gemma 2B Q4_0 | 8 | 128 | pp16384 | 1577.10 | 1567.24 | 0.99 |
| P40 | gemma 2B Q4_0 | 8 | 256 | pp16384 | 1690.60 | 1669.24 | 0.99 |
| P40 | gemma 2B Q4_0 | 8 | 512 | pp16384 | 1746.00 | 1749.15 | 1.00 |
| P40 | gemma 2B Q4_0 | 8 | 1024 | pp16384 | 1816.87 | 1835.29 | 1.01 |
| P40 | gemma 2B Q4_0 | 8 | 2048 | pp16384 | 1832.82 | 1845.11 | 1.01 |
| P40 | gemma3 1B Q4_0 | 4 | 1 | pp16384 | 176.32 | 178.80 | 1.01 |
| P40 | gemma3 1B Q4_0 | 4 | 2 | pp16384 | 349.08 | 391.32 | 1.12 |
| P40 | gemma3 1B Q4_0 | 4 | 4 | pp16384 | 507.38 | 528.53 | 1.04 |
| P40 | gemma3 1B Q4_0 | 4 | 8 | pp16384 | 789.66 | 811.80 | 1.03 |
| P40 | gemma3 1B Q4_0 | 4 | 16 | pp16384 | 1579.54 | 1574.54 | 1.00 |
| P40 | gemma3 1B Q4_0 | 4 | 32 | pp16384 | 2415.65 | 2271.73 | 0.94 |
| P40 | gemma3 1B Q4_0 | 4 | 64 | pp16384 | 3250.38 | 3161.71 | 0.97 |
| P40 | gemma3 1B Q4_0 | 4 | 128 | pp16384 | 4229.45 | 4103.76 | 0.97 |
| P40 | gemma3 1B Q4_0 | 4 | 256 | pp16384 | 4751.62 | 4694.10 | 0.99 |
| P40 | gemma3 1B Q4_0 | 4 | 512 | pp16384 | 5094.48 | 5008.90 | 0.98 |
| P40 | gemma3 1B Q4_0 | 4 | 1024 | pp16384 | 5299.97 | 5233.38 | 0.99 |
| P40 | gemma3 1B Q4_0 | 4 | 2048 | pp16384 | 5076.41 | 4991.79 | 0.98 |
| P40 | llama 1B Q4_0 | 4 | 1 | pp16384 | 213.29 | 227.39 | 1.07 |
| P40 | llama 1B Q4_0 | 4 | 2 | pp16384 | 383.43 | 445.89 | 1.16 |
| P40 | llama 1B Q4_0 | 4 | 4 | pp16384 | 471.83 | 581.99 | 1.23 |
| P40 | llama 1B Q4_0 | 4 | 8 | pp16384 | 636.59 | 850.52 | 1.34 |
| P40 | llama 1B Q4_0 | 4 | 16 | pp16384 | 1218.00 | 1325.84 | 1.09 |
| P40 | llama 1B Q4_0 | 4 | 32 | pp16384 | 1758.38 | 1734.19 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 64 | pp16384 | 2092.75 | 2068.77 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 128 | pp16384 | 2336.87 | 2301.87 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 256 | pp16384 | 2533.50 | 2488.52 | 0.98 |
| P40 | llama 1B Q4_0 | 4 | 512 | pp16384 | 2584.86 | 2541.65 | 0.98 |
| P40 | llama 1B Q4_0 | 4 | 1024 | pp16384 | 2656.00 | 2621.65 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 2048 | pp16384 | 2677.45 | 2647.84 | 0.99 |
| P40 | qwen3 0.6B Q4_0 | 4 | 1 | pp16384 | 135.82 | 138.74 | 1.02 |
| P40 | qwen3 0.6B Q4_0 | 4 | 2 | pp16384 | 246.87 | 287.76 | 1.17 |
| P40 | qwen3 0.6B Q4_0 | 4 | 4 | pp16384 | 384.14 | 426.13 | 1.11 |
| P40 | qwen3 0.6B Q4_0 | 4 | 8 | pp16384 | 518.79 | 686.87 | 1.32 |
| P40 | qwen3 0.6B Q4_0 | 4 | 16 | pp16384 | 991.77 | 1137.43 | 1.15 |
| P40 | qwen3 0.6B Q4_0 | 4 | 32 | pp16384 | 1355.88 | 1385.99 | 1.02 |
| P40 | qwen3 0.6B Q4_0 | 4 | 64 | pp16384 | 1534.17 | 1654.65 | 1.08 |
| P40 | qwen3 0.6B Q4_0 | 4 | 128 | pp16384 | 1662.60 | 1824.45 | 1.10 |
| P40 | qwen3 0.6B Q4_0 | 4 | 256 | pp16384 | 1758.51 | 1967.01 | 1.12 |
| P40 | qwen3 0.6B Q4_0 | 4 | 512 | pp16384 | 1811.06 | 2056.33 | 1.14 |
| P40 | qwen3 0.6B Q4_0 | 4 | 1024 | pp16384 | 1853.22 | 2089.37 | 1.13 |
| P40 | qwen3 0.6B Q4_0 | 4 | 2048 | pp16384 | 1867.47 | 2096.54 | 1.12 |

@github-actions bot added the Nvidia GPU, python, and ggml labels on Oct 9, 2025
@ggerganov (Member)

Been doing some tests with this branch and haven't noticed any problems so far.

@IMbackK (Collaborator) left a comment

I can confirm the performance changes on gfx1030 and found no issues in brief testing.
From static analysis it looks correct, but it's a bit difficult to follow what the changes to the code in fattn-tile.cu are, since this PR includes organizational and functional code changes in one commit, which I would prefer be avoided.

@JohannesGaessler (Collaborator, Author)

> this PR includes organizational and functional code changes in one commit, which I would prefer be avoided.

I agree, but in this case the changes to the kernel itself were relatively large anyway, so I think it will need to be read in full either way. Generally speaking, would you prefer that I link the relevant WIP branches in cases like this?

@JohannesGaessler merged commit 11f0af5 into ggml-org:master on Oct 11, 2025, with 71 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 12, 2025
@IMbackK (Collaborator) commented Oct 13, 2025

> this PR includes organizational and functional code changes in one commit, which I would prefer be avoided.

> I agree, but in this case the changes to the kernel itself were relatively large anyway, so I think it will need to be read in full either way. Generally speaking, would you prefer that I link the relevant WIP branches in cases like this?

Ideally a PR like this should simply have two commits: one with the organizational changes and one with the functional changes. If that is impractical due to how the changes came about, then yes, a note on where intermediate states can be looked at would help.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 13, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 13, 2025
* origin/master: (32 commits)
metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
opencl: fix build targeting CL 2 (ggml-org#16554)
CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
metal: add support for opt_step_sgd (ggml-org#16539)
ggml : fix scalar path for computing norm (ggml-org#16558)
CANN: Update several operators to support FP16 data format (ggml-org#16251)
metal : add opt_step_adamw and op_sum (ggml-org#16529)
webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
common : handle unicode during partial json parsing (ggml-org#16526)
common : update presets (ggml-org#16504)
ggml : Fix FP16 ELU positive branch (ggml-org#16519)
hparams : add check for layer index in is_recurrent (ggml-org#16511)
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 13, 2025
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025