CUDA: generalized (mma) FA, add Volta support #17505
Conversation
Thank you for the info, I'll work on FA for RDNA4 once this PR is merged. It looks like the logic for the transposed tile is still empty.
Testing the performance: prefill performance is greatly improved, but TG is slower. I think it's better to use `BEST_FATTN_KERNEL_VEC` for TG. On the master branch (7d2add5): `./build-volta/bin/llama-bench -m /models/llm/llama/llama-2-7b.Q4_0.gguf -fa 0,1 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024`
With this PR merged:
Thank you for reporting this issue. The performance tuning for LLaMA 2 7B in particular was suboptimal because it's a very old model that doesn't use GQA, and I forgot to test that particular scenario.
I see that with some other models as well. For example, Qwen3 14B has slightly lower TG throughput with this PR even though other models are faster or the same. With PR:
Without PR:
OK, I pulled the latest changes, both models are faster now. Qwen3moe 30B-A3B is also slightly faster.
This PR makes the following changes to the CUDA FlashAttention code:
- The KQ mask no longer needs to be padded in the `mask->ne[1]` direction. This is done by applying a modulo to the mask column that is being read, so no conditional statements need to be evaluated. The impact on performance is negligible and I do not deem it necessary to compile additional template specializations. See ggml : remove KQ mask padding #16309. cc @ggerganov.
- The `tile` template in `mma.cuh` has been extended with additional, optional arguments to safely handle situations where tiles of the same shape can have different physical data layouts.
- Handling of `__launch_bounds__` when using ROCm (as of right now ROCm is not used).
- The kernel now supports arbitrary `K->ne[1]`. As with the tile kernel, because this comes at a cost to performance it is still preferable to pad the KV cache length. As of right now this is still required to be 256; for the currently supported GPUs it should be possible to lower it to 128 without issue once the WMMA kernel has been completely replaced. For Hopper it may still make sense to have a padding of 256, but as it is I have no idea whether the 256x64 instruction would actually have better performance than the 128x64 instruction.
As of right now the interface in `mma.cuh` is suboptimal and long-term I intend to refactor it to allow the use of tensor cores in a more uniform way. However, I don't know the exact requirements until we have proper support for AMD WMMA and AMD MFMA instructions. So for now I think the correct choice is to prioritize getting working support for those at the cost of maintainability and to do a refactor afterwards.

V100 performance
Other GPU performance
The performance numbers assume that the KQ mask is no longer being padded. This change is also in this PR. I don't have a good overview of which other backends may still need support for this change and whether or not it should be reverted prior to merging.