Skip to content

Conversation

SavicStefan
Copy link
Contributor

This PR adds support for cache_a and cache_b to load an additional vec2, and increases BK to 32 for the non-CM mul_mm.comp

Performance Comparison (Without coopmat and coopmat2) NVIDIA GeForce RTX 4060 Ti
Kernel Before(us/run) After(us/run) Δ %
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5767.79 5176.01 +10.26%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5355.88 4105.95 +23.34%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5219.90 5432.22 -4.07%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2722.40 2732.62 -0.38%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2743.99 2753.02 -0.33%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2843.99 2850.78 -0.24%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2840.88 2841.73 -0.03%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2853.15 2857.24 -0.14%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4327.78 4334.87 -0.16%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4306.28 4289.52 +0.39%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4751.79 4781.23 -0.62%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4748.76 4785.89 -0.78%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5155.43 5164.14 -0.17%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4900.78 4914.74 -0.28%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4318.07 4371.76 -1.24%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4643.73 4815.24 -3.69%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5250.76 5015.61 +4.48%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4348.33 4388.21 -0.92%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4821.34 4570.77 +5.20%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5646.37 5633.01 +0.24%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4229.37 4240.83 -0.27%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4339.20 4358.97 -0.46%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4724.33 4779.14 -1.16%
Performance Comparison (Without coopmat and coopmat2) AMD Radeon RX 7800 XT
Kernel Before(us/run) After(us/run) Δ %
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8873.61 5853.29 +34.04%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6458.76 5747.87 +11.01%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7124.22 7401.83 -3.90%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3289.51 3318.63 -0.89%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3499.61 3527.61 -0.80%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3424.27 3446.08 -0.64%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3707.70 3732.88 -0.68%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3747.02 3767.69 -0.55%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6160.74 6393.07 -3.77%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5936.61 6047.77 -1.87%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7717.80 7037.06 +8.82%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8219.73 8849.61 -7.66%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7289.05 7447.10 -2.17%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7668.33 6923.90 +9.71%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5797.82 5618.78 +3.09%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5764.74 5403.05 +6.27%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5695.78 5998.68 -5.32%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6074.55 5980.28 +1.55%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5571.36 5367.69 +3.66%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5704.28 5651.10 +0.93%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6416.39 5307.34 +17.28%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5968.62 5845.84 +2.06%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8289.75 7982.64 +3.70%

@SavicStefan SavicStefan requested a review from 0cc4m as a code owner October 17, 2025 15:52
@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Oct 17, 2025
@0cc4m
Copy link
Collaborator

0cc4m commented Oct 18, 2025

This mostly affects non-quantized tests, some positive, some negative for Nvidia and Intel, no difference on (older) AMD. Any theories on why it behaves so differently based on input type?

Actual (quantized) model performance seems mostly unaffected.

RTX 3090 (without coopmat or integer dot)
Test Before (TFLOPS) After (TFLOPS) Δ%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 15.170 14.390 -5.14%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 15.400 18.970 +23.18%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 13.920 13.500 -3.02%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 14.530 13.750 -5.37%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 14.070 16.080 +14.29%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 13.460 13.650 +1.41%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 14.150 15.930 +12.58%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 13.660 13.950 +2.12%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 13.100 15.460 +18.02%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 15.610 12.920 -17.23%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 13.570 14.890 +9.73%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 12.570 13.540 +7.72%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 17.950 18.000 +0.28%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 14.160 17.730 +25.21%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 13.420 16.060 +19.67%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 17.000 18.780 +10.47%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 14.020 16.040 +14.41%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 12.330 13.460 +9.16%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 15.630 18.370 +17.53%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 18.100 14.020 -22.54%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 12.070 11.530 -4.47%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 12.860 14.110 +9.72%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 18.540 18.240 -1.62%
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 1428.13 ± 4.99 1420.59 ± 2.74 -0.5%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 1408.76 ± 3.46 1394.17 ± 3.05 -1.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 1328.26 ± 4.41 1320.42 ± 2.91 -0.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 1319.03 ± 0.92 1295.03 ± 0.88 -1.8%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 1249.23 ± 3.43 1224.25 ± 2.11 -2.0%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 1232.04 ± 2.15 1204.84 ± 2.13 -2.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 1245.91 ± 3.63 1227.11 ± 2.81 -1.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 1227.12 ± 4.74 1207.00 ± 3.60 -1.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1417.78 ± 7.29 1403.64 ± 5.81 -1.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1394.82 ± 4.26 1376.73 ± 3.74 -1.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1392.13 ± 6.83 1391.03 ± 7.94 -0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1370.81 ± 4.58 1362.14 ± 3.75 -0.6%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1389.96 ± 5.16 1363.92 ± 5.72 -1.9%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1372.17 ± 6.30 1337.70 ± 6.51 -2.5%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 1168.93 ± 4.88 1155.29 ± 7.86 -1.2%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 1153.62 ± 8.97 1131.69 ± 9.52 -1.9%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 1372.82 ± 6.87 1360.07 ± 4.77 -0.9%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 1350.96 ± 9.15 1336.27 ± 6.46 -1.1%
Radeon Pro VII (without integer dot)
Test Before (TFLOPS) After (TFLOPS) Δ%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5.300 5.270 -0.57%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.270 4.370 +2.34%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5.070 5.160 +1.78%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.660 4.670 +0.21%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.710 4.720 +0.21%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.620 4.620 +0.00%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.700 4.720 +0.43%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.690 4.720 +0.64%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.570 4.570 +0.00%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.550 4.550 +0.00%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.740 4.770 +0.63%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.340 4.340 +0.00%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.760 4.770 +0.21%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.430 4.460 +0.68%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.100 4.110 +0.24%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.770 4.800 +0.63%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.800 4.820 +0.42%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.140 4.150 +0.24%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.520 4.540 +0.44%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.580 4.620 +0.87%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.920 3.960 +1.02%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.200 4.220 +0.48%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.660 4.690 +0.64%
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 333.56 ± 1.54 334.33 ± 0.57 +0.2%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 318.91 ± 0.29 317.94 ± 0.71 -0.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 325.10 ± 0.31 324.61 ± 0.72 -0.2%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 311.45 ± 0.37 309.19 ± 1.08 -0.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 308.37 ± 0.35 306.52 ± 1.13 -0.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 295.37 ± 0.38 292.86 ± 0.32 -0.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 296.69 ± 0.64 292.75 ± 0.97 -1.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 284.44 ± 0.28 281.94 ± 0.69 -0.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 343.14 ± 0.33 338.06 ± 0.96 -1.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 327.28 ± 0.25 324.34 ± 0.76 -0.9%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 344.91 ± 0.86 340.55 ± 0.82 -1.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 327.73 ± 1.39 326.89 ± 0.55 -0.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 334.02 ± 0.62 332.73 ± 0.63 -0.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 319.21 ± 0.35 317.35 ± 0.27 -0.6%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 383.00 ± 4.64 380.22 ± 5.54 -0.7%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 359.23 ± 2.87 353.52 ± 2.94 -1.6%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 538.00 ± 3.01 529.22 ± 5.06 -1.6%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 520.04 ± 2.56 514.61 ± 1.92 -1.0%
Intel A770 (without integer dot)
Test Before (TFLOPS) After (TFLOPS) Δ%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5.110 3.520 -31.12%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.260 3.560 +9.20%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4.770 4.000 -16.14%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.910 3.840 -1.79%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.940 3.850 -2.28%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.940 2.900 -1.36%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.930 3.920 -0.25%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.510 3.480 -0.85%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.380 3.360 -0.59%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.320 3.270 -1.51%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.150 3.150 +0.00%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.900 2.890 -0.34%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.100 3.110 +0.32%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.870 3.000 +4.53%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.850 2.850 +0.00%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.580 3.440 -3.91%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.650 3.570 -2.19%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.840 2.830 -0.35%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.030 2.820 -6.93%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.210 3.140 -2.18%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.760 2.760 +0.00%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2.840 2.660 -6.34%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3.320 3.250 -2.11%
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 302.41 ± 0.36 293.98 ± 0.48 -2.8%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 96.27 ± 0.10 95.83 ± 0.05 -0.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 229.93 ± 0.20 227.57 ± 0.27 -1.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 111.17 ± 0.08 110.85 ± 0.04 -0.3%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 236.70 ± 0.14 234.64 ± 0.16 -0.9%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 104.64 ± 0.06 104.79 ± 0.06 +0.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 225.13 ± 0.15 225.52 ± 0.14 +0.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 102.14 ± 0.01 102.36 ± 0.04 +0.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 287.46 ± 0.59 285.18 ± 0.48 -0.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 113.45 ± 0.11 113.47 ± 0.11 +0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 287.89 ± 0.57 282.68 ± 0.25 -1.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 113.53 ± 0.09 112.86 ± 0.08 -0.6%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 266.09 ± 0.51 262.90 ± 0.52 -1.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 112.69 ± 0.06 112.58 ± 0.02 -0.1%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 300.48 ± 1.18 299.92 ± 0.53 -0.2%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 119.97 ± 0.36 120.08 ± 0.36 +0.1%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 425.55 ± 2.05 426.73 ± 1.24 +0.3%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 396.40 ± 1.59 398.89 ± 2.10 +0.6%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants