
Conversation

SavicStefan (Contributor)

This PR adds an implementation of ACC_TYPE_VEC2. For non-coopmat shaders, using ACC_TYPE_VEC2 improves caching behavior, since accessing 32-bit values is generally more efficient than accessing 16-bit values.
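
A rough, self-contained sketch of the idea (illustrative only; the tile sizes, the sums name, and the layout below are assumptions, not the PR's actual shader code). Packing pairs of 16-bit accumulators into 32-bit f16vec2 elements means each access to the accumulator array moves a full 32-bit word:

#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32) in;

// Illustrative tile sizes.
#define TM 4
#define TN 2

// Before: one 16-bit value per accumulator element.
// float16_t sums[TM * TN];

// After: two 16-bit values packed per element, so every load/store of the
// accumulator touches a full 32-bit word.
f16vec2 sums[TM * TN / 2];

void main() {
    for (int i = 0; i < TM * TN / 2; ++i) {
        sums[i] = f16vec2(0.0hf, 0.0hf);
    }
}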

Performance Comparison (Without coopmat and coopmat2) NVIDIA GeForce RTX 4060 Ti
Name Before (us/run) After (us/run) Δ% (Improvement)
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336) 5767.64 5479.83 +4.99%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336) 5421.40 5047.91 +6.88%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336) 5281.02 6002.14 −13.66%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336) 2741.43 2748.71 −0.27%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336) 2766.60 2764.23 +0.09%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336) 2877.49 2875.25 +0.08%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336) 2869.17 2867.33 +0.06%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336) 2887.17 2890.27 −0.11%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336) 4976.57 4043.75 +18.75%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336) 4938.25 4120.32 +16.56%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336) 5287.85 4548.30 +13.99%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336) 5373.34 4566.63 +15.01%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336) 5769.13 4907.47 +14.94%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336) 5507.98 4524.96 +17.85%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336) 4877.02 4043.75 +17.07%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336) 5010.98 4112.35 +17.94%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336) 4863.99 4065.67 +16.41%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336) 4957.83 4129.54 +16.70%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336) 4583.30 3788.42 +17.34%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336) 5128.29 4280.64 +16.52%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336) 4885.91 3992.67 +18.27%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336) 4933.56 4084.30 +17.22%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336) 5389.60 4489.23 +16.67%
Performance before (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   174 runs -  5767.64 us/run -  60.13 GFLOP/run -  10.43 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   186 runs -  5421.40 us/run -  60.13 GFLOP/run -  11.09 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5281.02 us/run -  60.13 GFLOP/run -  11.39 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  366 runs -  2741.43 us/run -  60.13 GFLOP/run -  21.93 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2766.60 us/run -  60.13 GFLOP/run -  21.73 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2877.49 us/run -  60.13 GFLOP/run -  20.90 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2869.17 us/run -  60.13 GFLOP/run -  20.96 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2887.17 us/run -  60.13 GFLOP/run -  20.83 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 202 runs -  4976.57 us/run -  60.13 GFLOP/run -  12.08 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4938.25 us/run -  60.13 GFLOP/run -  12.18 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5287.85 us/run -  60.13 GFLOP/run -  11.37 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  188 runs -  5373.34 us/run -  60.13 GFLOP/run -  11.19 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  174 runs -  5769.13 us/run -  60.13 GFLOP/run -  10.42 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  182 runs -  5507.98 us/run -  60.13 GFLOP/run -  10.92 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               206 runs -  4877.02 us/run -  60.13 GFLOP/run -  12.33 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                200 runs -  5010.98 us/run -  60.13 GFLOP/run -  12.00 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 206 runs -  4863.99 us/run -  60.13 GFLOP/run -  12.36 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               202 runs -  4957.83 us/run -  60.13 GFLOP/run -  12.13 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 220 runs -  4583.30 us/run -  60.13 GFLOP/run -  13.12 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 196 runs -  5128.29 us/run -  60.13 GFLOP/run -  11.73 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                206 runs -  4885.91 us/run -  60.13 GFLOP/run -  12.31 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 204 runs -  4933.56 us/run -  60.13 GFLOP/run -  12.19 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                186 runs -  5389.60 us/run -  60.13 GFLOP/run -  11.16 TFLOPS

Performance after (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   184 runs -  5479.83 us/run -  60.13 GFLOP/run -  10.97 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   200 runs -  5047.91 us/run -  60.13 GFLOP/run -  11.91 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  168 runs -  6002.14 us/run -  60.13 GFLOP/run -  10.02 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  364 runs -  2748.71 us/run -  60.13 GFLOP/run -  21.88 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2764.23 us/run -  60.13 GFLOP/run -  21.75 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2875.25 us/run -  60.13 GFLOP/run -  20.91 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2867.33 us/run -  60.13 GFLOP/run -  20.97 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2890.27 us/run -  60.13 GFLOP/run -  20.80 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  244 runs -  4120.32 us/run -  60.13 GFLOP/run -  14.59 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4548.30 us/run -  60.13 GFLOP/run -  13.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4566.63 us/run -  60.13 GFLOP/run -  13.17 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4907.47 us/run -  60.13 GFLOP/run -  12.25 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  222 runs -  4524.96 us/run -  60.13 GFLOP/run -  13.29 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                244 runs -  4112.35 us/run -  60.13 GFLOP/run -  14.62 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4065.67 us/run -  60.13 GFLOP/run -  14.79 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               244 runs -  4129.54 us/run -  60.13 GFLOP/run -  14.56 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 264 runs -  3788.42 us/run -  60.13 GFLOP/run -  15.87 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 234 runs -  4280.64 us/run -  60.13 GFLOP/run -  14.05 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                252 runs -  3992.67 us/run -  60.13 GFLOP/run -  15.06 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4084.30 us/run -  60.13 GFLOP/run -  14.72 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                224 runs -  4489.23 us/run -  60.13 GFLOP/run -  13.39 TFLOPS

SavicStefan requested a review from 0cc4m as a code owner on September 23, 2025 at 15:07
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 23, 2025
0cc4m (Collaborator) commented Sep 27, 2025

test-backend-ops -o MUL_MAT_ID is fine on AMD and Intel, but not passing on Nvidia. Something is not fully correct yet. The only difference I can think of is that Nvidia uses the large shader variant. Does it pass for you?

Here are performance results from my devices. It's very good for Nvidia Ampere (which won't be using the code in practice due to coopmat), but neutral or negative on AMD. Not sure why this is.

RTX 3090 without coopmat or integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 1246.28 ± 3.00 1489.61 ± 4.57 +19.5%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 1232.94 ± 1.94 1460.53 ± 2.86 +18.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 1170.17 ± 4.69 1369.08 ± 2.46 +17.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 1153.60 ± 4.26 1345.56 ± 0.97 +16.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 1090.78 ± 3.64 1289.07 ± 2.33 +18.2%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 1078.59 ± 1.14 1266.90 ± 0.73 +17.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 1097.40 ± 1.35 1268.20 ± 1.71 +15.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 1079.72 ± 4.00 1245.62 ± 4.37 +15.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1223.43 ± 3.87 1471.25 ± 7.55 +20.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1202.96 ± 6.81 1437.72 ± 6.67 +19.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1213.63 ± 4.77 1439.74 ± 4.71 +18.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1187.41 ± 5.71 1411.04 ± 2.48 +18.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1206.44 ± 4.42 1440.83 ± 8.28 +19.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1190.69 ± 3.29 1410.32 ± 7.14 +18.4%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 875.14 ± 8.27 1082.85 ± 8.25 +23.7%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 857.42 ± 7.47 1077.03 ± 3.25 +25.6%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 1008.49 ± 6.09 1453.68 ± 9.48 +44.1%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 1013.39 ± 13.40 1443.74 ± 5.54 +42.5%
AMD Radeon Pro VII without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 331.84 ± 1.24 331.86 ± 1.16 +0.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 318.19 ± 0.34 316.42 ± 0.62 -0.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 324.79 ± 0.89 322.82 ± 0.64 -0.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 311.98 ± 0.42 309.54 ± 0.27 -0.8%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 308.80 ± 0.45 304.46 ± 1.18 -1.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 296.04 ± 0.09 291.84 ± 0.58 -1.4%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 295.78 ± 1.20 293.10 ± 1.40 -0.9%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 284.57 ± 0.66 280.86 ± 0.17 -1.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 342.79 ± 0.35 336.89 ± 1.07 -1.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 327.33 ± 0.26 324.10 ± 0.64 -1.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 344.59 ± 0.38 338.37 ± 0.36 -1.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 328.27 ± 0.70 324.29 ± 0.14 -1.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 335.49 ± 1.22 330.65 ± 0.35 -1.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 320.89 ± 0.58 317.12 ± 0.25 -1.2%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 390.62 ± 2.24 379.93 ± 3.53 -2.7%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 361.05 ± 3.19 353.02 ± 2.66 -2.2%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 536.31 ± 4.08 524.24 ± 6.29 -2.3%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 528.92 ± 5.79 522.94 ± 5.67 -1.1%
Intel A770 without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 302.05 ± 0.26 281.15 ± 0.84 -6.9%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 103.91 ± 0.07 91.16 ± 0.06 -12.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 233.62 ± 0.21 229.47 ± 0.29 -1.8%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 116.52 ± 0.04 97.96 ± 0.09 -15.9%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 234.15 ± 0.24 232.09 ± 0.31 -0.9%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 107.71 ± 0.10 91.82 ± 0.05 -14.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 229.22 ± 0.39 227.45 ± 0.42 -0.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 105.69 ± 0.07 90.58 ± 0.06 -14.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 292.04 ± 1.09 288.26 ± 0.27 -1.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 121.09 ± 0.06 98.07 ± 0.11 -19.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 291.05 ± 0.34 282.37 ± 0.33 -3.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 121.47 ± 0.10 98.10 ± 0.10 -19.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 268.73 ± 0.53 266.03 ± 0.39 -1.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 118.45 ± 0.08 100.82 ± 0.08 -14.9%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 299.10 ± 1.24 300.25 ± 1.39 +0.4%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 123.48 ± 0.32 109.35 ± 0.34 -11.4%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 425.12 ± 1.63 426.10 ± 2.06 +0.2%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 403.26 ± 4.00 401.98 ± 4.18 -0.3%

SavicStefan (Contributor, Author)

So it should be:

// Old
const uint sums_idx = (cr * WNITER + wsic) * (WMITER * TN) + cc * TN + wsir;

// New
const uint sums_idx = (cr * WNITER + wsic) * (WMITER * TN) + cc * WMITER + wsir;

But for some reason these two give completely different performance.
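
For illustration (with assumed values, say WNITER = 2, WMITER = 2, TN = 4): at (cr, wsic, cc, wsir) = (0, 0, 1, 1) the old expression gives 1 * 4 + 1 = 5, while the new one gives 1 * 2 + 1 = 3. The two formulas coincide only when TN == WMITER; otherwise they lay the sums out in a different order, which changes the access pattern and could explain the performance difference.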

0cc4m (Collaborator) commented Oct 3, 2025

No, that doesn't fix the issue for me. Can you reproduce it on your end? All the n=129 mul_mat_id tests fail on my RTX 3090 if I disable coopmat and coopmat2.

SavicStefan (Contributor, Author)

For me everything is passing, including MUL_MAT and MUL_MAT_ID, on the NVIDIA RTX 4060 Ti, RTX 3060, RTX 2060, and AMD Radeon RX 7800 XT.

0cc4m (Collaborator) commented Oct 4, 2025

Now the tests are passing, but I'm seeing a major regression with mul_mat_id on Nvidia. Do you also see that?

Intel is also not looking good, but I am not sure why only the FA runs are affected.

RTX 3090 without coopmat or integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 1249.84 ± 2.24 1236.76 ± 3.65 -1.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 1235.29 ± 4.47 1221.49 ± 1.13 -1.1%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 1184.77 ± 3.37 1166.32 ± 5.76 -1.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 1167.65 ± 4.05 1148.48 ± 4.41 -1.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 1105.88 ± 2.95 1091.12 ± 2.51 -1.3%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 1089.83 ± 1.81 1077.27 ± 2.82 -1.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 1112.28 ± 3.66 1085.31 ± 1.16 -2.4%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 1097.07 ± 0.50 1069.25 ± 3.60 -2.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1239.38 ± 4.73 1228.66 ± 3.40 -0.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1220.21 ± 0.75 1200.66 ± 10.82 -1.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1229.71 ± 5.00 1201.45 ± 4.66 -2.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1211.11 ± 2.40 1174.88 ± 4.91 -3.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1221.42 ± 3.47 1195.96 ± 6.15 -2.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1204.10 ± 1.47 1162.32 ± 15.64 -3.5%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 854.04 ± 5.10 499.91 ± 3.58 -41.5%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 842.91 ± 9.01 495.56 ± 6.38 -41.2%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 1003.03 ± 4.29 502.41 ± 2.07 -49.9%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 983.62 ± 4.70 491.35 ± 3.97 -50.0%
AMD Radeon Pro VII without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 333.04 ± 1.28 332.77 ± 0.81 -0.1%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 317.96 ± 0.84 317.02 ± 0.48 -0.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 324.73 ± 0.32 321.44 ± 0.90 -1.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 310.50 ± 0.30 308.18 ± 0.35 -0.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 307.57 ± 0.74 305.04 ± 0.33 -0.8%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 294.61 ± 1.18 292.36 ± 0.54 -0.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 297.14 ± 0.23 293.23 ± 0.85 -1.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 284.75 ± 0.26 281.07 ± 0.33 -1.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 341.99 ± 0.54 337.60 ± 1.56 -1.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 326.74 ± 0.18 322.81 ± 0.48 -1.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 344.16 ± 1.24 337.19 ± 0.41 -2.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 327.87 ± 0.40 323.54 ± 0.83 -1.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 336.78 ± 1.05 330.38 ± 0.68 -1.9%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 321.49 ± 0.31 316.69 ± 0.72 -1.5%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 384.21 ± 3.93 375.36 ± 2.77 -2.3%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 358.21 ± 3.31 352.74 ± 4.74 -1.5%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 532.04 ± 6.30 525.63 ± 3.20 -1.2%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 517.19 ± 3.89 511.98 ± 5.30 -1.0%
Intel A770 without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 301.73 ± 0.58 279.56 ± 0.32 -7.3%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 103.72 ± 0.07 91.05 ± 0.10 -12.2%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 233.63 ± 0.22 228.94 ± 0.92 -2.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 115.97 ± 0.06 97.72 ± 0.06 -15.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 233.50 ± 0.37 230.96 ± 1.19 -1.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 107.43 ± 0.07 91.83 ± 0.04 -14.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 229.20 ± 0.29 225.28 ± 2.13 -1.7%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 105.52 ± 0.10 90.61 ± 0.06 -14.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 291.20 ± 0.74 286.64 ± 1.28 -1.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 120.73 ± 0.04 98.32 ± 0.11 -18.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 290.40 ± 0.42 280.94 ± 1.29 -3.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 120.99 ± 0.03 98.43 ± 0.04 -18.6%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 268.07 ± 0.58 265.05 ± 0.43 -1.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 118.04 ± 0.04 100.91 ± 0.06 -14.5%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 297.46 ± 0.93 298.04 ± 0.72 +0.2%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 123.23 ± 0.33 109.07 ± 0.44 -11.5%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 424.90 ± 0.82 425.06 ± 1.08 +0.0%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 395.22 ± 2.66 393.56 ± 2.66 -0.4%

0cc4m (Collaborator) commented Oct 13, 2025

I'm still waiting to hear whether you also see the Nvidia regression.

github-actions bot added the python (python script changes) label on Oct 13, 2025
SavicStefan (Contributor, Author) commented Oct 13, 2025

Performance Comparison (Without coopmat and coopmat2) NVIDIA GeForce RTX 4060 Ti
Kernel Before (us/run) After (us/run) Δ%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5792.89 5763.41 +0.51%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5381.34 5394.30 -0.24%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5257.40 5242.01 +0.29%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2723.99 2730.32 -0.23%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2753.02 2778.81 -0.94%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2855.15 2856.38 -0.04%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2843.80 2844.25 -0.02%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2867.90 2863.70 +0.15%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4949.09 4326.33 +12.58%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4870.27 4313.31 +11.44%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5247.09 4759.04 +9.30%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5375.22 4748.48 +11.66%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5743.62 5168.56 +10.01%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5503.70 4901.42 +10.94%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4988.72 4357.20 +12.66%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5158.45 4028.82 +21.90%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4958.32 4393.31 +11.40%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5027.39 4315.70 +14.16%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6110.26 5009.41 +18.02%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5718.64 5030.39 +12.04%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4870.25 4238.55 +12.97%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4950.08 4366.56 +11.79%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5310.09 4723.28 +11.05%
Performance Comparison (Without coopmat and coopmat2) AMD Radeon RX 7800 XT
Kernel Before (us/run) After (us/run) Δ%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8877.93 8830.00 +0.54%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6353.04 6526.51 -2.73%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7147.87 7219.50 -1.00%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3327.31 3350.99 -0.71%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3540.99 3556.59 -0.44%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3452.07 3483.37 -0.91%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3738.56 3779.61 -1.10%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3771.78 3799.56 -0.74%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5477.72 5822.40 -6.29%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5977.67 6078.14 -1.68%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7511.03 7829.84 -4.24%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8022.06 8015.15 +0.09%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7063.25 7409.50 -4.90%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7477.86 7793.75 -4.22%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5475.90 5892.24 -7.60%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5515.12 5853.49 -6.14%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5620.81 5733.57 -2.01%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5732.37 6135.72 -7.04%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5404.80 5637.84 -4.31%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5527.86 5741.80 -3.87%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5360.15 6463.96 -20.59%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5796.81 6032.64 -4.07%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8312.47 7739.69 +6.89%

0cc4m (Collaborator) commented Oct 14, 2025

Yeah, that fixed the regression, nice job! I'm wondering why it doesn't make a positive difference for AMD or Intel. Intel especially doesn't like it much, but I don't think that's a big enough problem to block this PR, especially since Intel doesn't work that well anyway without integer dot. Thank you for the contribution and your patience!

RTX 3090 without coopmat or integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 1259.73 ± 4.46 1403.94 ± 2.99 +11.4%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 1248.28 ± 0.69 1384.31 ± 3.78 +10.9%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 1187.40 ± 7.29 1308.19 ± 3.82 +10.2%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 1168.03 ± 5.62 1285.64 ± 0.92 +10.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 1106.91 ± 4.17 1218.55 ± 1.94 +10.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 1093.26 ± 1.14 1203.99 ± 3.24 +10.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 1114.69 ± 2.23 1216.71 ± 3.70 +9.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 1099.75 ± 1.76 1187.63 ± 12.68 +8.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1239.55 ± 3.51 1385.42 ± 4.70 +11.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1222.05 ± 3.65 1354.77 ± 2.97 +10.9%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1229.52 ± 1.89 1367.20 ± 5.96 +11.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1212.47 ± 4.52 1333.52 ± 10.41 +10.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1221.09 ± 4.24 1366.96 ± 5.18 +11.9%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1204.54 ± 3.55 1340.80 ± 13.49 +11.3%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 853.50 ± 5.16 1152.78 ± 6.26 +35.1%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 842.17 ± 10.25 1128.58 ± 10.07 +34.0%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 1002.14 ± 4.22 1353.03 ± 4.29 +35.0%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 984.17 ± 5.52 1330.16 ± 10.06 +35.2%
AMD Radeon Pro VII without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 332.03 ± 0.52 329.15 ± 0.55 -0.9%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 318.17 ± 0.33 315.19 ± 0.54 -0.9%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 324.83 ± 0.75 319.68 ± 0.94 -1.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 311.45 ± 0.22 307.89 ± 0.85 -1.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 308.51 ± 0.54 303.37 ± 1.23 -1.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 295.57 ± 0.29 291.61 ± 0.55 -1.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 296.94 ± 0.55 292.38 ± 1.28 -1.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 284.59 ± 0.41 280.78 ± 0.22 -1.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 342.74 ± 1.31 336.49 ± 1.32 -1.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 327.17 ± 0.35 322.95 ± 0.61 -1.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 344.18 ± 1.32 340.97 ± 1.46 -0.9%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 327.60 ± 0.93 324.38 ± 0.27 -1.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 337.78 ± 0.74 330.34 ± 0.33 -2.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 321.37 ± 0.54 316.05 ± 0.15 -1.7%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 385.67 ± 1.52 373.75 ± 0.66 -3.1%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 359.19 ± 3.28 351.91 ± 4.43 -2.0%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 539.39 ± 2.70 524.84 ± 1.63 -2.7%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 521.23 ± 2.10 510.12 ± 2.71 -2.1%
Intel A770 without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 301.55 ± 0.39 301.27 ± 0.59 -0.1%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 103.76 ± 0.07 96.16 ± 0.05 -7.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 233.63 ± 0.14 229.97 ± 0.11 -1.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 115.97 ± 0.04 111.24 ± 0.03 -4.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 233.67 ± 0.29 235.43 ± 0.37 +0.8%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 107.36 ± 0.05 104.73 ± 0.06 -2.4%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 229.20 ± 0.26 225.68 ± 0.19 -1.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 105.48 ± 0.05 102.15 ± 0.06 -3.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 290.97 ± 0.51 287.01 ± 0.42 -1.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 120.74 ± 0.08 113.50 ± 0.09 -6.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 290.20 ± 0.19 287.12 ± 0.28 -1.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 121.08 ± 0.07 113.57 ± 0.11 -6.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 267.74 ± 0.31 265.79 ± 0.38 -0.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 118.09 ± 0.08 112.74 ± 0.07 -4.5%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 0 pp512 297.64 ± 0.58 300.17 ± 0.85 +0.9%
qwen3moe 30B.A3B Q2_K - Medium 10.48 GiB 30.53 B Vulkan 99 1 pp512 123.17 ± 0.34 120.06 ± 0.35 -2.5%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 0 pp512 424.81 ± 0.53 424.87 ± 0.89 +0.0%
gpt-oss 20B Q8_0 11.27 GiB 20.91 B Vulkan 99 1 pp512 396.00 ± 2.42 395.35 ± 2.13 -0.2%

0cc4m merged commit ffa0590 into ggml-org:master on Oct 14, 2025 (55 of 59 checks passed)
ddh0 added a commit to ddh0/llama.cpp that referenced this pull request Oct 14, 2025
* cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)

* remove legacy copy-op pointer indirection code

* further removal of copy-op indirection code

* renamed check_node_graph_compatibility_and_refresh_copy_ops function

* CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)

* CUDA: kernel for larger batch sizes for MoE

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* fixup

* tests

* Move mmq_ids_helper to mmid

* cleanup

* Remove redundant checks

* CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)

* CUDA: use fastdiv + ggml_cuda_mad for mmvf

* use bf16 directly + fix formatting

* Add exception for HIP code

* CUDA: enable FA for FP32 KV cache (ggml-org#16546)

* vulkan: Improve build time for MSVC (ggml-org#16545)

Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel.

Enable /MP so source files are compiled in parallel.

* vulkan: Support FA with K/V in F32 (ggml-org#16543)

* CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)

* vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)

Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>

* metal : avoid using Metal's gpuAddress property (ggml-org#16576)

* metal : avoid using Metal's gpuAddress property

* metal : fix rope kernels buffer check

---------

Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Anav Prasad <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Co-authored-by: SavicStefan <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>