Conversation

@lovedheart

./build/bin/Release/test-backend-ops.exe perf -o MUL_MAT -p type_a=iq1_m

Tested on AMD 8845HS 780M iGPU

| n   | PR: μs/run | PR: GFLOPS | Main: μs/run | Main: GFLOPS | Speedup vs Main |
|----:|-----------:|-----------:|-------------:|-------------:|----------------:|
| 1   |     224.28 |     523.63 |       282.44 |       415.80 |           1.26x |
| 2   |     310.53 |     756.38 |       385.04 |       610.01 |           1.24x |
| 3   |     408.65 |     862.15 |       515.79 |       683.08 |           1.26x |
| 4   |     589.40 |     797.02 |      1244.08 |       377.60 |           2.11x |
| 5   |    1075.96 |     545.75 |      4427.85 |       132.62 |           4.11x |
| 8   |    2576.61 |     364.64 |      4985.43 |       188.45 |           1.94x |
| 512 |   11601.05 |    5180.00 |     11948.15 |      5030.00 |           1.03x |
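
For reference, the GFLOPS figures follow directly from the problem size: each run performs $2\,m\,n\,k$ floating-point operations (here m=4096, k=14336, as in the MUL_MAT cases below), and throughput is that count divided by the measured time per run. For the n=1 row:

$$
\text{FLOP/run} = 2 \cdot 4096 \cdot 1 \cdot 14336 \approx 117.44\ \text{MFLOP},
\qquad
\frac{117.44\ \text{MFLOP}}{224.28\ \mu\text{s}} \approx 523.6\ \text{GFLOPS}.
$$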

@lovedheart requested a review from 0cc4m as a code owner on November 1, 2025 00:03
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 1, 2025
return;

// Number of rows to process for this workgroup
const uint rows_to_process = min(NUM_ROWS, p.stride_d - first_row);

Collaborator:
I'm pretty surprised if it helped to make the changes in this function - this will prevent the compiler from unrolling loops.
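
For illustration, a minimal sketch of the two loop shapes being discussed (identifiers borrowed from the shader; the surrounding code is abridged):

```glsl
// Compile-time bound: [[unroll]] can fully unroll this loop.
[[unroll]] for (uint i = 0; i < NUM_ROWS; ++i) {
    temp[j][i] = FLOAT_TYPE(0);
}

// Runtime-clamped bound: the trip count is only known at runtime,
// so the compiler generally cannot fully unroll the loop anymore.
const uint rows_to_process = min(NUM_ROWS, p.stride_d - first_row);
for (uint i = 0; i < rows_to_process; ++i) {
    temp[j][i] = FLOAT_TYPE(0);
}
```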

Collaborator:
I don't see a difference from adding this, so I would prefer to keep it as it was.

@lovedheart Can you benchmark if the changes in the main function make a difference for you?

@0cc4m (Collaborator) commented on Nov 7, 2025

I don't see much of a difference either way. Maybe a slight improvement on RDNA3 for n=1, maybe slightly negative on GCN, Nvidia, and Intel. Hard to tell; it's close to run-to-run variance.

AMD RX 8060S

ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

before:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               13632 runs -    74.80 us/run - 117.44 MFLOP/run -   1.57 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               11928 runs -    85.84 us/run - 234.88 MFLOP/run -   2.74 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                8804 runs -   113.87 us/run - 352.32 MFLOP/run -   3.09 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                7029 runs -   144.38 us/run - 469.76 MFLOP/run -   3.25 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3420 runs -   300.60 us/run - 587.20 MFLOP/run -   1.95 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 856 runs -  1247.46 us/run - 939.52 MFLOP/run - 753.15 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               182 runs -  5539.74 us/run -  60.13 GFLOP/run -  10.85 TFLOPS

after:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               14484 runs -    71.63 us/run - 117.44 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               11076 runs -    92.69 us/run - 234.88 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                9088 runs -   113.53 us/run - 352.32 MFLOP/run -   3.10 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                7242 runs -   140.27 us/run - 469.76 MFLOP/run -   3.35 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                6156 runs -   165.65 us/run - 587.20 MFLOP/run -   3.54 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 749 runs -  1423.48 us/run - 939.52 MFLOP/run - 660.02 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               174 runs -  5764.40 us/run -  60.13 GFLOP/run -  10.43 TFLOPS

AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

before:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               11076 runs -    97.17 us/run - 117.44 MFLOP/run -   1.21 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                7242 runs -   144.39 us/run - 234.88 MFLOP/run -   1.63 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4260 runs -   249.32 us/run - 352.32 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3195 runs -   313.11 us/run - 469.76 MFLOP/run -   1.50 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2736 runs -   368.32 us/run - 587.20 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 428 runs -  3086.22 us/run - 939.52 MFLOP/run - 304.43 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                78 runs - 12824.58 us/run -  60.13 GFLOP/run -   4.69 TFLOPS

after:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                9372 runs -   110.16 us/run - 117.44 MFLOP/run -   1.07 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                6816 runs -   151.09 us/run - 234.88 MFLOP/run -   1.55 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4260 runs -   236.26 us/run - 352.32 MFLOP/run -   1.49 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3621 runs -   281.72 us/run - 469.76 MFLOP/run -   1.67 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3078 runs -   325.78 us/run - 587.20 MFLOP/run -   1.80 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 321 runs -  3558.65 us/run - 939.52 MFLOP/run - 264.01 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                78 runs - 12860.79 us/run -  60.13 GFLOP/run -   4.68 TFLOPS

Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

before:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                7668 runs -   139.32 us/run - 117.44 MFLOP/run - 842.96 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2556 runs -   405.05 us/run - 234.88 MFLOP/run - 579.89 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1136 runs -   956.64 us/run - 352.32 MFLOP/run - 368.29 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 426 runs -  3181.79 us/run - 469.76 MFLOP/run - 147.64 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 342 runs -  5578.36 us/run - 587.20 MFLOP/run - 105.26 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 107 runs -  9632.67 us/run - 939.52 MFLOP/run -  97.54 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                66 runs - 15407.65 us/run -  60.13 GFLOP/run -   3.90 TFLOPS

after:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                7668 runs -   143.76 us/run - 117.44 MFLOP/run - 816.93 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2982 runs -   377.97 us/run - 234.88 MFLOP/run - 621.42 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1420 runs -   747.32 us/run - 352.32 MFLOP/run - 471.45 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 639 runs -  1968.56 us/run - 469.76 MFLOP/run - 238.63 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 513 runs -  2413.24 us/run - 587.20 MFLOP/run - 243.33 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 107 runs -  9919.79 us/run - 939.52 MFLOP/run -  94.71 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                66 runs - 15310.42 us/run -  60.13 GFLOP/run -   3.93 TFLOPS

Nvidia RTX 3090

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

before:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               11928 runs -    83.89 us/run - 117.44 MFLOP/run -   1.40 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                9798 runs -   103.71 us/run - 234.88 MFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4828 runs -   208.74 us/run - 352.32 MFLOP/run -   1.69 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                5112 runs -   201.51 us/run - 469.76 MFLOP/run -   2.33 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2907 runs -   358.87 us/run - 587.20 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2247 runs -   448.54 us/run - 939.52 MFLOP/run -   2.09 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               682 runs -  1467.35 us/run -  60.13 GFLOP/run -  40.98 TFLOPS

after:
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               11928 runs -    85.92 us/run - 117.44 MFLOP/run -   1.37 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                6816 runs -   148.84 us/run - 234.88 MFLOP/run -   1.58 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                5112 runs -   198.05 us/run - 352.32 MFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3621 runs -   286.45 us/run - 469.76 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3249 runs -   311.52 us/run - 587.20 MFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 749 runs -  1439.02 us/run - 939.52 MFLOP/run - 652.89 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               706 runs -  1418.96 us/run -  60.13 GFLOP/run -  42.38 TFLOPS

@lovedheart (Author)

The code seems to fix the performance only on Windows; on Linux, I cannot see the improvement.

For comparison, ROCm produced:

D:\llama_latest>build\bin\test-backend-ops.exe perf -o MUL_MAT -p iq1_m
HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M Graphics, gfx1103 (0x1103), VMM: no, Wave Size: 32
Testing 2 devices

Backend 1/2: ROCm0
  Device description: AMD Radeon 780M Graphics
  Device memory: 59327 MB (59175 MB free)

  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                5112 runs -   210.56 us/run - 117.44 MFLOP/run - 557.76 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4260 runs -   257.42 us/run - 234.88 MFLOP/run - 912.44 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3124 runs -   323.46 us/run - 352.32 MFLOP/run -   1.09 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2556 runs -   414.22 us/run - 469.76 MFLOP/run -   1.13 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2052 runs -   495.51 us/run - 587.20 MFLOP/run -   1.19 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1284 runs -   808.00 us/run - 939.52 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                88 runs - 11595.92 us/run -  60.13 GFLOP/run -   5.19 TFLOPS
  Backend ROCm0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@0cc4m (Collaborator) left a comment:

Overall it's fine, but please clean up the purely cosmetic code changes and check whether the main function changes are necessary.

[[unroll]] for (uint i = 0; i < NUM_ROWS; ++i)
temp[j][i] = FLOAT_TYPE(0);
}
}

Collaborator:
All of the above changes in this function are just code style; please revert them. It's okay to improve readability and style of code you're touching anyway, but that doesn't apply here. I also prefer to keep curly brackets after loops or ifs.
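
A small sketch of the style point (hypothetical lines, not the actual diff):

```glsl
// Brace-less single-statement body (the form being asked to revert):
[[unroll]] for (uint i = 0; i < NUM_ROWS; ++i)
    temp[j][i] = FLOAT_TYPE(0);

// Preferred: keep the curly brackets even for a one-line body.
[[unroll]] for (uint i = 0; i < NUM_ROWS; ++i) {
    temp[j][i] = FLOAT_TYPE(0);
}
```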

FLOAT_TYPE temp[NUM_COLS][NUM_ROWS];

void calc_superblock(const uint a_offset, const uint b_offset, const uint ib32, const uint i, const uint num_blocks_per_row, const uint first_row, const uint num_rows) {
// ------------------ calc_superblock (final optimized version) ------------------

Collaborator:
The comment isn't necessary.
