Submission 2025-05-08
=====================

SIMD Lanes
----------

This section considers matrix-matrix multiplications that require instructions in which only a subset of the SIMD lanes is active.

1. Implement a kernel for M=14, N=6 and K=64 and wrap it in the matmul_14_6_64 function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``neon_4_1.s``

For the kernel ``matmul_14_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that each column of C is now updated with three ``fmla`` instructions that operate on 4 lanes and one ``fmla`` instruction that operates on only 2 lanes: :math:`4 \cdot 3 + 1 \cdot 2 = 14`.
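
For reference, the operation the kernel is expected to perform can be written as the plain C sketch below. The signature with explicit leading dimensions (given in elements) is an assumption based on the stride registers ``x3``, ``x4`` and ``x5`` used in the assembly, not necessarily the exact interface of the submission.

.. code-block:: c
    :linenos:

    #include <stdint.h>

    // Reference semantics of matmul_14_6_64 (C += A * B), assuming column-major
    // storage; lda/ldb/ldc are leading dimensions in elements (hypothetical names).
    void matmul_14_6_64_ref(const float *a, const float *b, float *c,
                            int64_t lda, int64_t ldb, int64_t ldc) {
        for (int64_t n = 0; n < 6; n++) {          // columns of C and B
            for (int64_t k = 0; k < 64; k++) {     // reduction dimension
                float b_kn = b[k + n * ldb];       // B(k, n)
                for (int64_t m = 0; m < 14; m++)   // rows of C and A
                    c[m + n * ldc] += a[m + k * lda] * b_kn;
            }
        }
    }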

We load the full 16 floats and ignore the last 2:

.. code-block:: asm
    :linenos:

    ...
    // Load first column from the 14x6 matrix c - load full 16 entries - ignore last 2
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column from the 14x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column from the 14x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column from the 14x6 matrix c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5
    // Load fifth column from the 14x6 matrix c
    ld1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5
    // Load sixth column from the 14x6 matrix c
    ld1 {v13.4s, v14.4s, v15.4s, v16.4s}, [x2], x5
    ...

Next, the loop over K:

.. code-block:: asm
    :linenos:

    ...
    mov x9, #64 // x9 iterator for K loop
    matmul_loop_over_K:
    sub x9, x9, #1

    // Load first column data from the 14x1 matrix a (again 16 floats, but only two lanes of v3 are used)
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3

    // run the known matmul_16_6_1_unrolled kernel, modified to matmul_14_6_1
    // Load first element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0] // 4 floats
    fmla v26.4s, v1.4s, v4.s[0] // 4 floats
    fmla v27.4s, v2.4s, v4.s[0] // 4 floats
    fmla v28.2s, v3.2s, v4.s[0] // 2 floats

    // Load second element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.2s, v3.2s, v4.s[0]
    ...

For the first five columns we store the full 16 computed floats but advance the pointer by only 14 floats, so the two excess values land at the start of the next column and are overwritten by that column's own store. The sixth column is stored exactly, as 14 values (8 + 4 + 2), so that we never write past the end of C.

.. code-block:: asm
    :linenos:

    ...
    // Store first column back to memory
    st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5 // offset of 14 floats
    // Store second column back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5 // offset of 14 floats
    // Store third column back to memory
    st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5 // offset of 14 floats
    // Store fourth column back to memory
    st1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5 // offset of 14 floats
    // Store fifth column back to memory
    st1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5 // offset of 14 floats
    // Store sixth column back to memory (exactly the last 14 elements)
    stp q13, q14, [x2] // 8 floats
    str q15, [x2, #32] // 4 floats
    str d16, [x2, #48] // 2 floats
    ...

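
The store pattern can be illustrated with a small, hypothetical C sketch (assuming C is tightly packed, i.e. a leading dimension of 14 floats, which is what the "offset of 14 floats" comments imply):

.. code-block:: c
    :linenos:

    #include <string.h>

    // Hypothetical illustration of the store pattern above; acc holds the six
    // computed columns (16 floats each), c points to the 14x6 output matrix.
    static void store_14x6(const float acc[6][16], float *c) {
        const long ldc = 14;                                  // assumed leading dimension
        for (int j = 0; j < 5; j++)                           // first five columns
            memcpy(c + j * ldc, acc[j], 16 * sizeof(float));  // 16 floats, advance by 14;
                                                              // the 2-float spill is fixed
                                                              // by the next column's store
        memcpy(c + 5 * ldc, acc[5], 14 * sizeof(float));      // last column: exactly 14 floats
    }
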
|
2. Implement a kernel for M=15, N=6 and K=64 and wrap it in the matmul_15_6_64 function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``neon_4_2.s``

For the kernel ``matmul_15_6_64`` we again adapt the already implemented kernel ``matmul_16_6_64``. The only change is that all 16 values per column are still computed with four full 4-lane ``fmla`` instructions, and the 16th value is simply ignored when storing back to memory.

We load the full 16 floats and ignore the last one:

.. code-block:: asm
    :linenos:

    ...
    // Load first column from the 15x6 matrix c - load full 16 entries - ignore last
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column from the 15x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column from the 15x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column from the 15x6 matrix c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5
    // Load fifth column from the 15x6 matrix c
    ld1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5
    // Load sixth column from the 15x6 matrix c
    ld1 {v13.4s, v14.4s, v15.4s, v16.4s}, [x2], x5
    ...

Next, the loop over K:

.. code-block:: asm
    :linenos:

    ...
    mov x9, #64 // x9 iterator for K loop
    matmul_loop_over_K:
    sub x9, x9, #1

    // Load first column data from the 15x1 matrix a
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3
    // ldp q0, q1, [x0] // 4 + 4 values
    // ldr q2, [x0, #32] // 4 values
    // ldr d3, [x0, #48] // 2 values

    // run the known matmul_16_6_1_unrolled kernel, modified to matmul_15_6_1
    // Load first element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0]
    fmla v26.4s, v1.4s, v4.s[0]
    fmla v27.4s, v2.4s, v4.s[0]
    fmla v28.4s, v3.4s, v4.s[0]

    // Load second element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]
    ...

For the first five columns we store the full 16 computed floats but advance the pointer by only 15 floats; the single excess value is overwritten by the next column's store. The sixth column is stored exactly, as 15 values (8 + 4 + 2 + 1), so that we never write past the end of C.

.. code-block:: asm
    :linenos:

    ...
    // Store first column back to memory
    st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5 // offset of 15 floats
    // Store second column back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5 // offset of 15 floats
    // Store third column back to memory
    st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5 // offset of 15 floats
    // Store fourth column back to memory
    st1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5 // offset of 15 floats
    // Store fifth column back to memory
    st1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5 // offset of 15 floats
    // Store sixth column back to memory (exactly the last 15 elements)
    stp q13, q14, [x2] // 8 floats
    str q15, [x2, #32] // 4 floats
    str d16, [x2, #48] // 2 floats
    mov w9, v16.s[2]
    str w9, [x2, #56] // 1 float
    ...

|
3. Test and optimize the kernels. Report your performance in GFLOPS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Optimized benchmark results:

.. code-block:: text
    :emphasize-lines: 4, 8

    --------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                            Time             CPU   Iterations        FLOPS
    --------------------------------------------------------------------------------------------------------------------------------------------
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_mean         94.8 ns         94.5 ns           10       113.789G/s
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_median       94.8 ns         94.5 ns           10       113.775G/s
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_stddev      0.671 ns        0.659 ns           10       790.609M/s
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_cv           0.71 %          0.70 %            10            0.69%
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_mean         95.5 ns         95.1 ns           10       121.074G/s
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_median       95.4 ns         95.1 ns           10        121.09G/s
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_stddev      0.295 ns        0.293 ns           10       373.529M/s
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_cv           0.31 %          0.31 %            10            0.31%

- **matmul_14_6_64** kernel: :math:`113.8` GFLOPS
- **matmul_15_6_64** kernel: :math:`121.1` GFLOPS
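
As a sanity check, a matrix-matrix multiplication of these sizes performs :math:`2 \cdot M \cdot N \cdot K` floating-point operations, so the reported GFLOPS follow directly from the measured CPU times:

.. math::

   \frac{2 \cdot 14 \cdot 6 \cdot 64}{94.5\,\text{ns}} \approx 113.8\ \text{GFLOPS},
   \qquad
   \frac{2 \cdot 15 \cdot 6 \cdot 64}{95.1\,\text{ns}} \approx 121.1\ \text{GFLOPS}.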

Accumulator Block Shapes
------------------------

This section considers a matrix-matrix multiplication where a high-performance implementation may require accumulator blocks with different shapes.

1. Implement a kernel for M=64, N=64 and K=64 and wrap it in the matmul_64_64_64 function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``neon_5_1.s``

For the kernel ``matmul_64_64_64`` we adapt the already implemented kernel ``matmul_64_48_64``. The only change is that we removed two ``fmla`` blocks from the inner loop, so each iteration of the K loop now updates a 16x4 block of C instead of a 16x6 block:

.. code-block:: asm
    :linenos:

    ...
    mov x15, #64 // x15 iterator for K loop
    matmul_loop_over_K:
    sub x15, x15, #1

    // Load first column data from the 16x1 matrix a
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3

    // run the matmul_16_4_1_unrolled kernel
    // Load first element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0]
    fmla v26.4s, v1.4s, v4.s[0]
    fmla v27.4s, v2.4s, v4.s[0]
    fmla v28.4s, v3.4s, v4.s[0]

    // Load second element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]

    // Load third element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate third column of c
    fmla v21.4s, v0.4s, v4.s[0]
    fmla v22.4s, v1.4s, v4.s[0]
    fmla v23.4s, v2.4s, v4.s[0]
    fmla v24.4s, v3.4s, v4.s[0]

    // Load fourth element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate fourth column of c
    fmla v5.4s, v0.4s, v4.s[0]
    fmla v6.4s, v1.4s, v4.s[0]
    fmla v7.4s, v2.4s, v4.s[0]
    fmla v8.4s, v3.4s, v4.s[0]

    // offset x6 to the next element in the column
    add x6, x6, #4 // #4 = sizeof(float)

    // Restore x1 to be incremented again
    mov x1, x6

    // Loop back to K
    cbnz x15, matmul_loop_over_K
    ...

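
With this blocking, each iteration of the M loop keeps a full 16x4 block of C in accumulator registers. Counting the registers used in the code above (``v5``-``v8`` and ``v17``-``v28`` as accumulators, ``v0``-``v3`` for the column of A, ``v4`` for the element of B), the kernel stays well within the 32 NEON vector registers:

.. math::

   \underbrace{4 \cdot 4}_{\text{accumulators}} + \underbrace{4}_{\text{column of }A} + \underbrace{1}_{\text{element of }B} = 21 \le 32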
|
Then we changed the number of iterations of the M loop to four (:math:`4 \cdot 16 = 64` rows):

.. code-block:: asm
    :linenos:

    ...
    mov x16, #4 // x16 iterator for M loop
    matmul_loop_over_M:
    sub x16, x16, #1

    // Load first column of the current 16x4 block of c
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column of the current 16x4 block of c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column of the current 16x4 block of c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column of the current 16x4 block of c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5

    mov x15, #64 // x15 iterator for K loop
    matmul_loop_over_K:
    sub x15, x15, #1
    ...

|
And finally we changed the number of iterations of the N loop to 16 (:math:`16 \cdot 4 = 64` columns); the resulting overall loop nest is sketched in C below:

.. code-block:: asm
    :linenos:

    ...
    mov x17, #16 // x17 iterator for N loop
    matmul_loop_over_N:
    sub x17, x17, #1

    mov x16, #4 // x16 iterator for M loop
    matmul_loop_over_M:
    sub x16, x16, #1
    ...

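
A compact C sketch of the resulting loop structure; the inner 16x4 block update stands in for the unrolled ``fmla`` code above, and the signature with leading dimensions is again an assumption:

.. code-block:: c
    :linenos:

    #include <stdint.h>

    // Hedged sketch of the loop structure of matmul_64_64_64: 16 blocks of 4
    // columns (N loop), 4 blocks of 16 rows (M loop), and a K loop around each
    // 16x4 accumulator block.
    void matmul_64_64_64_ref(const float *a, const float *b, float *c,
                             int64_t lda, int64_t ldb, int64_t ldc) {
        for (int64_t nb = 0; nb < 64; nb += 4) {        // N loop: 16 iterations
            for (int64_t mb = 0; mb < 64; mb += 16) {   // M loop: 4 iterations
                for (int64_t k = 0; k < 64; k++) {      // K loop: 64 iterations
                    for (int64_t n = nb; n < nb + 4; n++) {
                        float b_kn = b[k + n * ldb];    // element of B
                        for (int64_t m = mb; m < mb + 16; m++)
                            c[m + n * ldc] += a[m + k * lda] * b_kn;
                    }
                }
            }
        }
    }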
|
2. Test and optimize the kernel. Report your performance in GFLOPS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Optimized benchmark result:

.. code-block:: text
    :emphasize-lines: 4

    --------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                              Time             CPU   Iterations        FLOPS
    --------------------------------------------------------------------------------------------------------------------------------------------
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_mean         4111 ns         4097 ns           10       127.964G/s
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_median       4110 ns         4096 ns           10       127.988G/s
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_stddev       13.7 ns         13.8 ns           10       431.794M/s
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_cv           0.33 %          0.34 %            10            0.34%

- **matmul_64_64_64** kernel: :math:`128.0` GFLOPS
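
The same consistency check as above: with :math:`2 \cdot 64 \cdot 64 \cdot 64 = 524{,}288` floating-point operations per call,

.. math::

   \frac{524288}{4097\,\text{ns}} \approx 128.0\ \text{GFLOPS}.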

Microkernel
-----------

1. Implement generate function, support only the setting of an FP32 microkernel for C+=AB for M=16, N=6, K=1 and test for errors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2. Add support for k parameter by generating a K loop around the microkernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3. Test the kernel generation. Report performance in GFLOPS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^