Commit a696b21

doc: neon task 4 and 5

1 parent eb632ae commit a696b21

File tree

4 files changed: +358 additions, -10 deletions

docs_sphinx/getting_started/building_project.rst

Lines changed: 2 additions & 2 deletions
@@ -89,7 +89,7 @@ Building
.. code-block::

- cmake --build . --config Release --target benchmark
+ cmake --build . --config Release --target benchmarks

Options for ``--config`` are **Release** and **Debug**. :raw-html:`</br>`
Options for ``--target`` are **benchmarks** and **tests**
@@ -98,7 +98,7 @@ Building
.. code-block:: bash

- Options for ``--target`` are **benchmark** and **tests**
+ Options for ``--target`` are **benchmarks** and **tests**

+--------------------+--------------------------------------------------------------------------------------------------------------------+
Lines changed: 351 additions & 0 deletions
@@ -0,0 +1,351 @@

Submission 2025-05-08
=====================

SIMD Lanes
----------

This section considers matrix-matrix multiplications that require instructions in which only a subset of the SIMD lanes is active.

1. Implement a kernel for M=14, N=6 and K=64 and wrap it in the matmul_14_6_64 function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``neon_4_1.s``

For the kernel ``matmul_14_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that each column now uses three ``fmla`` instructions that operate on 4 scalars each and one ``fmla`` instruction that operates on only 2 scalars: :math:`4 \cdot 3 + 1 \cdot 2 = 14`.
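
For reference, the following minimal C sketch shows the operation the kernel computes, assuming column-major, tightly packed operands (lda = 14, ldb = 64, ldc = 14); the function name and signature are illustrative only, not the kernel's actual interface:

.. code-block:: c

    // Reference for C += A * B with M=14, N=6, K=64 (column-major, tightly packed).
    // Illustrative sketch only - the assembly kernel's calling convention may differ.
    void matmul_14_6_64_ref(const float *a, const float *b, float *c) {
        for (int n = 0; n < 6; n++) {
            for (int k = 0; k < 64; k++) {
                for (int m = 0; m < 14; m++) {
                    c[n * 14 + m] += a[k * 14 + m] * b[n * 64 + k];
                }
            }
        }
    }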

We load the full 16 floats and ignore the last 2:

.. code-block:: asm
    :linenos:

    ...
    // Load first column from the 14x6 matrix c - load full 16 entries - ignore last 2
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column from the 14x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column from the 14x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column from the 14x6 matrix c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5
    // Load fifth column from the 14x6 matrix c
    ld1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5
    // Load sixth column from the 14x6 matrix c
    ld1 {v13.4s, v14.4s, v15.4s, v16.4s}, [x2], x5
    ...

Next, the loop over K:

.. code-block:: asm
    :linenos:

    ...
    mov x9, #64 // x9 iterator for K loop
    matmul_loop_over_K:
    sub x9, x9, #1

    // Load first column data from the 14x1 matrix a (again 16 but we'll only using two from v3)
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3

    // run the known matmul_16_6_1_unrolled kernel with modification to matmult_14_6_1
    // Load first element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0] // 4 floats
    fmla v26.4s, v1.4s, v4.s[0] // 4 floats
    fmla v27.4s, v2.4s, v4.s[0] // 4 floats
    fmla v28.2s, v3.2s, v4.s[0] // 2 floats

    // Load second element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.2s, v3.2s, v4.s[0]
    ...

We store the full 16 computed floats for each of the first five columns but advance the pointer by only 14 floats, since the last two lanes are unused; the two extra values spill into the start of the next column and are overwritten by the following store. The last column is stored as exactly 14 values (8+4+2) so that we do not write past the end of C.

.. code-block:: asm
    :linenos:

    ...
    // Store first column back to memory
    st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5 // offset of 14 floats
    // Store second column back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5 // offset of 14 floats
    // Store third column back to memory
    st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5 // offset of 14 floats
    // Store fourth column back to memory
    st1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5 // offset of 14 floats
    // Store fifth column back to memory
    st1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5 // offset of 14 floats
    // Store sixth column back to memory (exactly last 14 elements)
    stp q13, q14, [x2] // 8 floats
    str q15, [x2, #32] // 4 floats
    str d16, [x2, #48] // 2 floats
    ...

2. Implement a kernel for M=15, N=6 and K=64 and wrap it in the matmul_15_6_64 function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``neon_4_2.s``

For the kernel ``matmul_15_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we ignore the last of the 16 computed float values per column when storing back to memory.

We load the full 16 floats and ignore the last one:

.. code-block:: asm
    :linenos:

    ...
    // Load first column from the 15x6 matrix c - load full 16 entries - ignore last
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column from the 15x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column from the 15x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column from the 15x6 matrix c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5
    // Load fifth column from the 15x6 matrix c
    ld1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5
    // Load sixth column from the 15x6 matrix c
    ld1 {v13.4s, v14.4s, v15.4s, v16.4s}, [x2], x5
    ...

Next, the loop over K:

.. code-block:: asm
    :linenos:

    ...
    mov x9, #64 // x9 iterator for K loop
    matmul_loop_over_K:
    sub x9, x9, #1

    // Load first column data from the 15x1 matrix a
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3
    // ldp q0, q1, [x0] // 4 + 4 values
    // ldr q2, [x0, #32] // 4 values
    // ldr d3, [x0, #48] // 2 values

    // run the known matmul_16_6_1_unrolled kernel with modification to matmult_15_6_1
    // Load first element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0]
    fmla v26.4s, v1.4s, v4.s[0]
    fmla v27.4s, v2.4s, v4.s[0]
    fmla v28.4s, v3.4s, v4.s[0]

    // Load second element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]
    ...

We store the full 16 computed floats for each of the first five columns but advance the pointer by only 15 floats, since the last lane is unused; the extra value spills into the start of the next column and is overwritten by the following store. The last column is stored as exactly 15 values (8+4+2+1) so that we do not write past the end of C.

.. code-block:: asm
    :linenos:

    ...
    // Store first column back to memory
    st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5 // offset of 15 floats
    // Store second column back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5 // offset of 15 floats
    // Store third column back to memory
    st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5 // offset of 15 floats
    // Store fourth column back to memory
    st1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5 // offset of 15 floats
    // Store fifth column back to memory
    st1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5 // offset of 15 floats
    // Store sixth column back to memory (exactly last 15 elements)
    stp q13, q14, [x2] // 8 floats
    str q15, [x2, #32] // 4 floats
    str d16, [x2, #48] // 2 floats
    mov w9, v16.s[2]
    str w9, [x2, #56] // 1 float
    ...
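
The exact 15-element tail store of the last column (8 + 4 + 2 + 1 floats) can also be expressed with NEON intrinsics; the following is a rough C equivalent of the instruction sequence above, with hypothetical names standing in for the four accumulator registers:

.. code-block:: c

    #include <arm_neon.h>

    // Sketch of the exact tail store: 8 + 4 + 2 + 1 = 15 floats.
    // q13..q16 stand in for the kernel's accumulator registers of the last column.
    static inline void store_last_column_15(float *c,
                                            float32x4_t q13, float32x4_t q14,
                                            float32x4_t q15, float32x4_t q16) {
        vst1q_f32(c, q13);                    // floats 0..3   (stp q13, q14)
        vst1q_f32(c + 4, q14);                // floats 4..7
        vst1q_f32(c + 8, q15);                // floats 8..11  (str q15)
        vst1_f32(c + 12, vget_low_f32(q16));  // floats 12..13 (str d16)
        vst1q_lane_f32(c + 14, q16, 2);       // float 14      (mov w9, v16.s[2]; str w9)
    }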

3. Test and optimize the kernels. Report your performance in GFLOPS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Optimized benchmark results:

.. code-block:: asm
    :emphasize-lines: 4, 8

    --------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark Time CPU Iterations FLOPS
    --------------------------------------------------------------------------------------------------------------------------------------------
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_mean 94.8 ns 94.5 ns 10 113.789G/s
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_median 94.8 ns 94.5 ns 10 113.775G/s
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_stddev 0.671 ns 0.659 ns 10 790.609M/s
    GemmMxNxKFixture<14, 6, 64>/BM_matmul_14_6_64/min_warmup_time:1.000_cv 0.71 % 0.70 % 10 0.69%
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_mean 95.5 ns 95.1 ns 10 121.074G/s
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_median 95.4 ns 95.1 ns 10 121.09G/s
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_stddev 0.295 ns 0.293 ns 10 373.529M/s
    GemmMxNxKFixture<15, 6, 64>/BM_matmul_15_6_64/min_warmup_time:1.000_cv 0.31 % 0.31 % 10 0.31%

- **matmul_14_6_64** kernel: :math:`113.8` GFLOPS
- **matmul_15_6_64** kernel: :math:`121.1` GFLOPS
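
As a sanity check, these rates match the operation count of :math:`2 \cdot M \cdot N \cdot K` floating-point operations per kernel call divided by the measured CPU time:

.. math::

    \frac{2 \cdot 14 \cdot 6 \cdot 64}{94.5\,\text{ns}} \approx 113.8\ \text{GFLOPS}, \qquad \frac{2 \cdot 15 \cdot 6 \cdot 64}{95.1\,\text{ns}} \approx 121.1\ \text{GFLOPS}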

Accumulator Block Shapes
------------------------

This section considers a matrix-matrix multiplication where a high-performance implementation may require accumulator blocks with different shapes.

1. Implement a kernel for M=64, N=64 and K=64 and wrap it in the matmul_64_64_64 function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``neon_5_1.s``

For the kernel ``matmul_64_64_64`` we adapt the already implemented kernel ``matmul_64_48_64``. The only change is that we removed two ``fmla`` blocks from the inner loop, leaving a 16x4 accumulator block:

.. code-block:: asm
    :linenos:

    ...
    mov x15, #64 // x15 iterator for K loop
    matmul_loop_over_K:
    sub x15, x15, #1

    // Load first column data from the 16x1 matrix a
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3

    // run the matmul_16_4_1_unrolled kernel
    // Load first element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0]
    fmla v26.4s, v1.4s, v4.s[0]
    fmla v27.4s, v2.4s, v4.s[0]
    fmla v28.4s, v3.4s, v4.s[0]


    // Load second element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]


    // Load third element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculated third column of c
    fmla v21.4s, v0.4s, v4.s[0]
    fmla v22.4s, v1.4s, v4.s[0]
    fmla v23.4s, v2.4s, v4.s[0]
    fmla v24.4s, v3.4s, v4.s[0]


    // Load fourth element from the 1x4 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate fourth column of c
    fmla v5.4s, v0.4s, v4.s[0]
    fmla v6.4s, v1.4s, v4.s[0]
    fmla v7.4s, v2.4s, v4.s[0]
    fmla v8.4s, v3.4s, v4.s[0]


    // offset x6 to the next element in the column
    add x6, x6, #4 // #4 = sizeof(float)

    // Restore x1 to be incremented again
    mov x1, x6

    // Loop back to K
    cbnz x15, matmul_loop_over_K
    ...

Then we changed the number of iterations of the M loop to four (:math:`4 \cdot 16 = 64`):

.. code-block:: asm
    :linenos:

    ...
    mov x16, #4 // x16 iterator for M loop
    matmul_loop_over_M:
    sub x16, x16, #1

    // Load first column from the 16x6 matrix c
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column from the 16x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column from the 16x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column from the 16x6 matrix c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5

    mov x15, #64 // x15 iterator for K loop
    matmul_loop_over_K:
    sub x15, x15, #1
    ...

And finally we changed the number of iterations of the N loop to 16 (:math:`16 \cdot 4 = 64`):

.. code-block:: asm
    :linenos:

    ...
    mov x17, #16 // x17 iterator for N loop
    matmul_loop_over_N:
    sub x17, x17, #1

    mov x16, #4 // x16 iterator for M loop
    matmul_loop_over_M:
    sub x16, x16, #1
    ...
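
Taken together, the loop nest computes one 16x4 block of C per M/N iteration. The following C sketch illustrates the blocking scheme only (column-major, tightly packed 64x64 matrices assumed; the names are illustrative, not the kernel's interface):

.. code-block:: c

    // Blocking sketch of matmul_64_64_64: N loop (16 blocks of 4 columns),
    // M loop (4 blocks of 16 rows), K loop (64), with a 16x4 accumulator
    // block that the assembly kernel keeps in NEON registers.
    void matmul_64_64_64_sketch(const float *a, const float *b, float *c) {
        for (int nb = 0; nb < 16; nb++) {
            for (int mb = 0; mb < 4; mb++) {
                float acc[4][16];
                // load the 16x4 block of C
                for (int j = 0; j < 4; j++)
                    for (int i = 0; i < 16; i++)
                        acc[j][i] = c[(nb * 4 + j) * 64 + mb * 16 + i];
                // accumulate over K
                for (int k = 0; k < 64; k++)
                    for (int j = 0; j < 4; j++)
                        for (int i = 0; i < 16; i++)
                            acc[j][i] += a[k * 64 + mb * 16 + i] * b[(nb * 4 + j) * 64 + k];
                // store the block back
                for (int j = 0; j < 4; j++)
                    for (int i = 0; i < 16; i++)
                        c[(nb * 4 + j) * 64 + mb * 16 + i] = acc[j][i];
            }
        }
    }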

2. Test and optimize the kernel. Report your performance in GFLOPS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Optimized benchmark result:

.. code-block:: asm
    :emphasize-lines: 4

    --------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark Time CPU Iterations FLOPS
    --------------------------------------------------------------------------------------------------------------------------------------------
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_mean 4111 ns 4097 ns 10 127.964G/s
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_median 4110 ns 4096 ns 10 127.988G/s
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_stddev 13.7 ns 13.8 ns 10 431.794M/s
    GemmMxNxKFixture<64, 64, 64>/BM_matmul_64_64_64/min_warmup_time:1.000_cv 0.33 % 0.34 % 10 0.34%

- **matmul_64_64_64** kernel: :math:`128.0` GFLOPS
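
This again matches the operation count of :math:`2 \cdot 64^3` floating-point operations per call divided by the measured CPU time:

.. math::

    \frac{2 \cdot 64^3}{4097\,\text{ns}} \approx 128.0\ \text{GFLOPS}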

Microkernel
-----------

1. Implement generate function, support only the setting of an FP32 microkernel for C+=AB for M=16, N=6, K=1 and test for errors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2. Add support for k parameter by generating a K loop around the microkernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3. Test the kernel generation. Report performance in GFLOPS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

submissions/submission_25_05_08/neon_4_1.s

Lines changed: 1 addition & 4 deletions
@@ -54,11 +54,8 @@ matmul_14_6_64:
matmul_loop_over_K:
sub x9, x9, #1

- // Load first column data from the 14x1 matrix a
+ // Load first column data from the 14x1 matrix a (again 16 but we'll only using two from v3)
ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3
- // ldp q0, q1, [x0] // 4 + 4 values
- // ldr q2, [x0, #32] // 4 values
- // ldr d3, [x0, #48] // 2 values

// run the known matmul_16_6_1_unrolled kernel with modification to matmult_14_6_1
// Load first element from the 1x6 matrix b
