Microkernel
-----------

1. Implement a Neon microkernel that computes C+=AB for M=16, N=6, and K=1.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File: ``submissions/submission_25_05_01/neon_2_1_simple.s``
- Driver: ``submissions/submission_25_05_01/neon_2_1_driver.cpp``

The implementation loops over the six columns of the output matrix C: each iteration loads one column of C, accumulates the column of A scaled by the matching element of B, and stores the column back.

.. code-block:: asm
   :linenos:

   ...

   // Scale the leading dimensions by the size of a float (4 bytes == 2 left shifts)
   lsl x3, x3, #2 // x3 * 4 = x3 * sizeof(float)
   lsl x4, x4, #2 // x4 * 4 = x4 * sizeof(float)
   lsl x5, x5, #2 // x5 * 4 = x5 * sizeof(float)

   // Load all data from the 16x1 matrix a
   ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]

   // Init the loop counter
   mov x6, #6
   process_next_column:
   // Iteration -= 1
   subs x6, x6, #1

   // Load the next element from the 1x6 matrix b
   // ldr s4, [x1], #4 // one-liner, but it ignores the leading-dimension argument
   ldr s4, [x1]
   add x1, x1, x4

   // Load the next column from the 16x6 matrix c
   ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2]

   // Calculate the next column of c
   fmla v17.4s, v0.4s, v4.s[0]
   fmla v18.4s, v1.4s, v4.s[0]
   fmla v19.4s, v2.4s, v4.s[0]
   fmla v20.4s, v3.4s, v4.s[0]

   // Store the result back to memory
   st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5

   // Compare and branch if not zero
   cbnz x6, process_next_column

   ...

2. Test and optimize your microkernel. Report its performance in GFLOPS.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Files:

  - ``submissions/submission_25_05_01/neon_2_1.h``
  - ``submissions/submission_25_05_01/neon_2_1_unrolled.s``

- Tests: ``submissions/submission_25_05_01/neon_2_1.test.cpp``
- Benchmarks: ``submissions/submission_25_05_01/neon_2_1.bench.cpp``

Optimization
############

To optimize the kernel we unrolled the loop into three blocks that use disjoint vector register ranges (v25-v28, v17-v20, v21-v24),
which reduces the dependencies between the calculations of consecutive columns.
These three ``fmla`` blocks are repeated with ``.rept 2`` to cover all six columns.

.. code-block:: asm
   :linenos:

   ...

   .rept 2
   // Load the first element from the 1x6 matrix b
   ldr s4, [x1]
   add x1, x1, x4
   // Load the first column from the 16x6 matrix c
   ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2]

   // Calculate the first column of c
   fmla v25.4s, v0.4s, v4.s[0]
   fmla v26.4s, v1.4s, v4.s[0]
   fmla v27.4s, v2.4s, v4.s[0]
   fmla v28.4s, v3.4s, v4.s[0]

   // Store the first column back to memory
   st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5

   // Load the second element from the 1x6 matrix b
   ldr s4, [x1]
   add x1, x1, x4
   // Load the second column from the 16x6 matrix c
   ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2]

   // Calculate the second column of c
   fmla v17.4s, v0.4s, v4.s[0]
   fmla v18.4s, v1.4s, v4.s[0]
   fmla v19.4s, v2.4s, v4.s[0]
   fmla v20.4s, v3.4s, v4.s[0]

   // Store the second column back to memory
   st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5

   // Load the third element from the 1x6 matrix b
   ldr s4, [x1]
   add x1, x1, x4
   // Load the third column from the 16x6 matrix c
   ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2]

   // Calculate the third column of c
   fmla v21.4s, v0.4s, v4.s[0]
   fmla v22.4s, v1.4s, v4.s[0]
   fmla v23.4s, v2.4s, v4.s[0]
   fmla v24.4s, v3.4s, v4.s[0]

   // Store the third column back to memory
   st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
   .endr

   ...

Benchmarks
##########

We run the benchmarks with the following command:

.. code-block:: bash

   ./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

This performs 10 repetitions of each benchmark, with about ``120 000 000`` iterations of our matmul kernels per repetition.

.. code-block:: text
   :emphasize-lines: 4, 8

   ----------------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                                               Time             CPU   Iterations      FLOPS
   ----------------------------------------------------------------------------------------------------------------------------------
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean                 5.89 ns         5.87 ns           10 32.7048G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median               5.89 ns         5.87 ns           10 32.7228G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev              0.046 ns        0.044 ns           10 244.331M/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv                   0.77 %          0.75 %            10      0.75%
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean               5.74 ns         5.72 ns           10 33.5453G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median             5.73 ns         5.71 ns           10 33.6103G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev            0.051 ns        0.050 ns           10 291.918M/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv                 0.90 %          0.88 %            10      0.87%

We see that the simple first implementation of our matmul kernel achieves about **32.7 GFLOPS**.
The optimized unrolled version gains about 0.8 GFLOPS, reaching **33.5 GFLOPS**.