Commit e225bfe
doc: neon 2
1 parent 1496934 commit e225bfe

File tree

2 files changed: +149 −7 lines


docs_sphinx/getting_started/building_project.rst

Lines changed: 3 additions & 1 deletion
All the executables can be found in ``../machine-learning-compilers/build``.
The available executables are ``benchmarks`` and ``tests``.

.. note::

   They are available when built with their respective ``--target``.

E.g. the ``benchmarks`` executable can be run with the following command:
The most useful command for the ``benchmarks`` might be:

.. code-block::

   ./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

docs_sphinx/submissions/report_25_05_01.rst

Lines changed: 146 additions & 6 deletions
Microkernel
-----------

1. Implement a Neon microkernel that computes C+=AB for M=16, N=6, and K=1.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- File ``submissions/submission_25_05_01/neon_2_1_simple.s``
- Driver ``submissions/submission_25_05_01/neon_2_1_driver.cpp``

The implementation loops over the columns of the matrix C to be calculated.

.. code-block:: asm
   :linenos:

       ...

       // Scale the leading dimensions by the size of a float (4 bytes == left shift by 2)
       lsl x3, x3, #2 // x3 * 4 = x3 * sizeof(float)
       lsl x4, x4, #2 // x4 * 4 = x4 * sizeof(float)
       lsl x5, x5, #2 // x5 * 4 = x5 * sizeof(float)

       // Load all data of the 16x1 matrix A
       ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]

       // Init the loop counter
       mov x6, #6
   process_next_column:
       // Iteration -= 1
       subs x6, x6, #1

       // Load the next element of the 1x6 matrix B
       // ldr s4, [x1], #4 // one-liner, but ignores the leading-dimension argument
       ldr s4, [x1]
       add x1, x1, x4

       // Load the next column of the 16x6 matrix C
       ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2]

       // Calculate the next column of C
       fmla v17.4s, v0.4s, v4.s[0]
       fmla v18.4s, v1.4s, v4.s[0]
       fmla v19.4s, v2.4s, v4.s[0]
       fmla v20.4s, v3.4s, v4.s[0]

       // Store the result back to memory
       st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5

       // Compare and branch on not-zero
       cbnz x6, process_next_column

       ...
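For reference, the computation the assembly performs can be sketched as a plain C++ function (a sketch for illustration, not part of the submission; it assumes column-major layout and the same leading-dimension arguments that the kernel receives in ``x3``/``x4``/``x5``):

```cpp
#include <cassert>
#include <cstddef>

// Reference C += A * B for M = 16, N = 6, K = 1 (column-major).
// lda/ldb/ldc are the leading dimensions of A, B and C, matching the
// x3/x4/x5 arguments of the assembly kernel.
void matmul_16_6_1_ref(const float* a, const float* b, float* c,
                       std::size_t lda, std::size_t ldb, std::size_t ldc) {
    (void)lda;  // unused: A has only one column since K = 1
    for (std::size_t n = 0; n < 6; ++n) {       // columns of C
        const float b_n = b[n * ldb];           // one element of the 1x6 B
        for (std::size_t m = 0; m < 16; ++m) {  // rows of C
            c[m + n * ldc] += a[m] * b_n;       // K = 1: a single FMA each
        }
    }
}
```

The assembly processes one 16-element column of C per loop iteration (four ``fmla`` instructions on four quad registers); the inner loop above corresponds to exactly that column update.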
2. Test and optimize your microkernel. Report its performance in GFLOPS.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Files

  - ``submissions/submission_25_05_01/neon_2_1.h``
  - ``submissions/submission_25_05_01/neon_2_1_unrolled.s``

- Tests ``submissions/submission_25_05_01/neon_2_1.test.cpp``
- Benchmarks ``submissions/submission_25_05_01/neon_2_1.bench.cpp``

Optimization
############

To optimize the kernel we unrolled the loop into 3 blocks that use distinct register ranges (v25-v28, v17-v20, v21-v24),
reducing the dependencies between the calculations of consecutive columns.
These 3 ``fmla`` blocks are repeated with ``.rept 2`` to cover all 6 columns.

.. code-block:: asm
   :linenos:

       ...

   .rept 2
       // Load the first element of the 1x6 matrix B
       ldr s4, [x1]
       add x1, x1, x4
       // Load the first column of the 16x6 matrix C
       ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2]

       // Calculate the first column of C
       fmla v25.4s, v0.4s, v4.s[0]
       fmla v26.4s, v1.4s, v4.s[0]
       fmla v27.4s, v2.4s, v4.s[0]
       fmla v28.4s, v3.4s, v4.s[0]

       // Store the first column back to memory
       st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5

       // Load the second element of the 1x6 matrix B
       ldr s4, [x1]
       add x1, x1, x4
       // Load the second column of the 16x6 matrix C
       ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2]

       // Calculate the second column of C
       fmla v17.4s, v0.4s, v4.s[0]
       fmla v18.4s, v1.4s, v4.s[0]
       fmla v19.4s, v2.4s, v4.s[0]
       fmla v20.4s, v3.4s, v4.s[0]

       // Store the second column back to memory
       st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5

       // Load the third element of the 1x6 matrix B
       ldr s4, [x1]
       add x1, x1, x4
       // Load the third column of the 16x6 matrix C
       ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2]

       // Calculate the third column of C
       fmla v21.4s, v0.4s, v4.s[0]
       fmla v22.4s, v1.4s, v4.s[0]
       fmla v23.4s, v2.4s, v4.s[0]
       fmla v24.4s, v3.4s, v4.s[0]

       // Store the third column back to memory
       st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
   .endr

       ...
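The effect of unrolling with independent accumulator registers can be illustrated with a scalar C++ analogue (a sketch; the function names are made up for illustration and do not appear in the submission):

```cpp
#include <cassert>
#include <cstddef>

// Single accumulator: every addition depends on the previous one,
// so the additions form one serial dependency chain.
float sum_single(const float* x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

// Three independent accumulators, analogous to the three fmla blocks
// that use the disjoint register ranges v25-v28, v17-v20 and v21-v24:
// the CPU can keep three addition chains in flight at once.
float sum_unrolled3(const float* x, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f;
    std::size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        acc0 += x[i];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
    }
    for (; i < n; ++i) acc0 += x[i];  // remainder loop
    return acc0 + acc1 + acc2;
}
```

The analogy is loose: in the matmul kernel each column of C is already an independent computation, so the unrolling mainly removes loop overhead and gives the out-of-order core more independent work per branch.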
Benchmarks
##########

We run the benchmarks with the following command:

.. code-block::

   ./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

We therefore perform 10 repetitions of the benchmark, each doing about ``120 000 000`` iterations of our matmul kernels.

.. code-block::
   :emphasize-lines: 4, 8

   ----------------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                                          Time             CPU   Iterations        FLOPS
   ----------------------------------------------------------------------------------------------------------------------------------
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean            5.89 ns         5.87 ns           10   32.7048G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median          5.89 ns         5.87 ns           10   32.7228G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev         0.046 ns        0.044 ns           10   244.331M/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv              0.77 %          0.75 %            10        0.75%
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean          5.74 ns         5.72 ns           10   33.5453G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median        5.73 ns         5.71 ns           10   33.6103G/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev       0.051 ns        0.050 ns           10   291.918M/s
   Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv            0.90 %          0.88 %            10        0.87%

We see that the simple first implementation of our matmul kernel achieves about **32.7 GFLOPS**.
The optimized unrolled version gains about 0.8 GFLOPS, resulting in **33.5 GFLOPS**.
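The reported FLOPS counters can be cross-checked by hand (a sketch; it assumes the counter is derived from the CPU time and that each ``fmla`` lane counts as two floating-point operations, one multiply and one add):

```cpp
#include <cassert>
#include <cmath>

// FLOP count of one C += A * B call: M * N * K fused multiply-adds,
// each counted as two floating-point operations.
constexpr double flops_per_call(int m, int n, int k) {
    return 2.0 * m * n * k;
}

// GFLOPS implied by an average per-call time given in nanoseconds:
// FLOP / ns is numerically the same as GFLOP / s.
constexpr double gflops(double flops, double time_ns) {
    return flops / time_ns;
}
```

With M=16, N=6, K=1 this gives 192 FLOPs per call, so the mean CPU times of 5.87 ns and 5.72 ns imply roughly 32.7 and 33.6 GFLOP/s, consistent with the table above.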
