
Commit 5225bb0

doc: iteration neon
1 parent bd0253d commit 5225bb0

1 file changed: docs_sphinx/chapters/neon.rst (+98 additions, -38 deletions)
@@ -1,16 +1,24 @@
 Neon
 ====

+This chapter focuses on implementing the first kernels using ARM64 `Neon <https://developer.arm.com/Architectures/Neon>`_ instructions.
+The goal is to develop highly optimized kernels for matrix-matrix multiplication and batch-reduced matrix multiplication.
+
 Execution Throughput and Latency
 --------------------------------

-This section microbenchmarks the execution throughput and latency of FP32 Neon instructions.
+First, we will microbenchmark the execution throughput and latency of selected FP32 Neon instructions. This will provide a better
+understanding of their performance characteristics and serve as a reference point for performance expectations.

 1. Execution Throughput
 ^^^^^^^^^^^^^^^^^^^^^^^

 **Task**: Microbenchmark the execution throughput of the following instructions:

+Each subtask is structured into four parts: the file containing the implementation of the subtask, the driver file that runs the assembly code,
+a compilation command to create an executable, and a short description of the results. The results of the microbenchmarks are documented in the
+image below:
+
 .. image:: ../_static/images/report_25_05_01/neon_1_1.png
    :align: center

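The benchmark sources themselves are not shown in this diff. As an illustration, the body of a throughput benchmark for ``FMLA`` (vector, ``4S``) could look roughly like the sketch below: many ``FMLA`` instructions with independent destination registers, so that consecutive instructions do not wait on each other and the out-of-order core can keep all SIMD pipelines busy. Register choices and the iteration counter in ``x0`` are assumptions, not the contents of ``neon_1_1.s``.

.. code-block:: asm

    // Illustrative sketch only - not the actual neon_1_1.s.
    // x0 holds the iteration count passed in by the driver.
    throughput_loop:
        fmla v0.4s,  v30.4s, v31.4s
        fmla v1.4s,  v30.4s, v31.4s
        fmla v2.4s,  v30.4s, v31.4s
        fmla v3.4s,  v30.4s, v31.4s
        fmla v4.4s,  v30.4s, v31.4s
        fmla v5.4s,  v30.4s, v31.4s
        fmla v6.4s,  v30.4s, v31.4s
        fmla v7.4s,  v30.4s, v31.4s
        fmla v8.4s,  v30.4s, v31.4s
        fmla v9.4s,  v30.4s, v31.4s
        fmla v10.4s, v30.4s, v31.4s
        fmla v11.4s, v30.4s, v31.4s
        fmla v12.4s, v30.4s, v31.4s
        fmla v13.4s, v30.4s, v31.4s
        fmla v14.4s, v30.4s, v31.4s
        fmla v15.4s, v30.4s, v31.4s
        subs x0, x0, #1
        b.ne throughput_loop
        ret
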
@@ -20,34 +28,40 @@ This section microbenchmarks the execution throughput and latency of FP32 Neon i
 - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
 - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
 - We have :math:`13.2304 \cdot 10^{10}` instructions per second.
-  That are :math:`13.2304 \cdot 10^{10} / 8 = 16.538 \cdot 10^9` instructions per ALU per second.
-  This aligns with a **throughput of** :math:`\approx 4` **instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
-
+  That is :math:`132.304` GFLOPs/sec in total.

 **Subtask**: ``FMLA`` (vector) with arrangement specifier ``2S``.

 - File: ``submissions/submission_25_05_01/neon_1_1.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
 - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
 - We have :math:`6.65221 \cdot 10^{10}` instructions per second.
-  That are :math:`6.65221 \cdot 10^{10} / 8 = 8.31526 \cdot 10^9` instructions per ALU per second.
-  This aligns with a **throughput of** :math:`\approx 2` **instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
-
+  That is :math:`66.5221` GFLOPs/sec in total.

 **Subtask**: ``FMADD`` (scalar), single-precision variant.

 - File: ``submissions/submission_25_05_01/neon_1_1.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
 - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
 - We have :math:`1.12728 \cdot 10^{10}` instructions per second.
-  That are :math:`1.12728 \cdot 10^{10} / 8 = 1.4091 \cdot 10^9` instructions per ALU per second.
-  This aligns with a **throughput of** :math:`\approx 1/3` **instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
+  That is :math:`11.2728` GFLOPs/sec in total.
+
+**Summary**
+
+It can be seen that the usage of SIMD lanes can increase the throughput significantly. From the scalar ``FMADD`` instruction to the vector
+``FMLA`` instruction with arrangement specifier ``2S``, the throughput increases by a factor of about 6. The throughput of the vector
+``FMLA`` instruction with arrangement specifier ``4S`` is about twice as high as the one with ``2S``, resulting in a factor of about 12 compared to
+the scalar ``FMADD`` instruction. This shows the power of SIMD instructions and how they can be used to increase the throughput.

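For reference, these factors follow directly from the measured rates above:

.. math::

   \frac{66.5221}{11.2728} \approx 5.9
   \qquad\text{and}\qquad
   \frac{132.304}{11.2728} \approx 11.7 .
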
 2. Execution Latency
 ^^^^^^^^^^^^^^^^^^^^

 **Task**: Microbenchmark the execution latency of ``FMLA`` (vector) with arrangement specifier ``4s``. Consider the following two cases:

+The structure is the same as above: the file containing the implementation of the subtask, the driver file that runs the assembly code,
+a compilation command to create an executable, and a short description of the results. The results of the microbenchmarks are documented
+in the image below:
+
 .. image:: ../_static/images/report_25_05_01/neon_1_2.png
    :align: center

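Schematically, the two dependency patterns can be sketched as follows. This is an illustration only and not the actual ``neon_1_2.s``; register choices and the counters in ``x0``/``x1`` are assumptions.

.. code-block:: asm

    // Case 1: dependency through a source register - each FMLA reads the
    // previous result as a multiplicand, forming a true dependency chain.
    latency_src_chain:
        fmla v1.4s, v0.4s, v2.4s
        fmla v0.4s, v1.4s, v2.4s
        fmla v1.4s, v0.4s, v2.4s
        fmla v0.4s, v1.4s, v2.4s
        subs x0, x0, #1
        b.ne latency_src_chain
        ret

    // Case 2: dependency only through the destination (accumulator)
    // register - every FMLA accumulates into the same register.
    latency_dst_chain:
        fmla v0.4s, v1.4s, v2.4s
        fmla v0.4s, v1.4s, v2.4s
        fmla v0.4s, v1.4s, v2.4s
        fmla v0.4s, v1.4s, v2.4s
        subs x1, x1, #1
        b.ne latency_dst_chain
        ret
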
@@ -67,10 +81,17 @@ This section microbenchmarks the execution throughput and latency of FP32 Neon i
 - We have :math:`11.7019 \cdot 10^9` instructions per second in a single ALU.
   Resulting in a **latency of** :math:`\approx 1/3` **cycle** for the known clock speed of 4.4 GHz.

+**Summary**
+
+We see that the latency of the ``FMLA`` instruction is about 1/3 of a cycle, regardless of whether there is a dependency on one of the
+source registers or only on the destination register.

 Microkernel
 -----------

+Next, we implement the first microkernel for the matrix-matrix multiplication of a :math:`16 \times 1` matrix with a :math:`1 \times 6` matrix,
+which uses a :math:`16 \times 6` accumulator matrix C and computes C+=AB.
+
 1. matmul_16_6_1
 ^^^^^^^^^^^^^^^^

@@ -79,7 +100,7 @@ Microkernel
 - File: ``submissions/submission_25_05_01/neon_2_simple.s``
 - Driver: ``submissions/submission_25_05_01/neon_2_driver.cpp``

-Implementation loops over each column over the matrix c to be calculated.
+Implementation of the microkernel looping over each of the six columns of the matrix C:

 .. code-block:: asm
    :linenos:
@@ -120,14 +141,16 @@ Implementation loops over each column over the matrix c to be calculated.
     cbnz x6, process_next_column
     ...

+.. _neon_2_optimization:
+
 2. Performance
 ^^^^^^^^^^^^^^

 **Task**: Test and optimize your microkernel. Report its performance in GFLOPS.

-- Files:
-- ``submissions/submission_25_05_01/neon_2.h``
-- ``submissions/submission_25_05_01/neon_2_unrolled.s``
+- Files:
+  - ``submissions/submission_25_05_01/neon_2.h`` using a loop over the columns
+  - ``submissions/submission_25_05_01/neon_2_unrolled.s`` using an unrolled version of the loop
 - Tests: ``submissions/submission_25_05_01/neon_2.test.cpp``
 - Benchmarks: ``submissions/submission_25_05_01/neon_2.bench.cpp``

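For context on the reported numbers: one call of the ``matmul_16_6_1`` microkernel performs

.. math::

   2 \cdot M \cdot N \cdot K = 2 \cdot 16 \cdot 6 \cdot 1 = 192 \;\text{FLOPs}

(counting each FMA as a multiply and an add). The GFLOPS figures presumably divide this FLOP count, multiplied by the number of benchmark iterations, by the measured runtime.
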
@@ -221,11 +244,16 @@ The optimized unrolled version gets about 0.7 GFLOPS more resulting in **33.7 GF
 Loops
 -----

+To scale the microkernel to larger matrices, we will introduce loops over the *K*, *M*, and *N* dimensions.
+
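Using the :math:`16 \times 6 \times 1` microkernel as the building block, the loop trip counts for the largest kernel of this section (:math:`64 \times 48 \times 64`) work out to:

.. math::

   K:\; 64/1 = 64 \text{ iterations}, \qquad
   M:\; 64/16 = 4 \text{ iterations}, \qquad
   N:\; 48/6 = 8 \text{ iterations}.
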
 1. Loop over K
 ^^^^^^^^^^^^^^

 **Task**: Loop over K: Implement a kernel that computes C+=AB for M=16, N=6 and K=64. Wrap your kernel in the ``matmul_16_6_64`` function.

+The first loop implemented is over the *K* dimension, which is the innermost loop in the matrix multiplication. The result is a microkernel
+that computes C+=AB for M=16, N=6 and K=64.
+
 - File ``submissions/submission_25_05_01/neon_3_1.s``

 .. code-block:: asm
@@ -358,6 +386,8 @@ Loops

 **Task**: Loop over M: Implement a kernel that computes C+=AB for M=64, N=6 and K=64. Wrap your kernel in the ``matmul_64_6_64`` function.

+The next extension is to loop over the *M* dimension to allow computation of C+=AB for *M*=64, *N*=6 and *K*=64.
+
 - File ``submissions/submission_25_05_01/neon_3_2.s``

 .. code-block:: asm
@@ -397,12 +427,15 @@ Loops
     // Loop back to M
     cbnz x16, matmul_loop_over_M

+.. _neon_3_loop_over_N:

 3. Loop over N
 ^^^^^^^^^^^^^^

 **Task**: Loop over N: Implement a kernel that computes C+=AB for M=64, N=48 and K=64. Wrap your kernel in the ``matmul_64_48_64`` function.

+The final extension is to loop over the *N* dimension to allow computation of C+=AB for *M*=64, *N*=48 and *K*=64.
+
 - File ``submissions/submission_25_05_01/neon_3_3.s``

 .. code-block:: asm
@@ -459,7 +492,8 @@ Loops

 **Subtask**: Optimization

-Usage of already optimized `matmul_16_6_1` from task 2.
+Usage of the already optimized ``matmul_16_6_1`` from task :ref:`neon_2_optimization` as the inner microkernel for the
+loops over K, M, and N.

 **Subtask**: Benchmarks

@@ -497,16 +531,26 @@ We run the benchmark with the following command:
 SIMD Lanes
 ----------

-This section considers matrix-matrix multiplications, that require instructions where only a subset of SIMD lanes are active.
+Up to this point, our *M* and *K* dimensions have always been multiples of 4. This allowed us to fully utilize all SIMD lanes when loading
+and storing data from memory. That means we could load or store 4 floats at once using a single instruction, which reduces complexity and
+improves the performance of our kernels.
+
+However, this assumption doesn't always hold in real-world applications. To make our implementation more robust, we need to adapt our
+kernels to handle cases where the *M* and *K* dimensions are not multiples of 4. Fortunately, Neon supports loading 4, 2, or 1 float(s) at a
+time, which enables us to manage these edge cases.

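For illustration, the three load widths look like this (a sketch; register and address choices are arbitrary):

.. code-block:: asm

    // Illustrative only: Neon loads of 4, 2 and 1 float(s).
    ldr q0, [x0]         // 4 floats (128 bits)
    ldr d1, [x0, #16]    // 2 floats (64 bits)
    ldr s2, [x0, #24]    // 1 float  (32 bits)

The corresponding ``str`` variants store the same widths.
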
 1. matmul_14_6_64
 ^^^^^^^^^^^^^^^^^

 **Task**: Implement a kernel that computes C+=AB for M=14, N=6 and K=64. Wrap your kernel in the ``matmul_14_6_64`` function.

+We first have a look at the case of an *M* dimension of 14. Data management can be done by loading/storing three columns of 4
+floats and one column of 2 floats.
+
 File: ``neon_4_1.s``

-For this kernel ``matmul_14_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we now use 3 ``fmla`` instructions that operate on 4 scalars, and one ``fmla`` instruction that only uses 2 scalars: :math:`4 \cdot 3 + 1 \cdot 2 = 14`.
+For this kernel ``matmul_14_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we now use 3
+``fmla`` instructions that operate on 4 scalars, and one ``fmla`` instruction that only uses 2 scalars: :math:`4 \cdot 3 + 1 \cdot 2 = 14`.

 We load the full 16 floats and ignore the last 2:

@@ -563,7 +607,8 @@ Next the loop over K:
     fmla v20.2s, v3.2s, v4.s[0]
     ...

-We store the full 16 computed floats back to memory but only add an offset of 14 floats because the last two floats aren't used. The last 14 values are exactly stored (8+4+2).
+We store the full 16 computed floats back to memory but only add an offset of 14 floats because the last two floats aren't used.
+The last 14 values we have to save back to memory are stored exactly (8+4+2) so that we do not write into memory we may not own.

 .. code-block:: asm
    :linenos:
@@ -590,9 +635,13 @@ We store the full 16 computed floats back to memory but only add an offset of 14

 **Task**: Implement a kernel that computes C+=AB for M=15, N=6 and K=64. Wrap your kernel in the ``matmul_15_6_64`` function.

+The second edge case we manage is an *M* dimension of 15. Data management can be done by loading/storing three columns
+of 4 floats, one column of 2 floats, and a single float.
+
 File: ``neon_4_2.s``

-For this kernel ``matmul_15_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we ignore the last computed float value from the 4 ``fmla`` instructions when saving back to memory.
+For this kernel ``matmul_15_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we ignore the last
+computed float value from the four ``fmla`` instructions when saving back to memory.

 We load the full 16 floats and ignore the last one:

@@ -652,7 +701,8 @@ Next the loop over K:
     fmla v20.4s, v3.4s, v4.s[0]
     ...

-We store the full 16 computed floats back to memory but only add an offset of 15 floats because the last float isn't used. The last 15 values are exactly stored (8+4+2+1).
+We store the full 16 computed floats back to memory but only add an offset of 15 floats because the last float isn't used. However, the last 15
+values are stored exactly (8+4+2+1) back to memory so that we do not write into memory we may not own.

 .. code-block:: asm
    :linenos:
@@ -681,6 +731,9 @@ We store the full 16 computed floats back to memory but only add an offset of 15

 **Task**: Test and optimize the kernels. Report your performance in GFLOPS.

+Since we already optimized the base kernel ``matmul_16_6_1`` in task :ref:`neon_2_optimization`, we did not find any further
+optimizations for the kernels ``matmul_14_6_64`` and ``matmul_15_6_64``.
+
 Optimized benchmark results:

 .. code-block::
@@ -702,8 +755,8 @@ Optimized benchmark results:
 - **matmul_14_6_64** kernel: :math:`113.8` GFLOPS
 - **matmul_15_6_64** kernel: :math:`121.1` GFLOPS

-Accumulator Block Shapes
-------------------------
+Accumulator Shapes
+------------------

 This section considers a matrix-matrix multiplication where a high-performance implementation may require accumulator blocks with different shapes.

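The reason a different accumulator shape is needed follows from the dimensions: :math:`N = 64` is not divisible by the 6 accumulator columns used so far, but it is divisible by 4:

.. math::

   64 = 16 \cdot 4 \quad\text{(N loop)}
   \qquad\text{and}\qquad
   64 = 4 \cdot 16 \quad\text{(M loop)},

which is why the kernel below drops two of the six ``fmla`` column blocks and works with a :math:`16 \times 4` accumulator.
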
@@ -714,7 +767,8 @@ This section considers a matrix-matrix multiplication where a high-performance i

 File: ``neon_5_1.s``

-For this kernel ``matmul_64_64_64`` we adapt the already implemented kernel ``matmul_64_48_64``. The only changes is that we removed two ``fmla`` blocks from the inner loop:
+For this kernel ``matmul_64_64_64`` we adapt the already implemented kernel ``matmul_64_48_64``. The only change is that we removed
+two ``fmla`` blocks from the inner loop:

 .. code-block:: asm
    :linenos:
@@ -782,7 +836,7 @@ For this kernel ``matmul_64_64_64`` we adapt the already implemented kernel ``ma
     cbnz x15, matmul_loop_over_K
     ...

-Then changed the number of loops over M to four :math:`4 \cdot 16 = 64`:
+Then we changed the number of loops over M to four to achieve :math:`4 \cdot 16 = 64`:

 .. code-block:: asm
    :linenos:
@@ -826,7 +880,7 @@ And finaly changed the number of loops over N to 16 :math:`16 \cdot 4 = 64`:

 **Task**: Test and optimize the kernel. Report your performance in GFLOPS.

-Optimized benchmark result:
+After experimenting with different loop orders, we stayed with the current order of loops over N, M, and K. The benchmark results are listed below.

 .. code-block::

@@ -844,16 +898,20 @@ Optimized benchmark result:
 Batch-Reduce GEMM
 -----------------

-This section considers a batch-reduce matrix-matrix multiplication that has a fourth dimension in addition to the known M, N, and K dimensions.
+This section examines a batch-reduced matrix-matrix multiplication that introduces a fourth dimension *C* alongside the known
+*M*, *N*, and *K* dimensions. A batch-reduced matrix-matrix multiplication (BRGEMM or BRMM) is an operation where multiple pairs
+of matrices are multiplied, and their results are accumulated into a single output matrix. This operation is commonly used in
+machine learning to efficiently perform repeated matrix multiplications with summation across a batch dimension.

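Formally, with a batch-reduce dimension of size 16, the kernel of this section computes

.. math::

   C \leftarrow C + \sum_{i=0}^{15} A_i B_i,
   \qquad A_i \in \mathbb{R}^{64 \times 64},\;
   B_i \in \mathbb{R}^{64 \times 48},\;
   C \in \mathbb{R}^{64 \times 48}.
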
 1. matmul_64_48_64_16
 ^^^^^^^^^^^^^^^^^^^^^

-**Task**: mplement a kernel that computes C+=∑AᵢBᵢ for M=64, N=48 and K=64 and a batch-reduce dimension size of 16. Wrap your kernel in the ``matmul_64_48_64_16`` function.
+**Task**: Implement a kernel that computes C+=∑AᵢBᵢ for M=64, N=48 and K=64 and a batch-reduce dimension size of 16. Wrap your kernel
+in the ``matmul_64_48_64_16`` function.

-File: ``neon_6_1.s``
+- File: ``neon_6_1.s``

-We started by implementing a kernel ``matmul_64_48_64`` with a batch dimension of one which is in the file ``neon_6_1_batch1.s``.
+We started by using our ``matmul_64_48_64`` kernel from :ref:`neon_3_loop_over_N` with a batch dimension of one, which is in the file ``neon_6_1_batch1.s``.

 .. code-block:: asm
    :linenos:
@@ -891,7 +949,7 @@ We started by implementing a kernel ``matmul_64_48_64`` with a batch dimension o
     // Loop back to N
     cbnz x17, matmul_loop_over_N

-Then we wrapped the ``matmul_64_48_64`` kernel inside another batch loop of size 16:
+Then we wrapped the ``matmul_64_48_64`` kernel inside another loop of size 16, representing the batch dimension:

 .. code-block:: asm
    :linenos:
@@ -947,10 +1005,10 @@ Then we wrapped the ``matmul_64_48_64`` kernel inside another batch loop of size
 **Task**: Test and optimize the kernel. Report your performance in GFLOPS.

 We tested a variation in which the batch loop was positioned between the M and K loops. This approach achieved around :math:`73` GFLOPS.
-We suspect that the reason for this was that the matrices did not fit into the cache.
-We do not follow this approach due to the poor performance, and we lost the file due to a false ``rm`` statement.
+We suspect that the reason for this was that the matrices did not fit into the cache. Therefore, we did not pursue this approach further due to
+the poor performance.

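The cache argument can be put into numbers from the matrix sizes (FP32, i.e. 4 bytes per element):

.. math::

   |A_i| = 64 \cdot 64 \cdot 4\,\text{B} = 16\,\text{KiB}, \quad
   |B_i| = 64 \cdot 48 \cdot 4\,\text{B} = 12\,\text{KiB}, \quad
   |C| = 64 \cdot 48 \cdot 4\,\text{B} = 12\,\text{KiB},

so all 16 :math:`A_i`/:math:`B_i` pairs together occupy :math:`16 \cdot (16 + 12)\,\text{KiB} = 448\,\text{KiB}`. With the batch loop outermost, presumably only one pair plus C (about 40 KiB) is live at a time, which matches the observed performance difference.
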
-However, this leads us to assume that our result of putting the batch loop outside is satisfactory.
+However, this leads us to conclude that placing the batch loop outermost is a good choice. The benchmark results are listed below.

 .. code-block::
    :emphasize-lines: 4, 8
@@ -974,8 +1032,9 @@ However, this leads us to assume that our result of putting the batch loop outsi
 Transposition
 -------------

-This section develops a kernel that performs the identity operation on the elements of an 8x8 column-major matrix A and stores the
-result in row-major format in matrix B.
+The final topic of this chapter covers matrix transposition. Transposing a matrix means swapping its rows and columns, which is a common
+operation in many matrix computations. We developed a kernel that performs the identity operation on the elements of an :math:`8 \times 8`
+column-major matrix A and writes the result in row-major format to matrix B.

 1. Transpose
 ^^^^^^^^^^^^
@@ -984,12 +1043,12 @@ result in row-major format in matrix B.

 File: ``neon_7_1.s``

-From the lecture, we already know the 4x4 transpose kernel. Therefore, we have the following idea:
+From the lecture, we already know the :math:`4 \times 4` transpose kernel. Therefore, we have the following idea:

 1. Divide the 8x8 matrix A into four 4x4 sub-matrices
 2. Transpose each 4x4 sub-matrix
 3. Save T(A) and T(D) sub-matrix to matrix B
-4. Swap B and C: Save T(B) to bottom-left sub-matrix of B and T(C) to top-right sub-matrix of B
+4. Swap sub-matrices B and C: Save T(B) to bottom-left sub-matrix of B and T(C) to top-right sub-matrix of B

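The :math:`4 \times 4` transpose building block referenced above is typically expressed with ``trn1``/``trn2`` followed by ``zip1``/``zip2``. The sketch below is an illustration and not necessarily identical to ``neon_7_1.s``; the rows of the sub-matrix are assumed to sit in ``v0``-``v3``.

.. code-block:: asm

    // Illustrative 4x4 FP32 transpose: rows in v0-v3, transposed rows in v16-v19.
    trn1 v4.4s, v0.4s, v1.4s    // [a0 b0 a2 b2]
    trn2 v5.4s, v0.4s, v1.4s    // [a1 b1 a3 b3]
    trn1 v6.4s, v2.4s, v3.4s    // [c0 d0 c2 d2]
    trn2 v7.4s, v2.4s, v3.4s    // [c1 d1 c3 d3]
    zip1 v16.2d, v4.2d, v6.2d   // [a0 b0 c0 d0]
    zip1 v17.2d, v5.2d, v7.2d   // [a1 b1 c1 d1]
    zip2 v18.2d, v4.2d, v6.2d   // [a2 b2 c2 d2]
    zip2 v19.2d, v5.2d, v7.2d   // [a3 b3 c3 d3]
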
 .. image:: ../_static/images/report_25_05_22/trans_8_8.png
    :align: left
@@ -1092,6 +1151,8 @@ Code:

 **Task**: Test and optimize your kernel. Report its performance in GiB/s.

+We benchmarked the performance of our transpose kernel and achieved the following results:
+
 .. code-block::
    :emphasize-lines: 4
10971158
@@ -1104,5 +1165,4 @@ Code:
     Trans8x8Fixture/BT_tran_8_8/min_warmup_time:1.000_cv 0.59 % 0.59 % 10 0.58%

-- **tran_8_8** kernel: :math:`50.5` GiB/s
-
+- **tran_8_8** kernel: :math:`101.2` GiB/s
