- We have :math:`13.2304\cdot10^{10}` instructions per second.
- That is :math:`13.2304\cdot10^{10} / 8 = 16.538\cdot10^9` instructions per ALU per second.
- This aligns with a **throughput of** :math:`\approx4` **instructions per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
- In total, that is :math:`132.304` GFLOPS.
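
The per-cycle figure follows from dividing the measured per-ALU rate by the assumed 4.4 GHz clock:

.. math::

    \frac{16.538 \cdot 10^{9}}{4.4 \cdot 10^{9}} \approx 3.76 \approx 4 \text{ instructions per cycle}
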
**Subtask**: ``FMLA`` (vector) with arrangement specifier ``2S``.

- We have :math:`6.65221\cdot10^{10}` instructions per second.
- That is :math:`6.65221\cdot10^{10} / 8 = 8.31526\cdot10^9` instructions per ALU per second.
- This aligns with a **throughput of** :math:`\approx2` **instructions per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.

**Subtask**: ``FMADD`` (scalar).

- We have :math:`1.12728\cdot10^{10}` instructions per second.
- That is :math:`1.12728\cdot10^{10} / 8 = 1.4091\cdot10^9` instructions per ALU per second.
- This aligns with a **throughput of** :math:`\approx1/3` **instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
- In total, that is :math:`11.2728` GFLOPS.

**Summary**

It can be seen that the usage of SIMD lanes can increase the throughput significantly. From the scalar ``FMADD`` instruction to the vector
``FMLA`` instruction with arrangement specifier ``2S``, the throughput increases by a factor of about 6. The throughput of the vector
``FMLA`` instruction with arrangement specifier ``4S`` is about twice as high as the one with ``2S``, resulting in a factor of about 12
compared to the scalar ``FMADD`` instruction. This shows the power of SIMD instructions and how they can be used to increase the throughput.

2. Execution Latency
^^^^^^^^^^^^^^^^^^^^

**Task**: Microbenchmark the execution latency of ``FMLA`` (vector) with arrangement specifier ``4S``. Consider the following two cases:

Same structure as above, with the file containing the implementation of the subtask, the driver file that runs the assembly code,
a compilation command to create an executable, and a short description of the results. The results of the microbenchmarks are documented

Usage of the already optimized ``matmul_16_6_1`` kernel from task :ref:`neon_2_optimization` as the inner microkernel for the
loops over K, M, and N.

**Subtask**: Benchmarks

SIMD Lanes
----------

This section considers matrix-matrix multiplications that require instructions where only a subset of the SIMD lanes are active.
Up to this point, our *M* and *K* dimensions have always been multiples of 4. This allowed us to fully utilize all SIMD lanes when loading
and storing data from memory. That means we could load or store 4 floats at once using a single instruction, which reduces complexity and
improves the performance of our kernels.

However, this assumption doesn't always hold in real-world applications. To make our implementation more robust, we need to adapt our
kernels to handle cases where the *M* and *K* dimensions are not multiples of 4. Fortunately, Neon supports loading and storing 4, 2, or 1
float(s) at a time, which enables us to handle these edge cases.
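
As a small, self-contained illustration of these access widths (register and offset choices are arbitrary and not taken from our kernels):

.. code-block:: asm
    :linenos:

    ldr q0, [x0]          // load 4 floats (128 bits)
    ldr d1, [x0, #16]     // load 2 floats  (64 bits)
    ldr s2, [x0, #24]     // load 1 float   (32 bits)

    str q0, [x1]          // store 4 floats
    str d1, [x1, #16]     // store 2 floats
    str s2, [x1, #24]     // store 1 float
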

1. matmul_14_6_64
^^^^^^^^^^^^^^^^^

**Task**: Implement a kernel that computes C+=AB for M=14, N=6 and K=64. Wrap your kernel in the ``matmul_14_6_64`` function.

We first have a look at the case where we have an *M* dimension of 14. Data management can be done by loading/storing three columns of 4
floats and one column of 2 floats.
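
For example, one column of 14 floats can be written back with exactly three stores, so that nothing beyond its 14th element is touched
(a sketch with arbitrary registers, not the exact register allocation used in ``neon_4_1.s``):

.. code-block:: asm
    :linenos:

    stp q24, q25, [x7]        // rows 0-7  (8 floats)
    str q26, [x7, #32]        // rows 8-11 (4 floats)
    str d27, [x7, #48]        // rows 12-13 (2 floats)
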

File: ``neon_4_1.s``

For this kernel ``matmul_14_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we now use 3
``fmla`` instructions that operate on 4 scalars, and one ``fmla`` instruction that only uses 2 scalars: :math:`4\cdot3 + 1\cdot2 = 14`.

We load the full 16 floats and ignore the last 2:

Next the loop over K:

.. code-block:: asm
    :linenos:

    ...
    fmla v20.2s, v3.2s, v4.s[0]
    ...

We store the full 16 computed floats back to memory but only add an offset of 14 floats because the last two floats aren't used.
The last 14 values that we have to save back to memory are stored exactly (8+4+2) so that we do not write into memory we may not own.

.. code-block:: asm
    :linenos:

    ...

2. matmul_15_6_64
^^^^^^^^^^^^^^^^^

**Task**: Implement a kernel that computes C+=AB for M=15, N=6 and K=64. Wrap your kernel in the ``matmul_15_6_64`` function.

The second edge case we handle is the case where we have an *M* dimension of 15. Data management can be done by loading/storing three
columns of 4 floats, one column of 2 floats, and one single float.
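
The remaining single float can be handled, for example, by extracting the last valid lane into a scalar register before storing it
(a sketch with arbitrary registers, not necessarily the exact approach of ``neon_4_2.s``):

.. code-block:: asm
    :linenos:

    str d27, [x7, #48]        // rows 12-13 (2 floats)
    mov s28, v27.s[2]         // extract row 14 (third lane of v27)
    str s28, [x7, #56]        // row 14 (1 float)
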
File: ``neon_4_2.s``

For this kernel ``matmul_15_6_64`` we adapt the already implemented kernel ``matmul_16_6_64``. The only change is that we ignore the last
computed float value from the four ``fmla`` instructions when saving back to memory.

We load the full 16 floats and ignore the last one:

Next the loop over K:

.. code-block:: asm
    :linenos:

    ...
    fmla v20.4s, v3.4s, v4.s[0]
    ...

We store the full 16 computed floats back to memory but only add an offset of 15 floats because the last float isn't used. However, the last
15 values that we have to save back to memory are stored exactly (8+4+2+1) so that we do not write into memory we may not own.

.. code-block:: asm
    :linenos:

    ...

**Task**: Test and optimize the kernels. Report your performance in GFLOPS.

Since we already optimized the base kernel ``matmul_16_6_1`` in task :ref:`neon_2_optimization`, we did not find any further
optimizations for the kernels ``matmul_14_6_64`` and ``matmul_15_6_64``.

Optimized benchmark results:

.. code-block::

    ...

- **matmul_14_6_64** kernel: :math:`113.8` GFLOPS
- **matmul_15_6_64** kernel: :math:`121.1` GFLOPS

Accumulator Shapes
------------------

This section considers a matrix-matrix multiplication where a high-performance implementation may require accumulator blocks with different shapes.

1. matmul_64_64_64
^^^^^^^^^^^^^^^^^^

File: ``neon_5_1.s``

For this kernel ``matmul_64_64_64`` we adapt the already implemented kernel ``matmul_64_48_64``. The only change is that we removed
two ``fmla`` blocks from the inner loop:

.. code-block:: asm
    :linenos:

    ...
    cbnz x15, matmul_loop_over_K
    ...

Then we changed the number of loops over M to four to achieve :math:`4\cdot16 = 64`:

.. code-block:: asm
    :linenos:

    ...

And finally we changed the number of loops over N to 16 to achieve :math:`16 \cdot 4 = 64`:

.. code-block:: asm
    :linenos:

    ...

**Task**: Test and optimize the kernel. Report your performance in GFLOPS.

After experimenting with different loop orders, we stayed with the current order of loops over N, M, and K. The benchmark results are listed below.

.. code-block::

    ...

Batch-Reduce GEMM
-----------------

This section examines a batch-reduce matrix-matrix multiplication that introduces a fourth dimension *C* alongside the known
*M*, *N*, and *K* dimensions. A batch-reduce matrix-matrix multiplication (BRGEMM or BRMM) is an operation where multiple pairs
of matrices are multiplied, and their results are accumulated into a single output matrix. This operation is commonly used in
machine learning to efficiently perform repeated matrix multiplications with summation across a batch dimension.

1. matmul_64_48_64_16
^^^^^^^^^^^^^^^^^^^^^

**Task**: Implement a kernel that computes C+=∑AᵢBᵢ for M=64, N=48 and K=64 and a batch-reduce dimension size of 16. Wrap your kernel
in the ``matmul_64_48_64_16`` function.
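
Written out, with the batch index running over the 16 matrix pairs, the kernel therefore computes

.. math::

    C \mathrel{+}= \sum_{i=0}^{15} A_i B_i, \qquad A_i \in \mathbb{R}^{64 \times 64}, \; B_i \in \mathbb{R}^{64 \times 48}, \; C \in \mathbb{R}^{64 \times 48}.
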
File: ``neon_6_1.s``

We started by using our ``matmul_64_48_64`` kernel from :ref:`neon_3_loop_over_N` with a batch dimension of one, which can be found in the
file ``neon_6_1_batch1.s``.

.. code-block:: asm
    :linenos:

    ...
    // Loop back to N
    cbnz x17, matmul_loop_over_N

Then we wrapped the ``matmul_64_48_64`` kernel inside another loop of size 16, representing the batch dimension:

.. code-block:: asm
    :linenos:

    ...

**Task**: Test and optimize the kernel. Report your performance in GFLOPS.

We tested a variation in which the batch loop was positioned between the M and K loops. This approach achieved around :math:`73` GFLOPS.
We suspect that the reason for this was that the matrices did not fit into the cache. Due to the poor performance, we did not follow this
approach further.

However, this leads us to assume that putting the batch loop outside is a good choice. The benchmark results are listed below.

.. code-block::
    :emphasize-lines: 4, 8

    ...


Transposition
-------------

The final topic of this chapter covers matrix transposition. Transposing a matrix means swapping its rows and columns, which is a common
operation in many matrix computations. We developed a kernel that performs the identity operation on the elements of an :math:`8\times8`
matrix A stored in column-major format and writes the result in row-major format to matrix B.

1. Transpose
^^^^^^^^^^^^

File: ``neon_7_1.s``

From the lecture, we already know the :math:`4\times4` transpose kernel. Therefore, we have the following idea (a sketch of the
:math:`4\times4` transpose step is shown after the list):

1. Divide the 8x8 matrix A into four 4x4 sub-matrices
2. Transpose each 4x4 sub-matrix
3. Save T(A) and T(D) sub-matrix to matrix B
4. Swap sub-matrices B and C: Save T(B) to the bottom-left sub-matrix of B and T(C) to the top-right sub-matrix of B
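
A :math:`4\times4` transpose of a single sub-matrix can be sketched with ``trn`` and ``zip`` instructions as follows (register numbers
are arbitrary and the exact sequence may differ from the one used in ``neon_7_1.s``); the four source vectors are in ``v0``-``v3`` and
the transposed result ends up in ``v16``-``v19``:

.. code-block:: asm
    :linenos:

    // interleave pairs of 32-bit elements
    trn1 v4.4s, v0.4s, v1.4s
    trn2 v5.4s, v0.4s, v1.4s
    trn1 v6.4s, v2.4s, v3.4s
    trn2 v7.4s, v2.4s, v3.4s

    // interleave 64-bit halves to finish the 4x4 transpose
    zip1 v16.2d, v4.2d, v6.2d
    zip1 v17.2d, v5.2d, v7.2d
    zip2 v18.2d, v4.2d, v6.2d
    zip2 v19.2d, v5.2d, v7.2d
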