doc: neon 1

Integer-Ctrl · Integer-Ctrl · commit cb32d1624411 · 2025-04-29T20:02:34.000+02:00
diff --git a/docs_sphinx/index.rst b/docs_sphinx/index.rst
@@ -21,6 +21,7 @@ Machine Learning Compilers
 
    submissions/report_25_04_17.rst
    submissions/report_25_04_24.rst
+   submissions/report_25_05_01.rst
 
 .. toctree::
    :maxdepth: 4
diff --git a/docs_sphinx/submissions/report_25_05_01.rst b/docs_sphinx/submissions/report_25_05_01.rst
@@ -12,31 +12,34 @@ This section microbenchmarks the execution throughput and latency of FP32 Neon i
 .. image:: ../_static/images/report_25_05_01/neon_1_1.png
     :align: center
 
-FMLA (vector) with arrangement specifier ``4S``
+**FMLA (vector) with arrangement specifier 4S**
+
 - File: ``submissions/submission_25_05_01/neon_1_1.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
 - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
 - We have :math:`13.2304 \cdot 10^{10}` instructions per second.
   That is :math:`13.2304 \cdot 10^{10} / 8 = 16.538 \cdot 10^9` instructions per ALU per second.
-  This aligns with a **throughput of :math:`\approx 4` instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
+  Given the 4.4 GHz clock speed of the M4 performance cores (known from benchmarks), this corresponds to a **throughput of** :math:`\approx 4` **instructions per cycle**.
+
 
+**FMLA (vector) with arrangement specifier 2S**
 
-FMLA (vector) with arrangement specifier ``2S``
 - File: ``submissions/submission_25_05_01/neon_1_1.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
 - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
 - We have :math:`6.65221 \cdot 10^{10}` instructions per second.
   That is :math:`6.65221 \cdot 10^{10} / 8 = 8.31526 \cdot 10^9` instructions per ALU per second.
-  This aligns with a **throughput of :math:`\approx 2` instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
+  Given the 4.4 GHz clock speed of the M4 performance cores (known from benchmarks), this corresponds to a **throughput of** :math:`\approx 2` **instructions per cycle**.
+
 
+**FMADD (scalar), single-precision variant**
 
-FMADD (scalar), single-precision variant
 - File: ``submissions/submission_25_05_01/neon_1_1.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
 - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
 - We have :math:`1.12728 \cdot 10^{10}` instructions per second.
   That is :math:`1.12728 \cdot 10^{10} / 8 = 1.4091 \cdot 10^9` instructions per ALU per second.
-  This aligns with a **throughput of :math:`\approx 1/3` instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
+  Given the 4.4 GHz clock speed of the M4 performance cores (known from benchmarks), this corresponds to a **throughput of** :math:`\approx 1/3` **instructions per cycle**.
 
 
 1. Microbenchmark the execution latency of FMLA (vector) with arrangement specifier 4S. Consider the following two cases:
@@ -45,16 +48,18 @@ FMADD (scalar), single-precision variant
 .. image:: ../_static/images/report_25_05_01/neon_1_2.png
     :align: center
 
-Dependency on one of the source registers
+**Dependency on one of the source registers**
+
 - File: ``submissions/submission_25_05_01/neon_1_2.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_2_driver.cpp``
 - Compilation: ``g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s``
 - We have :math:`11.4961 \cdot 10^9` instructions per second in a single ALU.
-  Resulting in a **latency of :math:`\approx 1` cycle** for the known clock speed of 4.4 GHz.
+  This results in a **latency of** :math:`\approx 1` **cycle** for the known clock speed of 4.4 GHz.
+
+**Dependency on the destination register only**
 
-Dependency on the destination register only
 - File: ``submissions/submission_25_05_01/neon_1_2.s``
 - Driver: ``submissions/submission_25_05_01/neon_1_2_driver.cpp``
 - Compilation: ``g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s``
 - We have :math:`11.7019 \cdot 10^9` instructions per second in a single ALU.
-  Resulting in a **latency of :math:`\approx 3 cycle`** for the known clock speed of 4.4 GHz.
+  This results in a **latency of** :math:`\approx 3` **cycles** for the known clock speed of 4.4 GHz.