Commit cb32d16

committed
doc: neon 1
1 parent 1767df1 commit cb32d16

2 files changed

+16
-10
lines changed

docs_sphinx/index.rst

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ Machine Learning Compilers
2121

2222
submissions/report_25_04_17.rst
2323
submissions/report_25_04_24.rst
24+
submissions/report_25_05_01.rst
2425

2526
.. toctree::
2627
:maxdepth: 4

docs_sphinx/submissions/report_25_05_01.rst

Lines changed: 15 additions & 10 deletions
@@ -12,31 +12,34 @@ This section microbenchmarks the execution throughput and latency of FP32 Neon i
1212
.. image:: ../_static/images/report_25_05_01/neon_1_1.png
1313
:align: center
1414

15-
FMLA (vector) with arrangement specifier ``4S``
15+
**FMLA (vector) with arrangement specifier ``4S``**
16+
1617
- File: ``submissions/submission_25_05_01/neon_1_1.s``
1718
- Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
1819
- Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
1920
- We have :math:`13.2304 \cdot 10^{10}` instructions per second.
2021
That is :math:`13.2304 \cdot 10^{10} / 8 = 16.538 \cdot 10^9` instructions per ALU per second.
21-
This aligns with a **throughput of :math:`\approx 4` instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
22+
This aligns with a **throughput of** :math:`\approx 4` **instructions per cycle**, as benchmarks show that the performance cores of the M4 chip run at a clock speed of 4.4 GHz.
23+
2224

25+
**FMLA (vector) with arrangement specifier ``2S``**
2326

24-
FMLA (vector) with arrangement specifier ``2S``
2527
- File: ``submissions/submission_25_05_01/neon_1_1.s``
2628
- Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
2729
- Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
2830
- We have :math:`6.65221 \cdot 10^{10}` instructions per second.
2931
That is :math:`6.65221 \cdot 10^{10} / 8 = 8.31526 \cdot 10^9` instructions per ALU per second.
30-
This aligns with a **throughput of :math:`\approx 2` instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
32+
This aligns with a **throughput of** :math:`\approx 2` **instructions per cycle**, as benchmarks show that the performance cores of the M4 chip run at a clock speed of 4.4 GHz.
33+
3134

35+
**FMADD (scalar), single-precision variant**
3236

33-
FMADD (scalar), single-precision variant
3437
- File: ``submissions/submission_25_05_01/neon_1_1.s``
3538
- Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
3639
- Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
3740
- We have :math:`1.12728 \cdot 10^{10}` instructions per second.
3841
That is :math:`1.12728 \cdot 10^{10} / 8 = 1.4091 \cdot 10^9` instructions per ALU per second.
39-
This aligns with a **throughput of :math:`\approx 1/3` instruction per cycle**, as it is known from benchmarks that the performance cores of the M4 chip have a clock speed of 4.4 GHz.
42+
This aligns with a **throughput of** :math:`\approx 1/3` **instructions per cycle**, as benchmarks show that the performance cores of the M4 chip run at a clock speed of 4.4 GHz.
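
The arithmetic behind the three throughput figures above can be reproduced with a short script. This is only a sketch: the measured rates, the divisor of 8, and the 4.4 GHz clock speed are taken from the report as stated, while the names ``per_cycle``, ``measured``, and ``results`` are hypothetical and not part of the submission.

```python
# Values as stated in the report (not re-measured here).
CLOCK_HZ = 4.4e9  # assumed clock speed of the M4 performance cores
DIVISOR = 8       # divisor the report applies to the total instruction rate

# Total measured instructions per second for each benchmark.
measured = {
    "FMLA 4S": 13.2304e10,
    "FMLA 2S": 6.65221e10,
    "FMADD (scalar)": 1.12728e10,
}


def per_cycle(rate, divisor=DIVISOR, clock_hz=CLOCK_HZ):
    """Convert a total instruction rate into instructions per cycle."""
    return rate / divisor / clock_hz


results = {name: per_cycle(rate) for name, rate in measured.items()}

for name, value in results.items():
    print(f"{name}: {value:.2f} instructions per cycle")
```

The script yields roughly 3.76, 1.89, and 0.32 instructions per cycle, which the report rounds to 4, 2, and 1/3.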
4043

4144

4245
1. Microbenchmark the execution latency of FMLA (vector) with arrangement specifier 4S. Consider the following two cases:
@@ -45,16 +48,18 @@ FMADD (scalar), single-precision variant
4548
.. image:: ../_static/images/report_25_05_01/neon_1_2.png
4649
:align: center
4750

48-
Dependency on one of the source registers
51+
**Dependency on one of the source registers**
52+
4953
- File: ``submissions/submission_25_05_01/neon_1_2.s``
5054
- Driver: ``submissions/submission_25_05_01/neon_1_2_driver.cpp``
5155
- Compilation: ``g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s``
5256
- We have :math:`11.4961 \cdot 10^9` instructions per second on a single ALU.
53-
Resulting in a **latency of :math:`\approx 1` cycle** for the known clock speed of 4.4 GHz.
57+
This results in a **latency of** :math:`\approx 1` **cycle** at the known clock speed of 4.4 GHz.
58+
59+
**Dependency on the destination register only**
5460

55-
Dependency on the destination register only
5661
- File: ``submissions/submission_25_05_01/neon_1_2.s``
5762
- Driver: ``submissions/submission_25_05_01/neon_1_2_driver.cpp``
5863
- Compilation: ``g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s``
5964
- We have :math:`11.7019 \cdot 10^9` instructions per second on a single ALU.
60-
Resulting in a **latency of :math:`\approx 3 cycle`** for the known clock speed of 4.4 GHz.
65+
This results in a **latency of** :math:`\approx 3` **cycles** at the known clock speed of 4.4 GHz.
