Commit 1767df1

doc: neon 1
1 parent d513d4c commit 1767df1

3 files changed, with 49 additions and 3 deletions.


docs_sphinx/submissions/report_25_05_01.rst

Lines changed: 49 additions & 3 deletions
@@ -4,11 +4,57 @@ Submission 2025-05-01
Execution Throughput and Latency
--------------------------------

This section microbenchmarks the execution throughput and latency of FP32 Neon instructions.

1. Microbenchmark the execution throughput of the following instructions:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../_static/images/report_25_05_01/neon_1_1.png
   :align: center

FMLA (vector) with arrangement specifier ``4S``
  - File: ``submissions/submission_25_05_01/neon_1_1.s``
  - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
  - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
  - We measured :math:`13.2304 \cdot 10^{10}` instructions per second.
    That is :math:`13.2304 \cdot 10^{10} / 8 = 16.538 \cdot 10^9` instructions per ALU per second.
    This is consistent with a **throughput of** :math:`\approx 4` **instructions per cycle**, given that the performance cores of the M4 chip are known from benchmarks to run at a clock speed of 4.4 GHz.


FMLA (vector) with arrangement specifier ``2S``
  - File: ``submissions/submission_25_05_01/neon_1_1.s``
  - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
  - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
  - We measured :math:`6.65221 \cdot 10^{10}` instructions per second.
    That is :math:`6.65221 \cdot 10^{10} / 8 = 8.31526 \cdot 10^9` instructions per ALU per second.
    This is consistent with a **throughput of** :math:`\approx 2` **instructions per cycle**, given that the performance cores of the M4 chip are known from benchmarks to run at a clock speed of 4.4 GHz.


FMADD (scalar), single-precision variant
  - File: ``submissions/submission_25_05_01/neon_1_1.s``
  - Driver: ``submissions/submission_25_05_01/neon_1_1_driver.cpp``
  - Compilation: ``g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s``
  - We measured :math:`1.12728 \cdot 10^{10}` instructions per second.
    That is :math:`1.12728 \cdot 10^{10} / 8 = 1.4091 \cdot 10^9` instructions per ALU per second.
    This is consistent with a **throughput of** :math:`\approx 1/3` **instructions per cycle**, given that the performance cores of the M4 chip are known from benchmarks to run at a clock speed of 4.4 GHz.


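
The conversion from a measured instruction rate to a per-cycle throughput, as applied to the three instructions above, can be sketched as follows. This is a minimal illustration and not part of the submission: the helper name ``instructions_per_cycle`` is hypothetical, and the divisor of 8 and the 4.4 GHz clock speed are taken from the text as assumptions.

.. code-block:: cpp

   #include <cstdio>

   // Hypothetical helper (not part of the submission): converts a measured
   // instruction rate into instructions per cycle per execution unit.
   // Assumptions taken from the report: the total rate is divided by 8,
   // and the clock speed is 4.4 GHz.
   double instructions_per_cycle(double instr_per_second,
                                 double units = 8.0,
                                 double clock_hz = 4.4e9) {
     return instr_per_second / units / clock_hz;
   }

   int main() {
     // Measured rates as reported in this section.
     std::printf("FMLA 4S: %.2f\n", instructions_per_cycle(13.2304e10));
     std::printf("FMLA 2S: %.2f\n", instructions_per_cycle(6.65221e10));
     std::printf("FMADD:   %.2f\n", instructions_per_cycle(1.12728e10));
     return 0;
   }

Under these assumptions the three values come out at about 3.76, 1.89 and 0.32, which round to the stated throughputs of 4, 2 and 1/3 instructions per cycle.
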
2. Microbenchmark the execution latency of FMLA (vector) with arrangement specifier 4S. Consider the following two cases:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../_static/images/report_25_05_01/neon_1_2.png
   :align: center

Dependency on one of the source registers
  - File: ``submissions/submission_25_05_01/neon_1_2.s``
  - Driver: ``submissions/submission_25_05_01/neon_1_2_driver.cpp``
  - Compilation: ``g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s``
  - We measured :math:`11.4961 \cdot 10^9` instructions per second on a single ALU.
    This results in a **latency of** :math:`\approx 1` **cycle** at the known clock speed of 4.4 GHz.

Dependency on the destination register only
  - File: ``submissions/submission_25_05_01/neon_1_2.s``
  - Driver: ``submissions/submission_25_05_01/neon_1_2_driver.cpp``
  - Compilation: ``g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s``
  - We measured :math:`11.7019 \cdot 10^9` instructions per second on a single ALU.
    This results in a **latency of** :math:`\approx 3` **cycles** at the known clock speed of 4.4 GHz.
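
A driver for such measurements typically times many repetitions of a kernel and divides the number of executed instructions by the elapsed time. The following is a minimal sketch under stated assumptions, not the submission's actual driver: ``kernel_stub`` and ``measure_rate`` are hypothetical names, and the C++ loop is only a stand-in for the assembly kernels, which a real driver would presumably declare with ``extern "C"`` and link from the ``.s`` file.

.. code-block:: cpp

   #include <chrono>
   #include <cstdint>
   #include <cstdio>

   // Hypothetical stand-in for an assembly kernel (assumption, not the
   // submission's code). Runs a chain of dependent multiply-adds.
   float kernel_stub(std::uint64_t iterations) {
     volatile float acc = 0.0f;  // volatile keeps the loop from being removed
     for (std::uint64_t i = 0; i < iterations; ++i) {
       acc = acc * 0.5f + 1.0f;
     }
     return acc;
   }

   // Times one kernel call and returns instructions per second, computed as
   // iterations * instr_per_iteration / elapsed seconds. The caller supplies
   // instr_per_iteration; for an assembly kernel this would be the number of
   // benchmarked instructions in the loop body.
   template <typename Kernel>
   double measure_rate(Kernel kernel, std::uint64_t iterations,
                       std::uint64_t instr_per_iteration) {
     auto start = std::chrono::steady_clock::now();
     kernel(iterations);
     auto end = std::chrono::steady_clock::now();
     std::chrono::duration<double> elapsed = end - start;
     return static_cast<double>(iterations) *
            static_cast<double>(instr_per_iteration) / elapsed.count();
   }

   int main() {
     // Iteration count chosen arbitrarily for illustration.
     double rate = measure_rate(kernel_stub, 100000000ULL, 1);
     std::printf("%.3e instructions per second\n", rate);
     return 0;
   }

``std::chrono::steady_clock`` is used because it is monotonic, so the measured interval is not affected by system clock adjustments.
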
