@@ -13,15 +13,46 @@ git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
1313cd simd-loops.git
1414```
1515
16+ Confirm you are using an Arm machine by running:
17+
18+ ``` bash
19+ uname -m
20+ ```
21+
22+ The output on Linux should be:
23+
24+ ``` output
25+ aarch64
26+ ```
27+
28+ And for macOS:
29+
30+ ``` output
31+ arm64
32+ ```
33+
1634## SIMD Loops structure
1735
18- In the SIMD Loops project, the
19- source code for the loops is organized under the loops directory. The complete
20- list of loops is documented in the loops.inc file, which includes a brief
36+ In the SIMD Loops project, the source code for the loops is organized under the loops directory. The complete
37+ list of loops is documented in the ` loops.inc ` file, which includes a brief
2138description and the purpose of each loop. Every loop is associated with a
2239uniquely named source file following the naming pattern ` loop_<NNN>.c ` , where
2340` <NNN> ` represents the loop number.
2441
42+ A subset of the ` loops.inc ` file is below:
43+
44+ ``` output
45+ LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
46+ LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
47+ LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
48+ LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
49+ LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
50+ LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
51+ LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
52+ LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
53+ LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
54+ ```
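
A list of `LOOP(...)` entries like this is typically consumed with the X-macro technique: the including file defines `LOOP` and then expands the list. The sketch below illustrates the idea under stated assumptions. `DEMO_LOOPS` is a hypothetical stand-in for `#include "loops.inc"`, and `struct loop_info`, `FIRST_ARG`, and `find_demo_loop` are names invented for this example; the project's own macros may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a few loops.inc entries. A real build would
   use #include "loops.inc" at the point where DEMO_LOOPS is expanded. */
#define DEMO_LOOPS \
    LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE) \
    LOOP(005, "strlen short strings", "Use of FF and NF loads instructions") \
    LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")

struct loop_info {
    const char *number;   /* loop number as text, for example "001" */
    const char *name;     /* brief description */
    const char *purpose;  /* purpose of the loop */
};

/* Entries have three or four arguments, so LOOP is variadic. FIRST_ARG
   picks the purpose string and drops the optional trailing flag. */
#define FIRST_ARG(first, ...) first
#define LOOP(num, name, ...) { #num, name, FIRST_ARG(__VA_ARGS__, 0) },

static const struct loop_info demo_loops[] = { DEMO_LOOPS };
#undef LOOP

/* Looks up an entry by its three-digit number; returns NULL if absent. */
const struct loop_info *find_demo_loop(const char *number) {
    for (size_t i = 0; i < sizeof demo_loops / sizeof demo_loops[0]; i++)
        if (strcmp(demo_loops[i].number, number) == 0)
            return &demo_loops[i];
    return NULL;
}
```

Redefining `LOOP` before each expansion lets one list generate several artifacts, such as a description table, a set of function prototypes, or a dispatch switch.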
55+
2556A loop is structured as follows:
2657
2758``` C
@@ -55,23 +86,79 @@ void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
5586
5687Each loop is implemented in several SIMD extension variants, and conditional
5788compilation is used to select one of the optimizations for the
58- ` inner_loop_<NNN> ` function. The native C implementation is written first, and
59- it can be generated either when building natively (HAVE_NATIVE) or through
60- compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
61- SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
89+ ` inner_loop_<NNN> ` function.
90+
91+ The native C implementation is written first, and
92+ it can be generated either when building natively with ` -DHAVE_NATIVE ` or through
93+ compiler auto-vectorization with ` -DHAVE_AUTOVEC ` .
94+
95+ When SIMD ACLE is supported (SME, SVE, or NEON),
96+ the code is compiled using high-level intrinsics. If ACLE
6297support is not available, the build process falls back to handwritten inline
6398assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
64- SVE2.1, SVE2, and others. The overall code structure also includes setup and
99+ SVE2.1, SVE2, and others.
100+
101+ The overall code structure also includes setup and
65102cleanup code in the main function, where memory buffers are allocated, the
66103selected loop kernel is executed, and results are verified for correctness.
67104
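The conditional-compilation pattern described above can be sketched as follows. This is an illustration only, not the project's code: `struct loop_demo_data`, `inner_loop_demo`, and the `DEMO_HAVE_NEON_INTRINSICS` macro are hypothetical names chosen for the example. The scalar branch is what a `-DHAVE_NATIVE` or `-DHAVE_AUTOVEC` build would compile; in the auto-vectorized case the compiler is left to vectorize the plain C loop.

```c
#include <stddef.h>

#if defined(DEMO_HAVE_NEON_INTRINSICS) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical data block for an fp32 inner product. */
struct loop_demo_data {
    const float *a;
    const float *b;
    size_t n;
    float res;
};

#if defined(DEMO_HAVE_NEON_INTRINSICS) && defined(__aarch64__)
/* NEON intrinsics variant: multiply-accumulate four lanes per iteration. */
void inner_loop_demo(struct loop_demo_data *data) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= data->n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(data->a + i), vld1q_f32(data->b + i));
    float sum = vaddvq_f32(acc);      /* horizontal add of the four lanes */
    for (; i < data->n; i++)          /* scalar tail for leftover elements */
        sum += data->a[i] * data->b[i];
    data->res = sum;
}
#else
/* Plain C variant, used for native or auto-vectorized builds. */
void inner_loop_demo(struct loop_demo_data *data) {
    float sum = 0.0f;
    for (size_t i = 0; i < data->n; i++)
        sum += data->a[i] * data->b[i];
    data->res = sum;
}
#endif
```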
68105At compile time, you can select which loop optimization to compile, whether it
69106is based on SME or SVE intrinsics, or one of the available inline assembly
70- variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
71- sme_intrinsics` ...).
72-
73- As the result of the build, two types of binaries are generated. The first is a
74- single executable named ` simd_loops ` , which includes all the loop
75- implementations. A specific loop can be selected by passing parameters to the
76- program (e.g., ` simd_loops -k <NNN> -n <iterations> ` ). The second type consists
77- of individual standalone binaries, each corresponding to a specific loop.
107+ variants.
108+
109+ ``` console
110+ make
111+ ```
112+
113+ With no target specified, the list of targets is printed:
114+
115+ ``` output
116+ all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
117+ ```
118+
119+ You can build all loops for all targets using:
120+
121+ ``` console
122+ make all
123+ ```
124+
125+ You can build all loops for a single target, such as NEON, using:
126+
127+ ``` console
128+ make neon
129+ ```
130+
131+ The build generates two types of binaries.
132+
133+ The first is a single executable named ` simd_loops ` , which includes all the loop implementations.
134+
135+ A specific loop can be selected by passing parameters to the
136+ program.
137+
138+ For example, to run loop 1 for 5 iterations using the NEON target:
139+
140+ ``` console
141+ build/neon/bin/simd_loops -k 1 -n 5
142+ ```
143+
144+ The output is:
145+
146+ ``` output
147+ Loop 001 - FP32 inner product
148+ - Purpose: Use of fp32 MLA instruction
149+ - Checksum correct.
150+ ```
151+
152+ The second type consists of individual standalone binaries, each corresponding to a specific loop.
153+
154+ To run loop 1 as a standalone binary:
155+
156+ ``` console
157+ build/neon/standalone/bin/loop_001.elf
158+ ```
159+
160+ The output is:
161+
162+ ``` output
163+ - Checksum correct.
164+ ```