Commit 88e4fa1

Merge pull request #2324 from jasonrandrews/review
Final tech review of SIMD Loops
2 parents 44f6a5e + ad9bed0 commit 88e4fa1

5 files changed: 122 additions and 30 deletions

content/learning-paths/cross-platform/simd-loops/1-about.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ extensions introduce new instructions, more flexible programming models, and
1717
support for concepts like predication, scalable vectors, and streaming modes.
1818
However, they also come with a learning curve.
1919

20-
That is where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) becomes a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
20+
[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
2121

2222
SIMD Loops is designed to help
2323
you learn how to write SVE and SME code. It is a collection

content/learning-paths/cross-platform/simd-loops/2-using.md

Lines changed: 103 additions & 16 deletions
@@ -13,15 +13,46 @@ git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
1313
cd simd-loops.git
1414
```
1515

16+
Confirm you are using an Arm machine by running:
17+
18+
```bash
19+
uname -m
20+
```
21+
22+
The output on Linux should be:
23+
24+
```output
25+
aarch64
26+
```
27+
28+
And for macOS:
29+
30+
```output
31+
arm64
32+
```
33+
1634
## SIMD Loops structure
1735

18-
In the SIMD Loops project, the
19-
source code for the loops is organized under the loops directory. The complete
20-
list of loops is documented in the loops.inc file, which includes a brief
36+
In the SIMD Loops project, the source code for the loops is organized under the loops directory. The complete
37+
list of loops is documented in the `loops.inc` file, which includes a brief
2138
description and the purpose of each loop. Every loop is associated with a
2239
uniquely named source file following the naming pattern `loop_<NNN>.c`, where
2340
`<NNN>` represents the loop number.
2441

42+
A subset of the `loops.inc` file is below:
43+
44+
```output
45+
LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
46+
LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
47+
LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
48+
LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
49+
LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
50+
LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
51+
LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
52+
LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
53+
LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
54+
```
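
The `LOOP(...)` entries shown above have the shape of an X-macro list. The following is a minimal sketch of how such a list could be turned into a lookup table. It is not taken from the SIMD Loops sources: `LOOP_LIST`, `struct loop_info`, `loop_table`, and `find_loop` are hypothetical names, and the list is inlined here (with only two entries) instead of being pulled in with `#include "loops.inc"` so that the example stands alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record type; the real project may store different fields. */
struct loop_info {
    const char *number;
    const char *name;
    const char *purpose;
};

/* Stand-in for the contents of loops.inc (two entries copied from the text). */
#define LOOP_LIST \
    LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE) \
    LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")

/* Entries have three or four arguments, so the purpose string and the
 * optional attribute are captured together and only the first is kept. */
#define FIRST_OF_(first, ...) first
#define FIRST_OF(...) FIRST_OF_(__VA_ARGS__, 0)
#define LOOP(num, name, ...) { #num, name, FIRST_OF(__VA_ARGS__) },

static const struct loop_info loop_table[] = { LOOP_LIST };
#undef LOOP

const size_t loop_table_size = sizeof loop_table / sizeof loop_table[0];

/* Return the entry whose number matches (for example "001"), or NULL. */
const struct loop_info *find_loop(const char *number) {
    for (size_t i = 0; i < loop_table_size; i++) {
        if (strcmp(loop_table[i].number, number) == 0) {
            return &loop_table[i];
        }
    }
    return NULL;
}
```

Calling `find_loop("001")` returns the entry for the FP32 inner product. Redefining `LOOP` before expanding the list a second time would let the same list generate other artifacts, such as a table of function pointers.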
55+
2556
A loop is structured as follows:
2657

2758
```C
@@ -55,23 +86,79 @@ void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
5586

5687
Each loop is implemented in several SIMD extension variants, and conditional
5788
compilation is used to select one of the optimizations for the
58-
`inner_loop_<NNN>` function. The native C implementation is written first, and
59-
it can be generated either when building natively (HAVE_NATIVE) or through
60-
compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
61-
SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
89+
`inner_loop_<NNN>` function.
90+
91+
The native C implementation is written first, and
92+
it can be generated either when building natively with `-DHAVE_NATIVE` or through
93+
compiler auto-vectorization with `-DHAVE_AUTOVEC`.
94+
95+
When SIMD ACLE is supported (SME, SVE, or NEON),
96+
the code is compiled using high-level intrinsics. If ACLE
6297
support is not available, the build process falls back to handwritten inline
6398
assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
64-
SVE2.1, SVE2, and others. The overall code structure also includes setup and
99+
SVE2.1, SVE2, and others.
100+
101+
The overall code structure also includes setup and
65102
cleanup code in the main function, where memory buffers are allocated, the
66103
selected loop kernel is executed, and results are verified for correctness.
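
The conditional-compilation scheme described above can be sketched roughly as follows. This is an illustration only, not the project's code: `struct loop_sketch_data` and `inner_loop_sketch` are hypothetical names, and the `__aarch64__`/`__ARM_NEON` guard is an assumption made so the sketch also compiles on non-Arm machines. Only `HAVE_NATIVE` and `HAVE_AUTOVEC` come from the text; the macros the project uses to select its SIMD variants are not shown here.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical data block for an fp32 inner product. */
struct loop_sketch_data {
    const float *a;
    const float *b;
    size_t n;
    float res;
};

#if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC) || \
    !(defined(__aarch64__) && defined(__ARM_NEON))

/* Plain C version: used as-is, or left to the compiler's auto-vectorizer. */
void inner_loop_sketch(struct loop_sketch_data *data) {
    float sum = 0.0f;
    for (size_t i = 0; i < data->n; i++) {
        sum += data->a[i] * data->b[i];
    }
    data->res = sum;
}

#else

/* NEON intrinsics version: four lanes per iteration, then a scalar tail. */
void inner_loop_sketch(struct loop_sketch_data *data) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= data->n; i += 4) {
        /* acc += a[i..i+3] * b[i..i+3], as a fused multiply-add */
        acc = vfmaq_f32(acc, vld1q_f32(data->a + i), vld1q_f32(data->b + i));
    }
    float sum = vaddvq_f32(acc); /* horizontal add of the four lanes */
    for (; i < data->n; i++) {
        sum += data->a[i] * data->b[i];
    }
    data->res = sum;
}

#endif
```

Both branches expose the same function signature, so the surrounding setup and verification code does not need to know which variant was compiled in.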
67104

68105
At compile time, you can select which loop optimization to compile, whether it
69106
is based on SME or SVE intrinsics, or one of the available inline assembly
70-
variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
71-
sme_intrinsics` ...).
72-
73-
As the result of the build, two types of binaries are generated. The first is a
74-
single executable named `simd_loops`, which includes all the loop
75-
implementations. A specific loop can be selected by passing parameters to the
76-
program (e.g., `simd_loops -k <NNN> -n <iterations>`). The second type consists
77-
of individual standalone binaries, each corresponding to a specific loop.
107+
variants.
108+
109+
```console
110+
make
111+
```
112+
113+
With no target specified, the list of targets is printed:
114+
115+
```output
116+
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
117+
```
118+
119+
You can build all loops for all targets using:
120+
121+
```console
122+
make all
123+
```
124+
125+
You can build all loops for a single target, such as NEON, using:
126+
127+
```console
128+
make neon
129+
```
130+
131+
As the result of the build, two types of binaries are generated.
132+
133+
The first is a single executable named `simd_loops`, which includes all the loop implementations.
134+
135+
A specific loop can be selected by passing parameters to the
136+
program.
137+
138+
For example, to run loop 1 for 5 iterations using the NEON target:
139+
140+
```console
141+
build/neon/bin/simd_loops -k 1 -n 5
142+
```
143+
144+
The output is:
145+
146+
```output
147+
Loop 001 - FP32 inner product
148+
- Purpose: Use of fp32 MLA instruction
149+
- Checksum correct.
150+
```
151+
152+
The second type consists of individual standalone binaries, one for each loop.
153+
154+
To run loop 1 as a standalone binary:
155+
156+
```console
157+
build/neon/standalone/bin/loop_001.elf
158+
```
159+
160+
The output is:
161+
162+
```output
163+
- Checksum correct.
164+
```

content/learning-paths/cross-platform/simd-loops/3-example.md

Lines changed: 11 additions & 7 deletions
@@ -6,12 +6,16 @@ weight: 5
66
layout: learningpathall
77
---
88

9-
To illustrate the structure and design principles of simd-loops, consider loop
10-
202 as an example. `inner_loop_202` is defined at lines 69-79 in file
9+
To illustrate the structure and design principles of SIMD Loops, consider loop
10+
202 as an example.
11+
12+
Use a text editor to look at the file `loops/loop_202.c`.
13+
14+
The function `inner_loop_202()` is defined at lines 60-70 in file
1115
`loops/loop_202.c` and calls the `matmul_fp32` routine defined in
1216
`matmul_fp32.c`.
1317

14-
Open `loops/matmul_fp32.c`.
18+
Use a text editor to look at the file `loops/matmul_fp32.c`.
1519

1620
This loop implements a single precision floating point matrix multiplication of
1721
the form:
@@ -39,10 +43,10 @@ struct loop_202_data {
3943
```
4044

4145
For this loop:
42-
- The first input matrix (A) is stored in column-major format in memory.
46+
- The first input matrix (a) is stored in column-major format in memory.
4347
- The second input matrix (b) is stored in row-major format in memory.
44-
- None of the memory area designated by `a`, `b` anf `c` alias (i.e. they
45-
overlap in some way) --- as indicated by the `restrict` keyword.
48+
- None of the memory areas designated by `a`, `b`, and `c` alias (that is, they
49+
do not overlap in any way), as indicated by the `restrict` keyword.
4650

4751
This layout choice helps optimize memory access patterns for all the targeted
4852
SIMD architectures.
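
One way to see why this layout suits SIMD code is a scalar reference sketch, shown below. It is not the project's `matmul_fp32` implementation: the function name `matmul_fp32_sketch`, the parameter order, and the row-major layout of `c` are assumptions made for illustration. With `a` stored column-major and `b` stored row-major, each step `k` reads one contiguous column of `a` and one contiguous row of `b` and accumulates their outer product into `c`.

```c
#include <stddef.h>

/* Scalar sketch of C = A x B, where
 *   A is M x K, stored column-major: A[i][k] is a[k * M + i]
 *   B is K x N, stored row-major:    B[k][j] is b[k * N + j]
 *   C is M x N, assumed row-major:   C[i][j] is c[i * N + j]
 * The restrict qualifiers promise that the three buffers do not overlap. */
void matmul_fp32_sketch(size_t M, size_t K, size_t N,
                        const float *restrict a,
                        const float *restrict b,
                        float *restrict c) {
    for (size_t i = 0; i < M * N; i++) {
        c[i] = 0.0f;
    }
    for (size_t k = 0; k < K; k++) {
        const float *a_col = a + k * M; /* contiguous column k of A */
        const float *b_row = b + k * N; /* contiguous row k of B */
        for (size_t i = 0; i < M; i++) {
            for (size_t j = 0; j < N; j++) {
                c[i * N + j] += a_col[i] * b_row[j];
            }
        }
    }
}
```

Because both operands of each rank-1 update are contiguous in memory, a vectorized version can load them with unit-stride loads instead of gathers.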
@@ -59,7 +63,7 @@ This design enables portability across different SIMD extensions.
5963

6064
## Function implementation
6165

62-
The `matmul_fp32` function from file `loops/matmul_fp32.c` provides several
66+
The `matmul_fp32()` function from file `loops/matmul_fp32.c` provides several
6367
optimizations of the single-precision floating-point matrix multiplication,
6468
including the ACLE intrinsics-based code, and the assembly hand-optimized code.
6569

content/learning-paths/cross-platform/simd-loops/4-conclusion.md

Lines changed: 7 additions & 5 deletions
@@ -7,20 +7,22 @@ layout: learningpathall
77
---
88

99
SIMD Loops is an invaluable
10-
resource for developers looking to learn or master the intricacies of SVE and
11-
SME on modern Arm architectures. By providing practical, hands-on examples, it
10+
resource for developers looking to learn the intricacies of SVE and
11+
SME on a variety of Arm architectures. By providing practical, hands-on examples, it
1212
bridges the gap between the architecture specification and real-world
13-
application. Whether you're transitioning from NEON or starting fresh with SVE
13+
application.
14+
15+
Whether you're transitioning from NEON or starting fresh with SVE
1416
and SME, SIMD Loops offers a comprehensive toolkit to enhance your understanding
1517
and proficiency.
1618

1719
With its extensive collection of loop kernels, detailed documentation, and
18-
flexible build options, SIMD Loops empowers you to explore
20+
flexible build options, SIMD Loops helps you to explore
1921
and leverage the full potential of Arm's advanced vector extensions. Dive into
2022
the project, experiment with the examples, and take your high-performance coding
2123
skills for Arm to the next level.
2224

2325
For more information and to get started, visit the GitLab project and refer
2426
to the
2527
[README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md)
26-
for instructions on building and running the code.
28+
for the latest instructions on building and running the code.

content/learning-paths/cross-platform/simd-loops/_index.md

Lines changed: 0 additions & 1 deletion
@@ -31,7 +31,6 @@ operatingsystems:
3131
tools_software_languages:
3232
- GCC
3333
- Clang
34-
- FVP
3534

3635
shared_path: true
3736
shared_between:
