Commit 3c2ff7d

Merge pull request #2328 from madeline-underwood/simd_loops
Simd loops_JA to review
2 parents d7c1a28 + 5f96715 commit 3c2ff7d

5 files changed

+117
-223
lines changed

Lines changed: 12 additions & 47 deletions
@@ -1,70 +1,35 @@
11
---
2-
title: About single instruction, multiple data (SIMD) loops
3-
weight: 3
2+
title: About Single Instruction, Multiple Data loops
3+
weight: 2
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
Writing high-performance software for Arm processors often involves delving into
10-
SIMD technologies. For many developers, that journey started with NEON, a
11-
familiar, fixed-width vector extension that has been around for many years. But as
12-
Arm architectures continue to evolve, so do their SIMD technologies.
9+
## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
1310

14-
Enter the world of Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME): two powerful, scalable vector extensions designed for modern
15-
workloads. Unlike NEON, they are not just wider; they are fundamentally different. These
16-
extensions introduce new instructions, more flexible programming models, and
17-
support for concepts like predication, scalable vectors, and streaming modes.
18-
However, they also come with a learning curve.
11+
Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
1912

20-
[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
13+
This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
2114

22-
SIMD Loops is designed to help
23-
you learn how to write SVE and SME code. It is a collection
24-
of self-contained, real-world loop kernels written in a mix of C, Arm C Language Extensions (ACLE)
25-
intrinsics, and inline assembly. These kernels target tasks ranging from simple arithmetic
26-
to matrix multiplication, sorting, and string processing. You can compile them,
27-
run them, step through them, and use them as a foundation for your own SIMD
28-
work.
15+
## What is the SIMD Loops project?
2916

30-
If you are familiar with NEON intrinsics, you can use SIMD Loops to learn and explore SVE and SME.
17+
The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
3118

32-
## What is SIMD Loops?
19+
Visit the [SIMD Loops Repo](https://gitlab.arm.com/architecture/simd-loops).
3320

34-
SIMD Loops is an open-source
35-
project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm
36-
architectures, specifically using SVE and SME.
37-
It is designed for programmers who already know
38-
their way around NEON intrinsics but are now facing the more powerful and
39-
complex world of SVE and SME.
21+
This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It’s aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel - a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
4022

41-
The goal of SIMD Loops is to provide working, readable examples that demonstrate
42-
how to use the full range of features available in SVE, SVE2, and SME2. Each
43-
example is a self-contained loop kernel, a small piece of code that performs
44-
a specific task like matrix multiplication, vector reduction, histogram, or
45-
memory copy. These examples show how that task can be implemented across different
46-
vector instruction sets.
47-
48-
Unlike a cookbook that tries to provide a recipe for every problem, SIMD Loops
49-
takes the opposite approach. It aims to showcase the architecture rather than
50-
the problem. The loop kernels are chosen to be realistic and meaningful, but the
51-
main goal is to demonstrate how specific features and instructions work in
52-
practice. If you are trying to understand scalability, predication,
53-
gather/scatter, streaming mode, ZA storage, compact instructions, or the
54-
mechanics of matrix tiles, this is where you will see them in action.
23+
Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.
5524

5625
The project includes:
57-
- Dozens of numbered loop kernels, each focused on a specific feature or pattern
26+
- Many numbered loop kernels, each focused on a specific feature or pattern
5827
- Reference C implementations to establish expected behavior
5928
- Inline assembly and/or intrinsics for scalar, NEON, SVE, SVE2, SVE2.1, SME2, and SME2.1
6029
- Build support for different instruction sets, with runtime validation
6130
- A simple command-line runner to execute any loop interactively
6231
- Optional standalone binaries for bare-metal and simulator use
6332

64-
You do not need to worry about auto-vectorization, compiler flags, or tooling
65-
quirks. Each loop is hand-written and annotated to make the use of SIMD features
66-
clear. The intent is that you can study, modify, and run each loop as a learning
67-
exercise, and use the project as a foundation for your own exploration of
68-
Arm’s vector extensions.
33+
You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
6934

7035

content/learning-paths/cross-platform/simd-loops/2-using.md

Lines changed: 28 additions & 45 deletions
@@ -1,43 +1,41 @@
11
---
22
title: Using SIMD Loops
3-
weight: 4
3+
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
To get started, clone the SIMD Loops project and change current directory:
9+
## Set up your development environment
10+
11+
To get started, clone the SIMD Loops project and change to the project directory:
1012

1113
```bash
1214
git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
1315
cd simd-loops.git
1416
```
1517

16-
Confirm you are using an Arm machine by running:
18+
Confirm that you are using an Arm machine:
1719

1820
```bash
1921
uname -m
2022
```
2123

22-
The output on Linux should be:
24+
Expected output on Linux:
2325

2426
```output
2527
aarch64
2628
```
2729

28-
And for macOS:
30+
Expected output on macOS:
2931

3032
```output
3133
arm64
3234
```
3335

3436
## SIMD Loops structure
3537

36-
In the SIMD Loops project, the source code for the loops is organized under the loops directory. The complete
37-
list of loops is documented in the `loops.inc` file, which includes a brief
38-
description and the purpose of each loop. Every loop is associated with a
39-
uniquely named source file following the naming pattern `loop_<NNN>.c`, where
40-
`<NNN>` represents the loop number.
38+
In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete list of loops is documented in the `loops.inc` file, which includes a brief description and the purpose of each loop. Every loop is associated with a uniquely named source file following the pattern `loop_<NNN>.c`, where `<NNN>` represents the loop number.
4139

4240
A subset of the `loops.inc` file is below:
4341

@@ -50,27 +48,27 @@ LOOP(005, "strlen short strings", "Use of FF and NF loads instructi
5048
LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
5149
LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
5250
LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
53-
LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE
51+
LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
5452
```
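The `LOOP(...)` entries look like an X-macro list, so one plausible way to consume them is to expand each entry into a table row. The sketch below is an illustration under that assumption, not the project's actual code: `LOOP_LIST`, `AS_ROW`, `FIRST_ARG`, `struct loop_info`, and `find_loop` are hypothetical names, and the list is inlined rather than included from `loops.inc`. Loop numbers are kept as strings because `008` and `009` are not valid octal literals in C.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical inline copy of three loops.inc entries (an assumption:
 * the real project may consume loops.inc differently). */
#define LOOP_LIST(LOOP)                                                  \
    LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions") \
    LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")    \
    LOOP(010, "Conditional reduction (fp)",                              \
         "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)

struct loop_info {
    const char *id;      /* "008": stored as text, never parsed as a number */
    const char *source;  /* "loop_008.c", following the loop_<NNN>.c pattern */
    const char *name;
    const char *purpose;
};

/* Pick the first variadic argument; the trailing 0 guarantees that the
 * variadic part is never empty. */
#define FIRST_ARG(first, ...) first

/* Turn one LOOP(...) entry into a table row. Optional trailing flags such
 * as STREAMING_COMPATIBLE are dropped in this sketch. */
#define AS_ROW(num, name, ...) \
    { #num, "loop_" #num ".c", name, FIRST_ARG(__VA_ARGS__, 0) },

static const struct loop_info loop_table[] = { LOOP_LIST(AS_ROW) };

#define LOOP_COUNT (sizeof loop_table / sizeof loop_table[0])

/* Look up a loop by its three-digit id; returns NULL when it is not listed. */
const struct loop_info *find_loop(const char *id) {
    for (size_t i = 0; i < LOOP_COUNT; i++) {
        if (strcmp(loop_table[i].id, id) == 0) {
            return &loop_table[i];
        }
    }
    return NULL;
}

/* Print one line per loop: id, source file, name, purpose. */
void print_loops(void) {
    for (size_t i = 0; i < LOOP_COUNT; i++) {
        printf("%s  %s  %s (%s)\n", loop_table[i].id, loop_table[i].source,
               loop_table[i].name, loop_table[i].purpose);
    }
}
```

Calling `find_loop("009")` from your own `main` returns the row whose `source` field is `loop_009.c`. The same list can be expanded a second time with a different row macro, which is the usual reason for keeping such entries in a separate include file.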
5553

5654
A loop is structured as follows:
5755

58-
```C
56+
```c
5957
// Includes and loop_<NNN>_data structure definition
6058

6159
#if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC)
6260

63-
// C code
61+
// C reference or auto-vectorized version
6462
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
6563

6664
#if defined(HAVE_xxx_INTRINSICS)
6765

68-
// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON) versions
66+
// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON)
6967
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
7068

7169
#elif defined(<ASM_COND>)
7270

73-
// Hand-written inline assembly :
71+
// Hand-written inline assembly
7472
// <ASM_COND> = __ARM_FEATURE_SME2p1, __ARM_FEATURE_SME2, __ARM_FEATURE_SVE2p1,
7573
// __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE, or __ARM_NEON
7674
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
@@ -81,67 +79,52 @@ void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
8179

8280
#endif
8381

84-
// Main of loop: Buffers allocations, loop function call, result functional checking
82+
// Main of loop: buffer allocation, loop function call, result checking
8583
```
8684

87-
Each loop is implemented in several SIMD extension variants, and conditional
88-
compilation is used to select one of the optimizations for the
89-
`inner_loop_<NNN>` function.
85+
Each loop is implemented in several SIMD extension variants. Conditional compilation selects one of the implementations for the `inner_loop_<NNN>` function.
9086

91-
The native C implementation is written first, and
92-
it can be generated either when building natively with `-DHAVE_NATIVE` or through
93-
compiler auto-vectorization `-DHAVE_AUTOVEC`.
87+
The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.
9488

95-
When SIMD ACLE is supported (SME, SVE, or NEON),
96-
the code is compiled using high-level intrinsics. If ACLE
97-
support is not available, the build process falls back to handwritten inline
98-
assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
99-
SVE2.1, SVE2, and others.
89+
When SIMD ACLE is supported (SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
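The selection mechanism described above can be sketched with a single kernel that has two bodies. This is a minimal illustration, not code from the repository: `struct demo_data`, `inner_loop_demo`, and `demo_variant` are hypothetical names, and the guard uses the compiler-defined `__ARM_NEON` and `__aarch64__` macros instead of the project's `HAVE_xxx` build flags. On a non-Arm host, the scalar body is compiled.

```c
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical data layout; the project's loop_<NNN>_data structs are
 * not shown in this text. */
struct demo_data {
    const float *a;
    const float *b;
    size_t n;
    float result;
};

#if defined(__ARM_NEON) && defined(__aarch64__)

/* NEON intrinsics variant: four fp32 lanes per iteration, scalar tail. */
void inner_loop_demo(struct demo_data *data) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= data->n; i += 4) {
        float32x4_t va = vld1q_f32(data->a + i);
        float32x4_t vb = vld1q_f32(data->b + i);
        acc = vfmaq_f32(acc, va, vb); /* acc += va * vb, lane by lane */
    }
    float sum = vaddvq_f32(acc);      /* horizontal add of the four lanes */
    for (; i < data->n; i++) {
        sum += data->a[i] * data->b[i];
    }
    data->result = sum;
}

const char *demo_variant(void) { return "neon-intrinsics"; }

#else

/* Scalar C variant: the reference behavior for the same function name. */
void inner_loop_demo(struct demo_data *data) {
    float sum = 0.0f;
    for (size_t i = 0; i < data->n; i++) {
        sum += data->a[i] * data->b[i];
    }
    data->result = sum;
}

const char *demo_variant(void) { return "scalar" ; }

#endif
```

Because both bodies share one function name and one data structure, the calling code does not change when the build selects a different variant. Floating-point results can differ between variants when the summation order changes, so a real check would typically compare against the reference with a tolerance unless the inputs are chosen to be exactly representable.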
10090

101-
The overall code structure also includes setup and
102-
cleanup code in the main function, where memory buffers are allocated, the
103-
selected loop kernel is executed, and results are verified for correctness.
91+
The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
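The allocate, run, verify sequence can be sketched as a small driver. The code below is an assumption-laden illustration rather than the project's `main`: `run_demo`, `struct demo_data`, and `inner_loop_demo` are hypothetical names, the kernel is a plain scalar inner product, and the printed message only imitates the checksum line in the example output.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical data layout for the sketch. */
struct demo_data {
    float *a;
    float *b;
    size_t n;
    float result;
};

/* Stand-in kernel: scalar fp32 inner product. */
static void inner_loop_demo(struct demo_data *data) {
    float sum = 0.0f;
    for (size_t i = 0; i < data->n; i++) {
        sum += data->a[i] * data->b[i];
    }
    data->result = sum;
}

/* Allocate buffers, run the kernel `iterations` times, verify the result.
 * Returns 0 on success and 1 on allocation failure or checksum mismatch. */
int run_demo(size_t n, int iterations) {
    struct demo_data d;
    d.a = malloc(n * sizeof *d.a);
    d.b = malloc(n * sizeof *d.b);
    d.n = n;
    d.result = 0.0f;
    if (d.a == NULL || d.b == NULL) {
        free(d.a);
        free(d.b);
        return 1;
    }

    /* Small integer inputs keep every partial sum exactly representable. */
    for (size_t i = 0; i < n; i++) {
        d.a[i] = (float)(i % 4);
        d.b[i] = 1.0f;
    }

    for (int it = 0; it < iterations; it++) {
        inner_loop_demo(&d);
    }

    /* Expected value computed independently with integer arithmetic:
     * each full group of four contributes 0+1+2+3 = 6. */
    size_t r = n % 4;
    float expected = (float)((n / 4) * 6 + r * (r - 1) / 2);

    int ok = (d.result == expected);
    printf(" - Checksum %s.\n", ok ? "correct" : "incorrect");

    free(d.a);
    free(d.b);
    return ok ? 0 : 1;
}
```

Calling `run_demo(1024, 5)` runs the kernel five times over 1024 elements and returns 0 when the checksum matches. Computing the expected value by a different route than the kernel is the design point: a check that reuses the kernel's own code path cannot catch errors in it.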
10492

105-
At compile time, you can select which loop optimization to compile, whether it
106-
is based on SME or SVE intrinsics, or one of the available inline assembly
107-
variants.
93+
At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
10894

10995
```console
11096
make
11197
```
11298

113-
With no target specified the list of targets is printed:
99+
With no target specified, the list of targets is printed:
114100

115101
```output
116102
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
117103
```
118104

119-
You can build all loops for all targets using:
105+
Build all loops for all targets:
120106

121107
```console
122108
make all
123109
```
124110

125-
You can build all loops for a single target, such as NEON, using:
111+
Build all loops for a single target, such as NEON:
126112

127113
```console
128114
make neon
129115
```
130116

131-
As the result of the build, two types of binaries are generated.
132-
133-
The first is a single executable named `simd_loops`, which includes all the loop implementations.
117+
As a result of the build, two types of binaries are generated.
134118

135-
A specific loop can be selected by passing parameters to the
136-
program.
119+
The first is a single executable named `simd_loops`, which includes all loop implementations.
137120

138-
For example, to run loop 1 for 5 iterations using the NEON target:
121+
Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the NEON target:
139122

140123
```console
141124
build/neon/bin/simd_loops -k 1 -n 5
142125
```
143126

144-
The output is:
127+
Example output:
145128

146129
```output
147130
Loop 001 - FP32 inner product
@@ -151,13 +134,13 @@ Loop 001 - FP32 inner product
151134

152135
The second type of binary is an individual loop.
153136

154-
To run loop 1 as a standlone binary:
137+
To run loop 1 as a standalone binary:
155138

156139
```console
157140
build/neon/standalone/bin/loop_001.elf
158141
```
159142

160-
The output is:
143+
Example output:
161144

162145
```output
163146
- Checksum correct.
