---
title: About Single Instruction, Multiple Data loops
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Introduction to SIMD on Arm and why it matters for performance

Writing high-performance software on Arm often means using single instruction, multiple data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but with a learning curve to match.
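The difference is easiest to see in loop structure. The sketch below is a portable illustration in plain C only, not actual SVE code: a fixed-width loop needs a scalar tail, while a predicated, vector-length-agnostic loop masks off inactive lanes in its final iteration.

```c
#include <stddef.h>

/* Fixed-width style (NEON-like): process 4 elements at a time,
 * then fall back to a scalar tail for the leftovers. */
void add_fixed(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)          /* "vector" body, width hard-coded */
        for (size_t j = 0; j < 4; j++)
            c[i + j] = a[i + j] + b[i + j];
    for (; i < n; i++)                  /* scalar tail */
        c[i] = a[i] + b[i];
}

/* Predicated, vector-length-agnostic style (SVE-like): one loop, and a
 * per-lane predicate disables lanes past n, so no scalar tail is needed. */
void add_predicated(const float *a, const float *b, float *c, size_t n) {
    const size_t VL = 4;                /* stand-in for the hardware vector length */
    for (size_t i = 0; i < n; i += VL)
        for (size_t j = 0; j < VL; j++)
            if (i + j < n)              /* predicate: whilelt-style lane mask */
                c[i + j] = a[i + j] + b[i + j];
}
```

With real SVE intrinsics the predicate comes from `svwhilelt_b32` and the loop step from `svcntw()`, so the same binary runs correctly on any hardware vector length.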
## What is the SIMD Loops project?
The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real-world loop kernels written in a mix of C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks ranging from simple arithmetic to matrix multiplication, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
Visit the [SIMD Loops repository](https://gitlab.arm.com/architecture/simd-loops).
This open-source project (BSD 3-Clause license) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It is aimed at developers who know NEON intrinsics and want to explore the newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel: a small piece of code that performs a specific task such as matrix multiplication, vector reduction, histogram, or memory copy. These examples show how the same task can be implemented across different vector instruction sets.
Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.
The project includes:

- Many numbered loop kernels, each focused on a specific feature or pattern
- Reference C implementations to establish expected behavior
- Inline assembly and/or intrinsics for scalar, NEON, SVE, SVE2, SVE2.1, SME2, and SME2.1
- Build support for different instruction sets, with runtime validation
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use

You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect: this is the core learning loop.
In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete list of loops is documented in the `loops.inc` file, which includes a brief description and the purpose of each loop. Every loop is associated with a uniquely named source file following the pattern `loop_<NNN>.c`, where `<NNN>` represents the loop number.
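A listing such as `loops.inc` is typically consumed with the X-macro pattern: each `LOOP(...)` entry expands differently depending on how the including file defines `LOOP`. Below is a minimal, hypothetical sketch of the idea; the project's actual macro arguments and entries may differ.

```c
#include <stddef.h>
#include <string.h>

/* Inline stand-in for loops.inc; in the project this is a separate file. */
#define LOOP_TABLE \
    LOOP(5, "strlen short strings", "Use of FF and NF loads instructions") \
    LOOP(8, "Precise fp64 add reduction", "Use of FADDA instructions")

struct loop_info { int number; const char *name; const char *purpose; };

/* Expand every LOOP entry into one row of a lookup table. */
#define LOOP(num, name, desc) { num, name, desc },
static const struct loop_info loops[] = { LOOP_TABLE };
#undef LOOP

/* Find a loop's metadata by number; returns NULL if absent. */
static const struct loop_info *find_loop(int number) {
    for (size_t i = 0; i < sizeof loops / sizeof loops[0]; i++)
        if (loops[i].number == number)
            return &loops[i];
    return NULL;
}
```

The same `LOOP_TABLE` can be re-expanded with other `LOOP` definitions, for example to generate function declarations or dispatch cases, which keeps the list of kernels in one place.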
A subset of the `loops.inc` file is below:

```output
LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
```

Each loop source file also contains a main function, annotated in the source like this:

```c
// Main of loop: buffer allocation, loop function call, result checking
```
Each loop is implemented in several SIMD extension variants. Conditional compilation selects one of the implementations for the `inner_loop_<NNN>` function.
The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.
When SIMD ACLE is supported (SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.

```console
make
```
With no target specified, the list of targets is printed:

```output
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
```
Build all loops for all targets:

```console
make all
```
Build all loops for a single target, such as NEON:

```console
make neon
```
As a result of the build, two types of binaries are generated.
The first is a single executable named `simd_loops`, which includes all loop implementations.
Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the NEON target: