---
title: About Single Instruction, Multiple Data loops
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Introduction to SIMD on Arm and why it matters for performance
Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
## What is the SIMD Loops project?
The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
Visit the [SIMD Loops repository](https://gitlab.arm.com/architecture/simd-loops).
This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It is aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel: a small piece of code that performs a specific task such as matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.
The project includes:
- Many numbered loop kernels, each focused on a specific feature or pattern
- Reference C implementations to establish expected behavior
- Inline assembly and/or intrinsics for scalar, NEON, SVE, SVE2, SVE2.1, SME2, and SME2.1
- Build support for different instruction sets, with runtime validation
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use
You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect: this is the core learning loop.
In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete list of loops is documented in the `loops.inc` file, which includes a brief description and the purpose of each loop. Every loop is associated with a uniquely named source file following the pattern `loop_<NNN>.c`, where `<NNN>` represents the loop number.
A subset of the `loops.inc` file is below:
```output
LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
```
Each loop is implemented in several SIMD extension variants. Conditional compilation selects one of the implementations for the `inner_loop_<NNN>` function.
The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.
When SIMD ACLE is supported (SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
```console
make
```
With no target specified, the list of targets is printed:
```output
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
```
Build all loops for all targets:
```console
make all
```
Build all loops for a single target, such as NEON:
```console
make neon
```
As a result of the build, two types of binaries are generated.
The first is a single executable named `simd_loops`, which includes all loop implementations.
Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the NEON target:
```console
build/neon/bin/simd_loops -k 1 -n 5
```
Example output:
```output
Loop 001 - FP32 inner product
- Purpose: Use of fp32 MLA instruction
- Checksum correct.
```