Commit b40fcc4

Merge pull request #2330 from ArmDeveloperEcosystem/main
production update
2 parents: 290232d + d3c5f3b

55 files changed (+2296, -465 lines)

content/learning-paths/automotive/_index.md

Lines changed: 10 additions & 6 deletions

@@ -12,20 +12,24 @@ title: Automotive
 weight: 4
 subjects_filter:
 - Containers and Virtualization: 3
-- Performance and Architecture: 2
+- Performance and Architecture: 5
 operatingsystems_filter:
 - Baremetal: 1
-- Linux: 4
+- Linux: 7
+- macOS: 1
 - RTOS: 1
 tools_software_languages_filter:
-- Automotive: 1
-- C: 1
+- Arm Development Studio: 1
+- Arm Zena CSS: 1
+- C: 2
+- C++: 1
+- Clang: 2
 - DDS: 1
 - Docker: 2
+- GCC: 2
 - Python: 2
 - Raspberry Pi: 1
-- ROS 2: 1
-- ROS2: 2
+- ROS 2: 3
 - Rust: 1
 - Zenoh: 1
 ---
Lines changed: 12 additions & 47 deletions

@@ -1,70 +1,35 @@
 ---
-title: About single instruction, multiple data (SIMD) loops
-weight: 3
+title: About Single Instruction, Multiple Data loops
+weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-Writing high-performance software for Arm processors often involves delving into
-SIMD technologies. For many developers, that journey started with NEON, a
-familiar, fixed-width vector extension that has been around for many years. But as
-Arm architectures continue to evolve, so do their SIMD technologies.
+## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
 
-Enter the world of Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME): two powerful, scalable vector extensions designed for modern
-workloads. Unlike NEON, they are not just wider; they are fundamentally different. These
-extensions introduce new instructions, more flexible programming models, and
-support for concepts like predication, scalable vectors, and streaming modes.
-However, they also come with a learning curve.
+Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
 
-That is where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) becomes a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
 
-SIMD Loops is designed to help
-you learn how to write SVE and SME code. It is a collection
-of self-contained, real-world loop kernels written in a mix of C, Arm C Language Extensions (ACLE)
-intrinsics, and inline assembly. These kernels target tasks ranging from simple arithmetic
-to matrix multiplication, sorting, and string processing. You can compile them,
-run them, step through them, and use them as a foundation for your own SIMD
-work.
+## What is the SIMD Loops project?
 
-If you are familiar with NEON intrinsics, you can use SIMD Loops to learn and explore SVE and SME.
+The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
 
-## What is SIMD Loops?
+Visit the [SIMD Loops repository](https://gitlab.arm.com/architecture/simd-loops).
 
-SIMD Loops is an open-source
-project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm
-architectures, specifically using SVE and SME.
-It is designed for programmers who already know
-their way around NEON intrinsics but are now facing the more powerful and
-complex world of SVE and SME.
+This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It is aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel: a small piece of code that performs a specific task such as matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
 
-The goal of SIMD Loops is to provide working, readable examples that demonstrate
-how to use the full range of features available in SVE, SVE2, and SME2. Each
-example is a self-contained loop kernel, a small piece of code that performs
-a specific task like matrix multiplication, vector reduction, histogram, or
-memory copy. These examples show how that task can be implemented across different
-vector instruction sets.
-
-Unlike a cookbook that tries to provide a recipe for every problem, SIMD Loops
-takes the opposite approach. It aims to showcase the architecture rather than
-the problem. The loop kernels are chosen to be realistic and meaningful, but the
-main goal is to demonstrate how specific features and instructions work in
-practice. If you are trying to understand scalability, predication,
-gather/scatter, streaming mode, ZA storage, compact instructions, or the
-mechanics of matrix tiles, this is where you will see them in action.
+Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.
 
 The project includes:
-- Dozens of numbered loop kernels, each focused on a specific feature or pattern
+- Many numbered loop kernels, each focused on a specific feature or pattern
 - Reference C implementations to establish expected behavior
 - Inline assembly and/or intrinsics for scalar, NEON, SVE, SVE2, SVE2.1, SME2, and SME2.1
 - Build support for different instruction sets, with runtime validation
 - A simple command-line runner to execute any loop interactively
 - Optional standalone binaries for bare-metal and simulator use
 
-You do not need to worry about auto-vectorization, compiler flags, or tooling
-quirks. Each loop is hand-written and annotated to make the use of SIMD features
-clear. The intent is that you can study, modify, and run each loop as a learning
-exercise, and use the project as a foundation for your own exploration of
-Arm’s vector extensions.
+You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect: this is the core learning loop.

Lines changed: 105 additions & 35 deletions

@@ -1,45 +1,74 @@
 ---
 title: Using SIMD Loops
-weight: 4
+weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-To get started, clone the SIMD Loops project and change current directory:
+## Set up your development environment
+
+To get started, clone the SIMD Loops project and change to the project directory:
 
 ```bash
 git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
 cd simd-loops.git
 ```
 
+Confirm that you are using an Arm machine:
+
+```bash
+uname -m
+```
+
+Expected output on Linux:
+
+```output
+aarch64
+```
+
+Expected output on macOS:
+
+```output
+arm64
+```
+
 ## SIMD Loops structure
 
-In the SIMD Loops project, the
-source code for the loops is organized under the loops directory. The complete
-list of loops is documented in the loops.inc file, which includes a brief
-description and the purpose of each loop. Every loop is associated with a
-uniquely named source file following the naming pattern `loop_<NNN>.c`, where
-`<NNN>` represents the loop number.
+In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete list of loops is documented in the `loops.inc` file, which includes a brief description and the purpose of each loop. Every loop is associated with a uniquely named source file following the pattern `loop_<NNN>.c`, where `<NNN>` represents the loop number.
+
+A subset of the `loops.inc` file is shown below:
+
+```output
+LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
+LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
+LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
+LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
+LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
+```
 
 A loop is structured as follows:
 
-```C
+```c
 // Includes and loop_<NNN>_data structure definition
 
 #if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC)
 
-// C code
+// C reference or auto-vectorized version
 void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
 
 #if defined(HAVE_xxx_INTRINSICS)
 
-// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON) versions
+// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON)
 void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
 
 #elif defined(<ASM_COND>)
 
-// Hand-written inline assembly :
+// Hand-written inline assembly
 // <ASM_COND> = __ARM_FEATURE_SME2p1, __ARM_FEATURE_SME2, __ARM_FEATURE_SVE2p1,
 // __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE, or __ARM_NEON
 void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
@@ -50,28 +79,69 @@ void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
 
 #endif
 
-// Main of loop: Buffers allocations, loop function call, result functional checking
+// Main of loop: buffer allocation, loop function call, result checking
+```
+
+Each loop is implemented in several SIMD extension variants. Conditional compilation selects one of the implementations for the `inner_loop_<NNN>` function.
+
+The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.
+
+When SIMD ACLE is supported (SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
+
+The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
+
+At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants:
+
+```console
+make
+```
+
+With no target specified, the list of targets is printed:
+
+```output
+all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
+```
+
+Build all loops for all targets:
+
+```console
+make all
+```
+
+Build all loops for a single target, such as NEON:
+
+```console
+make neon
+```
+
+As a result of the build, two types of binaries are generated.
+
+The first is a single executable named `simd_loops`, which includes all loop implementations.
+
+Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the NEON target:
+
+```console
+build/neon/bin/simd_loops -k 1 -n 5
+```
+
+Example output:
+
+```output
+Loop 001 - FP32 inner product
+- Purpose: Use of fp32 MLA instruction
+- Checksum correct.
 ```
 
-Each loop is implemented in several SIMD extension variants, and conditional
-compilation is used to select one of the optimizations for the
-`inner_loop_<NNN>` function. The native C implementation is written first, and
-it can be generated either when building natively (HAVE_NATIVE) or through
-compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
-SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
-support is not available, the build process falls back to handwritten inline
-assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
-SVE2.1, SVE2, and others. The overall code structure also includes setup and
-cleanup code in the main function, where memory buffers are allocated, the
-selected loop kernel is executed, and results are verified for correctness.
-
-At compile time, you can select which loop optimization to compile, whether it
-is based on SME or SVE intrinsics, or one of the available inline assembly
-variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
-sme_intrinsics` ...).
-
-As the result of the build, two types of binaries are generated. The first is a
-single executable named `simd_loops`, which includes all the loop
-implementations. A specific loop can be selected by passing parameters to the
-program (e.g., `simd_loops -k <NNN> -n <iterations>`). The second type consists
-of individual standalone binaries, each corresponding to a specific loop.
+The second type of binary is an individual standalone binary for each loop.
+
+To run loop 1 as a standalone binary:
+
+```console
+build/neon/standalone/bin/loop_001.elf
+```
+
+Example output:
+
+```output
+- Checksum correct.
+```
