content/learning-paths/cross-platform/simd-loops/1-about.md (+8 -6)
@@ -1,22 +1,24 @@
 ---
-title: About single instruction, multiple data (SIMD) loops
+title: About Single Instruction, Multiple Data loops
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
+## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+
 Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
 
-This Learning Path uses the **Scalable Vector Extension (SVE)** and the **Scalable Matrix Extension (SME)** to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
 
-The **SIMD Loops** project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
+The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
 
-SIMD Loops is an open-source project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm architectures, specifically using SVE and SME. It is designed for programmers who already know their way around NEON intrinsics but are now facing the more powerful and complex world of SVE and SME.
+Visit the [SIMD Loops Repo](https://gitlab.arm.com/architecture/simd-loops).
 
-The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel, a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
+This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It’s aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel - a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
 
 Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.
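
The predication and vector-length-agnostic (VLA) programming mentioned in the paragraphs above are easiest to see in a small kernel. Below is a minimal sketch using SVE ACLE intrinsics; it is illustrative only, not taken from the SIMD Loops repository, and the function name `scale_f32` is invented:

```c
#include <arm_sve.h>
#include <stdint.h>

// Multiply n floats by a scalar. The same binary runs unchanged on any
// SVE vector length: svcntw() reports the 32-bit lane count at run time,
// and the svwhilelt predicate disables lanes past n, so no scalar tail
// loop is needed.
void scale_f32(float *dst, const float *src, float s, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);      // predicate for lanes i..n-1
        svfloat32_t v = svld1_f32(pg, &src[i]); // predicated load
        v = svmul_n_f32_x(pg, v, s);            // predicated multiply
        svst1_f32(pg, &dst[i], v);              // predicated store
    }
}
```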
content/learning-paths/cross-platform/simd-loops/3-example.md (+14 -18)
@@ -6,6 +6,8 @@ weight: 4
 layout: learningpathall
 ---
 
+## Overview: loop 202 matrix multiplication example
+
 To illustrate the structure and design principles of SIMD Loops, consider loop 202 as an example.
 
 Use a text editor to open `loops/loop_202.c`.
@@ -22,7 +24,7 @@ You can view matrix multiplication in two equivalent ways:
 - As the dot product between each row of `A` and each column of `B`
 - As the sum of outer products between the columns of `A` and the rows of `B`
 
-## Data structure
+## Data structure definition
 
 The loop begins by defining a data structure that captures the matrix dimensions (`M`, `K`, `N`) along with input and output buffers:
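
The two equivalent views in the bullets above correspond to standard linear-algebra identities for `C = A × B`, with `A` of size `M×K` and `B` of size `K×N` (nothing project-specific):

```latex
% Dot-product view: each output element is a row-column dot product.
C_{ij} = \sum_{k=1}^{K} A_{ik}\,B_{kj}

% Outer-product view: C is a sum of K rank-1 updates, one per column of A
% paired with the matching row of B. This is the view that maps onto
% SME's fmopa (outer product and accumulate) instruction.
C = \sum_{k=1}^{K} A_{:,k}\,B_{k,:}
```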
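For orientation, a descriptor of the kind this sentence describes might look like the sketch below; the struct name, field names, and layouts are assumptions for illustration, not the project's actual definition (see `loops/loop_202.c` for that):

```c
#include <stdint.h>

// Hypothetical kernel-arguments struct: dimensions plus input/output
// buffers, so every variant of the loop shares one calling convention.
struct matmul_args {
    uint64_t m, k, n;   // matrix dimensions M, K, N
    const float *a;     // M x K input matrix (layout assumed)
    const float *b;     // K x N input matrix (layout assumed)
    float *c;           // M x N output matrix
};
```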
@@ -44,15 +46,15 @@ For this loop:
 
 This layout helps optimize memory access patterns across the targeted SIMD architectures.
 
-## Loop attributes
+## Loop attributes by architecture
 
 Loop attributes are specified per target architecture:
 - **SME targets** — `inner_loop_202` is invoked with the `__arm_streaming` attribute and uses a shared `ZA` register context (`__arm_inout("za")`). These attributes are wrapped in the `LOOP_ATTR` macro
 - **SVE or NEON targets** — no additional attributes are required
 
 This design enables portability across SIMD extensions.
 
-## Function implementation
+## Function implementation in loops/matmul_fp32.c
 
 `loops/matmul_fp32.c` provides several optimizations of matrix multiplication, including ACLE intrinsics and hand-optimized assembly.
 
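The attribute arrangement described in the SME bullet can be sketched as follows; the `LOOP_ATTR` macro name comes from the text above, but this exact expansion and the argument type are assumptions for illustration:

```c
// On SME targets the kernel runs in streaming mode and shares ZA state
// with its caller; on SVE/NEON targets the macro expands to nothing.
#if defined(__ARM_FEATURE_SME)
#define LOOP_ATTR __arm_streaming __arm_inout("za")
#else
#define LOOP_ATTR
#endif

struct loop_args; // dimensions and buffers (defined elsewhere)

void inner_loop_202(struct loop_args *args) LOOP_ATTR;
```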
@@ -208,33 +210,27 @@ A snippet of the loop is shown below:
 ```
 
 Within the SME2 intrinsics code (lines 91–106), the innermost loop iterates across
-the `K` dimension—columns of `A` and rows of `B`
+the `K` dimension - columns of `A` and rows of `B`.
 
 In each iteration:
 - Two consecutive vectors are loaded from `A` and two from `B` (`vec_a*`, `vec_b*`) using multi-vector load intrinsics
 - `fmopa`, wrapped by `MOPA_TILE`, computes the outer product
 - Partial results accumulate in four 32-bit `ZA` tiles
 
-After all `K` iterations, results are written back in a store loop (lines 111–124)
+After all `K` iterations, results are written back in a store loop (lines 111–124).
 
-During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). Vectors are then stored to the output buffer using SME multi-vector `st1w` stores using `STORE_PAIR`
+During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). Vectors are then stored to the output buffer using SME multi-vector `st1w` stores using `STORE_PAIR`.
 
-The equivalent SME2 hand-optimized assembly appears around lines 229–340
+The equivalent SME2 hand-optimized assembly appears around lines 229–340.
 
-For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/)
+For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/).
 
 ## Other optimizations
 
-Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features.
-
-### NEON
-
-The NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-
-### SVE2.1
+Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:
 
-The SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
+- **NEON**: the NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
 
-### SME2.1
+- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
 
-The SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
+- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
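
The accumulate-then-store pattern described above (outer products into `ZA`, then a horizontal-read store loop) can be reduced to a single-tile sketch. This is a simplified illustration, not loop 202 itself: it uses one `ZA` tile instead of four, plain rather than multi-vector loads, and an invented function name and data layout:

```c
#include <arm_sme.h>

// C tile (VL x VL floats) = sum over k of outer(A column k, B row k).
// The caller must already be in streaming mode with ZA available.
void matmul_one_tile(const float *a_cols, const float *b_rows,
                     float *c, uint64_t K, uint64_t ldc)
    __arm_streaming __arm_inout("za")
{
    svbool_t pg = svptrue_b32();
    uint64_t vl = svcntw();          // 32-bit lanes per vector

    svzero_za();                     // clear the ZA accumulators
    for (uint64_t k = 0; k < K; k++) {
        svfloat32_t va = svld1_f32(pg, &a_cols[k * vl]); // column k of A
        svfloat32_t vb = svld1_f32(pg, &b_rows[k * vl]); // row k of B
        svmopa_za32_f32_m(0, pg, pg, va, vb);            // fmopa into tile 0
    }

    // Store phase: read ZA tile 0 row by row into a Z vector, then store.
    for (uint32_t row = 0; row < (uint32_t)vl; row++) {
        svfloat32_t vc = svread_hor_za32_f32_m(svdup_f32(0.0f), pg, 0, row);
        svst1_f32(pg, &c[row * ldc], vc);
    }
}
```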
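The indexed `fmla` technique from the NEON bullet is also worth seeing concretely. Here is a generic 4x4 micro-kernel in that style (an illustrative sketch, not the code from `matmul_fp32.c`): each lane of a loaded `A` vector scales a whole row of `B` via `vfmaq_laneq_f32`, so four accumulators advance per pair of loads.

```c
#include <arm_neon.h>

// C (4x4, row-major) += A (4x4) * B (4x4, row-major), where column k of
// A is assumed contiguous at a[k * 4] (column-major A).
void matmul_4x4_neon(const float *a, const float *b, float *c)
{
    float32x4_t acc0 = vld1q_f32(&c[0]);
    float32x4_t acc1 = vld1q_f32(&c[4]);
    float32x4_t acc2 = vld1q_f32(&c[8]);
    float32x4_t acc3 = vld1q_f32(&c[12]);

    for (int k = 0; k < 4; k++) {
        float32x4_t va = vld1q_f32(&a[k * 4]); // column k of A
        float32x4_t vb = vld1q_f32(&b[k * 4]); // row k of B
        // Indexed fmla: lane i of va scales all of vb into C row i.
        acc0 = vfmaq_laneq_f32(acc0, vb, va, 0);
        acc1 = vfmaq_laneq_f32(acc1, vb, va, 1);
        acc2 = vfmaq_laneq_f32(acc2, vb, va, 2);
        acc3 = vfmaq_laneq_f32(acc3, vb, va, 3);
    }

    vst1q_f32(&c[0], acc0);
    vst1q_f32(&c[4], acc1);
    vst1q_f32(&c[8], acc2);
    vst1q_f32(&c[12], acc3);
}
```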
content/learning-paths/cross-platform/simd-loops/4-conclusion.md (+3 -1)
@@ -1,11 +1,13 @@
 ---
-title: Conclusion
+title: How to learn with SIMD Loops
 weight: 5
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
+## Bridging the gap between specs and real code
+
 SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.
 
 Whether you are moving from NEON or starting directly with SVE and SME, the project offers:
content/learning-paths/cross-platform/simd-loops/_index.md (+1 -1)
@@ -6,7 +6,7 @@ minutes_to_complete: 30
 who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.
 
 learning_objectives:
-    - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME).
+    - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
     - Explore how SVE indexed fmla and SME2 fmopa instructions accelerate matrix multiplication
     - Understand how SME2 kernels use ZA tiles and streaming attributes
     - Describe what SIMD Loops contains and how kernels are organized across scalar, NEON, SVE, SVE2, and SME2 variants