Commit c31d636
Tweaks
1 parent 9653244 commit c31d636

5 files changed: +28 additions, -26 deletions

content/learning-paths/cross-platform/simd-loops/1-about.md

Lines changed: 8 additions & 6 deletions
@@ -1,22 +1,24 @@
 ---
-title: About single instruction, multiple data (SIMD) loops
+title: About Single Instruction, Multiple Data loops
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

+## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+
 Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.

-This Learning Path uses the **Scalable Vector Extension (SVE)** and the **Scalable Matrix Extension (SME)** to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.

-The **SIMD Loops** project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
+## What is the SIMD Loops project?

-> Repo: [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops)
+The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.

-SIMD Loops is an open-source project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm architectures, specifically using SVE and SME. It is designed for programmers who already know their way around NEON intrinsics but are now facing the more powerful and complex world of SVE and SME.
+Visit the [SIMD Loops Repo](https://gitlab.arm.com/architecture/simd-loops).

-The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel, a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
+This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It’s aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel - a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.

 Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.

content/learning-paths/cross-platform/simd-loops/2-using.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ weight: 3
 layout: learningpathall
 ---

+## Set up your development environment
+
 To get started, clone the SIMD Loops project and change to the project directory:

 ```bash

content/learning-paths/cross-platform/simd-loops/3-example.md

Lines changed: 14 additions & 18 deletions
@@ -6,6 +6,8 @@ weight: 4
 layout: learningpathall
 ---

+## Overview: loop 202 matrix multiplication example
+
 To illustrate the structure and design principles of SIMD Loops, consider loop 202 as an example.

 Use a text editor to open `loops/loop_202.c`.
@@ -22,7 +24,7 @@ You can view matrix multiplication in two equivalent ways:
 - As the dot product between each row of `A` and each column of `B`
 - As the sum of outer products between the columns of `A` and the rows of `B`

-## Data structure
+## Data structure definition

 The loop begins by defining a data structure that captures the matrix dimensions (`M`, `K`, `N`) along with input and output buffers:

@@ -44,15 +46,15 @@ For this loop:

 This layout helps optimize memory access patterns across the targeted SIMD architectures.

-## Loop attributes
+## Loop attributes by architecture

 Loop attributes are specified per target architecture:
 - **SME targets** — `inner_loop_202` is invoked with the `__arm_streaming` attribute and uses a shared `ZA` register context (`__arm_inout("za")`). These attributes are wrapped in the `LOOP_ATTR` macro
 - **SVE or NEON targets** — no additional attributes are required

 This design enables portability across SIMD extensions.

-## Function implementation
+## Function implementation in loops/matmul_fp32.c

 `loops/matmul_fp32.c` provides several optimizations of matrix multiplication, including ACLE intrinsics and hand-optimized assembly.

@@ -208,33 +210,27 @@ A snippet of the loop is shown below:
 ```

 Within the SME2 intrinsics code (lines 91–106), the innermost loop iterates across
-the `K` dimensioncolumns of `A` and rows of `B`
+the `K` dimension - columns of `A` and rows of `B`.

 In each iteration:
 - Two consecutive vectors are loaded from `A` and two from `B` (`vec_a*`, `vec_b*`) using multi-vector load intrinsics
 - `fmopa`, wrapped by `MOPA_TILE`, computes the outer product
 - Partial results accumulate in four 32-bit `ZA` tiles

-After all `K` iterations, results are written back in a store loop (lines 111–124)
+After all `K` iterations, results are written back in a store loop (lines 111–124).

-During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). Vectors are then stored to the output buffer using SME multi-vector `st1w` stores using `STORE_PAIR`
+During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). Vectors are then stored to the output buffer using SME multi-vector `st1w` stores using `STORE_PAIR`.

-The equivalent SME2 hand-optimized assembly appears around lines 229–340
+The equivalent SME2 hand-optimized assembly appears around lines 229–340.

-For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/)
+For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/).

 ## Other optimizations

-Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features.
-
-### NEON
-
-The NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-
-### SVE2.1
+Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:

-The SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
+- **NEON**: the NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.

-### SME2.1
+- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.

-The SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
+- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.

content/learning-paths/cross-platform/simd-loops/4-conclusion.md

Lines changed: 3 additions & 1 deletion
@@ -1,11 +1,13 @@
 ---
-title: Conclusion
+title: How to learn with SIMD Loops
 weight: 5

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

+## Bridging the gap between specs and real code
+
 SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.

 Whether you are moving from NEON or starting directly with SVE and SME, the project offers:

content/learning-paths/cross-platform/simd-loops/_index.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ minutes_to_complete: 30
 who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.

 learning_objectives:
-- Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME).
+- Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
 - Explore how SVE indexed fmla and SME2 fmopa instructions accelerate matrix multiplication
 - Understand how SME2 kernels use ZA tiles and streaming attributes
 - Describe what SIMD Loops contains and how kernels are organized across scalar, NEON, SVE,SVE2, and SME2 variants

0 commit comments
