content/learning-paths/cross-platform/simd-loops/1-about.md (+8 -6)
@@ -1,22 +1,24 @@
 ---
-title: About single instruction, multiple data (SIMD) loops
+title: About Single Instruction, Multiple Data loops
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
+## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+
 Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
 
-This Learning Path uses the **Scalable Vector Extension (SVE)** and the **Scalable Matrix Extension (SME)** to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
 
-The **SIMD Loops** project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
+The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.
 
-SIMD Loops is an open-source project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm architectures, specifically using SVE and SME. It is designed for programmers who already know their way around NEON intrinsics but are now facing the more powerful and complex world of SVE and SME.
+Visit the [SIMD Loops Repo](https://gitlab.arm.com/architecture/simd-loops).
 
-The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel, a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
+This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It’s aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel - a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
 
 Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.
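
The predication and vector-length-agnostic (VLA) programming mentioned in the paragraphs above are easiest to see in a small kernel. Below is a minimal sketch using SVE ACLE intrinsics; it is illustrative only, not taken from the SIMD Loops repository, and the function name `scale_f32` is invented:

```c
#include <arm_sve.h>
#include <stdint.h>

// Multiply n floats by a scalar. The same binary runs unchanged on any
// SVE vector length: svcntw() reports the 32-bit lane count at run time,
// and the svwhilelt predicate disables lanes past n, so no scalar tail
// loop is needed.
void scale_f32(float *dst, const float *src, float s, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);      // predicate for lanes i..n-1
        svfloat32_t v = svld1_f32(pg, &src[i]); // predicated load
        v = svmul_n_f32_x(pg, v, s);            // predicated multiply
        svst1_f32(pg, &dst[i], v);              // predicated store
    }
}
```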
content/learning-paths/cross-platform/simd-loops/3-example.md (+14 -18)
@@ -6,6 +6,8 @@ weight: 4
 layout: learningpathall
 ---
 
+## Overview: loop 202 matrix multiplication example
+
 To illustrate the structure and design principles of SIMD Loops, consider loop 202 as an example.
 
 Use a text editor to open `loops/loop_202.c`.
@@ -22,7 +24,7 @@ You can view matrix multiplication in two equivalent ways:
 - As the dot product between each row of `A` and each column of `B`
 - As the sum of outer products between the columns of `A` and the rows of `B`
 
-## Data structure
+## Data structure definition
 
 The loop begins by defining a data structure that captures the matrix dimensions (`M`, `K`, `N`) along with input and output buffers:
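
The two equivalent views in the bullets above correspond to standard linear-algebra identities for `C = A × B`, with `A` of size `M×K` and `B` of size `K×N` (nothing project-specific):

```latex
% Dot-product view: each output element is a row-column dot product.
C_{ij} = \sum_{k=1}^{K} A_{ik}\,B_{kj}

% Outer-product view: C is a sum of K rank-1 updates, one per column of A
% paired with the matching row of B. This is the view that maps onto
% SME's fmopa (outer product and accumulate) instruction.
C = \sum_{k=1}^{K} A_{:,k}\,B_{k,:}
```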
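For orientation, a descriptor of the kind this sentence describes might look like the sketch below; the struct name, field names, and layouts are assumptions for illustration, not the project's actual definition (see `loops/loop_202.c` for that):

```c
#include <stdint.h>

// Hypothetical kernel-arguments struct: dimensions plus input/output
// buffers, so every variant of the loop shares one calling convention.
struct matmul_args {
    uint64_t m, k, n;   // matrix dimensions M, K, N
    const float *a;     // M x K input matrix (layout assumed)
    const float *b;     // K x N input matrix (layout assumed)
    float *c;           // M x N output matrix
};
```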
@@ -44,15 +46,15 @@ For this loop:
 
 This layout helps optimize memory access patterns across the targeted SIMD architectures.
 
-## Loop attributes
+## Loop attributes by architecture
 
 Loop attributes are specified per target architecture:
 - **SME targets** — `inner_loop_202` is invoked with the `__arm_streaming` attribute and uses a shared `ZA` register context (`__arm_inout("za")`). These attributes are wrapped in the `LOOP_ATTR` macro
 - **SVE or NEON targets** — no additional attributes are required
 
 This design enables portability across SIMD extensions.
 
-## Function implementation
+## Function implementation in loops/matmul_fp32.c
 
 `loops/matmul_fp32.c` provides several optimizations of matrix multiplication, including ACLE intrinsics and hand-optimized assembly.
 
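The attribute arrangement described in the SME bullet can be sketched as follows; the `LOOP_ATTR` macro name comes from the text above, but this exact expansion and the argument type are assumptions for illustration:

```c
// On SME targets the kernel runs in streaming mode and shares ZA state
// with its caller; on SVE/NEON targets the macro expands to nothing.
#if defined(__ARM_FEATURE_SME)
#define LOOP_ATTR __arm_streaming __arm_inout("za")
#else
#define LOOP_ATTR
#endif

struct loop_args; // dimensions and buffers (defined elsewhere)

void inner_loop_202(struct loop_args *args) LOOP_ATTR;
```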
@@ -208,33 +210,27 @@ A snippet of the loop is shown below:
 ```
 
 Within the SME2 intrinsics code (lines 91–106), the innermost loop iterates across
-the `K` dimension—columns of `A` and rows of `B`
+the `K` dimension - columns of `A` and rows of `B`.
 
 In each iteration:
 - Two consecutive vectors are loaded from `A` and two from `B` (`vec_a*`, `vec_b*`) using multi-vector load intrinsics
 - `fmopa`, wrapped by `MOPA_TILE`, computes the outer product
 - Partial results accumulate in four 32-bit `ZA` tiles
 
-After all `K` iterations, results are written back in a store loop (lines 111–124)
+After all `K` iterations, results are written back in a store loop (lines 111–124).
 
-During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). Vectors are then stored to the output buffer using SME multi-vector `st1w` stores using `STORE_PAIR`
+During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). Vectors are then stored to the output buffer using SME multi-vector `st1w` stores using `STORE_PAIR`.
 
-The equivalent SME2 hand-optimized assembly appears around lines 229–340
+The equivalent SME2 hand-optimized assembly appears around lines 229–340.
 
-For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/)
+For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/).
 
 ## Other optimizations
 
-Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features.
-
-### NEON
-
-The NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-
-### SVE2.1
+Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:
 
-The SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
+- **NEON**: the NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
 
-### SME2.1
+- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
 
-The SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
+- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
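
The accumulate-then-store pattern described above (outer products into `ZA`, then a horizontal-read store loop) can be reduced to a single-tile sketch. This is a simplified illustration, not loop 202 itself: it uses one `ZA` tile instead of four, plain rather than multi-vector loads, and an invented function name and data layout:

```c
#include <arm_sme.h>

// C tile (VL x VL floats) = sum over k of outer(A column k, B row k).
// The caller must already be in streaming mode with ZA available.
void matmul_one_tile(const float *a_cols, const float *b_rows,
                     float *c, uint64_t K, uint64_t ldc)
    __arm_streaming __arm_inout("za")
{
    svbool_t pg = svptrue_b32();
    uint64_t vl = svcntw();          // 32-bit lanes per vector

    svzero_za();                     // clear the ZA accumulators
    for (uint64_t k = 0; k < K; k++) {
        svfloat32_t va = svld1_f32(pg, &a_cols[k * vl]); // column k of A
        svfloat32_t vb = svld1_f32(pg, &b_rows[k * vl]); // row k of B
        svmopa_za32_f32_m(0, pg, pg, va, vb);            // fmopa into tile 0
    }

    // Store phase: read ZA tile 0 row by row into a Z vector, then store.
    for (uint32_t row = 0; row < (uint32_t)vl; row++) {
        svfloat32_t vc = svread_hor_za32_f32_m(svdup_f32(0.0f), pg, 0, row);
        svst1_f32(pg, &c[row * ldc], vc);
    }
}
```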
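The indexed `fmla` technique from the NEON bullet is also worth seeing concretely. Here is a generic 4x4 micro-kernel in that style (an illustrative sketch, not the code from `matmul_fp32.c`): each lane of a loaded `A` vector scales a whole row of `B` via `vfmaq_laneq_f32`, so four accumulators advance per pair of loads.

```c
#include <arm_neon.h>

// C (4x4, row-major) += A (4x4) * B (4x4, row-major), where column k of
// A is assumed contiguous at a[k * 4] (column-major A).
void matmul_4x4_neon(const float *a, const float *b, float *c)
{
    float32x4_t acc0 = vld1q_f32(&c[0]);
    float32x4_t acc1 = vld1q_f32(&c[4]);
    float32x4_t acc2 = vld1q_f32(&c[8]);
    float32x4_t acc3 = vld1q_f32(&c[12]);

    for (int k = 0; k < 4; k++) {
        float32x4_t va = vld1q_f32(&a[k * 4]); // column k of A
        float32x4_t vb = vld1q_f32(&b[k * 4]); // row k of B
        // Indexed fmla: lane i of va scales all of vb into C row i.
        acc0 = vfmaq_laneq_f32(acc0, vb, va, 0);
        acc1 = vfmaq_laneq_f32(acc1, vb, va, 1);
        acc2 = vfmaq_laneq_f32(acc2, vb, va, 2);
        acc3 = vfmaq_laneq_f32(acc3, vb, va, 3);
    }

    vst1q_f32(&c[0], acc0);
    vst1q_f32(&c[4], acc1);
    vst1q_f32(&c[8], acc2);
    vst1q_f32(&c[12], acc3);
}
```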
content/learning-paths/cross-platform/simd-loops/4-conclusion.md (+3 -1)
@@ -1,11 +1,13 @@
 ---
-title: Conclusion
+title: How to learn with SIMD Loops
 weight: 5
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
+## Bridging the gap between specs and real code
+
 SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.
 
 Whether you are moving from NEON or starting directly with SVE and SME, the project offers:
content/learning-paths/cross-platform/simd-loops/_index.md (+1 -1)
@@ -6,7 +6,7 @@ minutes_to_complete: 30
 who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.
 
 learning_objectives:
-    - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME).
+    - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
     - Explore how SVE indexed fmla and SME2 fmopa instructions accelerate matrix multiplication
     - Understand how SME2 kernels use ZA tiles and streaming attributes
     - Describe what SIMD Loops contains and how kernels are organized across scalar, NEON, SVE, SVE2, and SME2 variants