Merge pull request #2324 from jasonrandrews/review

jasonrandrews · web-flow · commit 88e4fa14629a · 2025-09-17T13:48:12.000-05:00
Final tech review of SIMD Loops
diff --git a/content/learning-paths/cross-platform/simd-loops/1-about.md b/content/learning-paths/cross-platform/simd-loops/1-about.md
@@ -17,7 +17,7 @@ extensions introduce new instructions, more flexible programming models, and
 support for concepts like predication, scalable vectors, and streaming modes.
 However, they also come with a learning curve.
 
-That is where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) becomes a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
+[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
 
 SIMD Loops is designed to help
 you learn how to write SVE and SME code. It is a collection
diff --git a/content/learning-paths/cross-platform/simd-loops/2-using.md b/content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -13,15 +13,46 @@ git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
 cd simd-loops.git
 ```
 
+Confirm you are using an Arm machine by running:
+
+```bash
+uname -m
+```
+
+The output on Linux should be:
+
+```output
+aarch64
+```
+
+On macOS, the output should be:
+
+```output
+arm64
+```
+
 ## SIMD Loops structure
 
-In the SIMD Loops project, the
-source code for the loops is organized under the loops directory. The complete
-list of loops is documented in the loops.inc file, which includes a brief
+In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete
+list of loops is documented in the `loops.inc` file, which includes a brief
 description and the purpose of each loop. Every loop is associated with a
 uniquely named source file following the naming pattern `loop_<NNN>.c`, where
 `<NNN>`  represents the loop number.
 
+A subset of the `loops.inc` file is below:
+
+```output
+LOOP(001, "FP32 inner product",                "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(002, "UINT32 inner product",              "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(003, "FP64 inner product",                "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(004, "UINT64 inner product",              "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(005, "strlen short strings",              "Use of FF and NF loads instructions")
+LOOP(006, "strlen long strings",               "Use of FF and NF loads instructions")
+LOOP(008, "Precise fp64 add reduction",        "Use of FADDA instructions")
+LOOP(009, "Pointer chasing",                   "Use of CTERM and BRK instructions")
+LOOP(010, "Conditional reduction (fp)",        "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
+```
+
 A loop is structured as follows:
 
 ```C
@@ -55,23 +86,79 @@ void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
 
 Each loop is implemented in several SIMD extension variants, and conditional
 compilation is used to select one of the optimizations for the
-`inner_loop_<NNN>` function. The native C implementation is written first, and
-it can be generated either when building natively (HAVE_NATIVE) or through
-compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
-SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
+`inner_loop_<NNN>` function. 
+
+The native C implementation is written first, and
+it can be generated either when building natively with `-DHAVE_NATIVE` or through
+compiler auto-vectorization with `-DHAVE_AUTOVEC`.
+
+When SIMD ACLE is supported (SME, SVE, or NEON), 
+the code is compiled using high-level intrinsics. If ACLE
 support is not available, the build process falls back to handwritten inline
 assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
-SVE2.1, SVE2, and others. The overall code structure also includes setup and
+SVE2.1, SVE2, and others. 
+
+The overall code structure also includes setup and
 cleanup code in the main function, where memory buffers are allocated, the
 selected loop kernel is executed, and results are verified for correctness.
 
 At compile time, you can select which loop optimization to compile, whether it
 is based on SME or SVE intrinsics, or one of the available inline assembly
-variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
-sme_intrinsics` ...).
-
-As the result of the build, two types of binaries are generated. The first is a
-single executable named `simd_loops`, which includes all the loop
-implementations. A specific loop can be selected by passing parameters to the
-program (e.g., `simd_loops -k <NNN> -n <iterations>`). The second type consists
-of individual standalone binaries, each corresponding to a specific loop.
+variants.
+
+```console
+make
+```
+
+With no target specified, the list of targets is printed:
+
+```output
+all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
+```
+
+You can build all loops for all targets using:
+
+```console
+make all
+```
+
+You can build all loops for a single target, such as NEON, using:
+
+```console
+make neon
+```
+
+As a result of the build, two types of binaries are generated.
+
+The first is a single executable named `simd_loops`, which includes all the loop implementations. 
+
+A specific loop can be selected by passing parameters to the
+program.
+
+For example, to run loop 1 for 5 iterations using the NEON target: 
+
+```console
+build/neon/bin/simd_loops -k 1 -n 5
+```
+
+The output is:
+
+```output
+Loop 001 - FP32 inner product
+ - Purpose: Use of fp32 MLA instruction
+ - Checksum correct.
+```
+
+The second type consists of individual standalone binaries, one for each loop.
+
+To run loop 1 as a standalone binary:
+
+```console
+build/neon/standalone/bin/loop_001.elf
+```
+
+The output is:
+
+```output
+ - Checksum correct.
+```
diff --git a/content/learning-paths/cross-platform/simd-loops/3-example.md b/content/learning-paths/cross-platform/simd-loops/3-example.md
@@ -6,12 +6,16 @@ weight: 5
 layout: learningpathall
 ---
 
-To illustrate the structure and design principles of simd-loops, consider loop
-202 as an example. `inner_loop_202` is defined at lines 69-79 in file
+To illustrate the structure and design principles of SIMD Loops, consider loop
+202 as an example. 
+
+Use a text editor to look at the file `loops/loop_202.c`.
+
+The function `inner_loop_202()` is defined at lines 60-70 in file
 `loops/loop_202.c` and calls the `matmul_fp32` routine defined in
 `matmul_fp32.c`.
 
-Open `loops/matmul_fp32.c`.
+Use a text editor to look at the file `loops/matmul_fp32.c`.
 
 This loop implements a single precision floating point matrix multiplication of
 the form:
@@ -39,10 +43,10 @@ struct loop_202_data {
 ```
 
 For this loop:
-- The first input matrix (A) is stored in column-major format in memory.
+- The first input matrix (a) is stored in column-major format in memory.
 - The second input matrix (b) is stored in row-major format in memory.
-- None of the memory area designated by `a`, `b` anf `c` alias (i.e. they
-  overlap in some way) --- as indicated by the `restrict` keyword.
+- None of the memory areas designated by `a`, `b`, and `c` alias (that is,
+  overlap in any way), as indicated by the `restrict` keyword.
 
 This layout choice helps optimize memory access patterns for all the targeted
 SIMD architectures.
@@ -59,7 +63,7 @@ This design enables portability across different SIMD extensions.
 
 ## Function implementation
 
-The `matmul_fp32` function from file `loops/matmul_fp32.c` provides several
+The `matmul_fp32()` function from file `loops/matmul_fp32.c` provides several
 optimizations of the single-precision floating-point matrix multiplication,
 including the ACLE intrinsics-based code, and the assembly hand-optimized code.
 
diff --git a/content/learning-paths/cross-platform/simd-loops/4-conclusion.md b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md
@@ -7,20 +7,22 @@ layout: learningpathall
 ---
 
 SIMD Loops is an invaluable
-resource for developers looking to learn or master the intricacies of SVE and
-SME on modern Arm architectures. By providing practical, hands-on examples, it
+resource for developers looking to learn the intricacies of SVE and
+SME on a variety of Arm architectures. By providing practical, hands-on examples, it
 bridges the gap between the architecture specification and real-world
-application. Whether you're transitioning from NEON or starting fresh with SVE
+application. 
+
+Whether you're transitioning from NEON or starting fresh with SVE
 and SME, SIMD Loops offers a comprehensive toolkit to enhance your understanding
 and proficiency.
 
 With its extensive collection of loop kernels, detailed documentation, and
-flexible build options, SIMD Loops empowers you to explore
+flexible build options, SIMD Loops helps you explore
 and leverage the full potential of Arm's advanced vector extensions. Dive into
 the project, experiment with the examples, and take your high-performance coding
 skills for Arm to the next level.
 
 For more information and to get started, visit the GitLab project and refer
 to the
 [README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md)
-for instructions on building and running the code. 
+for the latest instructions on building and running the code. 
diff --git a/content/learning-paths/cross-platform/simd-loops/_index.md b/content/learning-paths/cross-platform/simd-loops/_index.md
@@ -31,7 +31,6 @@ operatingsystems:
 tools_software_languages:
     - GCC
     - Clang
-    - FVP
 
 shared_path: true
 shared_between: