There are many different ways that you can extend and optimize the matrix multiplication algorithm beyond the specific SME2 implementation that you've explored in this Learning Path. While the current approach is tuned for performance on a specific hardware target, further improvements can make your code more general, more efficient, and better suited to a wider range of applications.
Some ideas for improvements that you might like to test out include:
## Generalize the algorithm for different data types
So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well.
The core structure of the algorithm remains consistent across data types: preprocessing with tiling, followed by outer product-based multiplication and accumulation. To adapt it for other data types, you only need to change how values are:
* Loaded from memory
* Accumulated (often with widening)
Languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates, make this adaptation easier to express.
Templates allow you to:
* Swap data types flexibly
* Handle accumulation in a wider format when needed
* Reuse algorithm logic across multiple matrix types
By expressing the algorithm generically, you benefit from the compiler generating multiple optimized variants, allowing you to focus on:
- Efficient algorithm design
- Testing and verification
```C
svld1_x2(...); // Load two vectors at once
```
Loading two vectors at a time enables more tiles to be computed simultaneously. Since the matrices are already laid out efficiently in memory, consecutive loads are fast. Implementing this approach improves the ``macc``-to-``load`` ratio.
In order to check your understanding of SME2, you can try to implement this unrolling yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values.
## Optimize for special matrix shapes
One method for optimization is to use strategies that are flexible depending on the matrices' dimensions. This is especially easy to set up when working in C or C++, rather than directly in assembly language.
By exploiting the mathematical properties of matrix multiplication and the outer product, you can minimize data movement as well as reduce the overall number of operations to perform.
For example, it is common that one of the matrices is actually a vector, meaning that it has a single row or column; in that case, it becomes advantageous to transpose it. Can you see why?
The answer is that, because the elements are stored contiguously in memory, ``Nx1`` and ``1xN`` matrices have exactly the same memory layout. The transposition becomes a no-op, and the matrix elements stay in the same place in memory.
An even more *degenerate* case that is easy to handle is when one of the matrices is essentially a scalar, that is, a matrix with one row and one column.
Although the current code handles this case correctly from a results point of view, a different algorithm and choice of instructions might be more efficient. Can you think of another way?
Both functions are annotated with the `__arm_streaming` and `__arm_inout("za")` attributes. These indicate that the functions expect streaming mode to be active and do not need to save or restore the ZA storage.
These two functions are stitched together in `matmul_asm.c` with the same prototype as the reference implementation of matrix multiplication, so that a top-level `matmul_asm` can be called from the `main` function:
You can see that `matmul_asm` is annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage the ZA state on entry and return.
## How it integrates with the main function
The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.
## Execution modes
On a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (or intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
```C { line_numbers="true" }
#ifndef __ARM_FEATURE_SME2
...

int main(int argc, char **argv) {
  ...
  float *matResult_ref = (float *)malloc(M * N * sizeof(float));

  // Initialize matrices. Input matrices are initialized with random values in
  // non-debug mode. In debug mode, all matrices are initialized with linear
  // or known values for easier debugging.
#ifdef DEBUG
  initialize_matrix(matLeft, M * K, LINEAR_INIT);
  initialize_matrix(matRight, K * N, LINEAR_INIT);
  ...
}
```
The same `main.c` file is used for the assembly and intrinsic-based versions of the matrix multiplication. It first sets the `M`, `K`, and `N` parameters, either to the arguments supplied on the command line (lines 93-95) or to the default values (lines 73-75). In non-baremetal mode, it also accepts an iteration count `I` as its first parameter (lines 82-89 and 98-108), used for benchmarking.
Depending on the `M`, `K`, `N` dimension parameters, `main` allocates memory for all the matrices and initializes `matLeft` and `matRight` with random data. The actual matrix multiplication implementation is provided through the `IMPL` macro.
In *verification mode*, it then runs the matrix multiplication from `IMPL` (line 167) and computes the reference values for the preprocessed matrix as well as the result matrix (lines 170 and 171). It then compares the actual values to the reference values and reports errors, if there are any (lines 173-177). Finally, all the memory is deallocated (lines 236-243) before exiting the program with a success or failure return code at line 245.
In *benchmarking mode*, it first runs the matrix multiplication (the vanilla reference, and likewise the assembly- or intrinsic-based version) 10 times without measuring elapsed time, to warm up the CPU. It then measures the elapsed execution time of each implementation over `I` runs, and computes and reports the minimum, maximum, and average execution times.
{{% notice Note %}}
Benchmarking and profiling are not simple tasks. The purpose of this Learning Path is to provide some basic guidelines on the performance improvement that can be obtained with SME2.
{{% /notice %}}
### Compile and run it
The program reports whether the preprocessing and matrix multiplication passed (`PASS`) or failed (`FAILED`) the comparison with the vanilla reference implementation.
{{% notice Tip %}}
The example above uses the default values for the `M` (125), `K` (70), and `N` (70) parameters. You can override this and provide your own values on the command line:
{{< tabpane code=true >}}
{{< /tab >}}
{{< /tabpane >}}
In this example, the values `M=7`, `K=8`, and `N=9` are used instead.
---
title: Matrix multiplication using SME2 intrinsics in C
weight: 9
### FIXED, DO NOT MODIFY
layout: learningpathall
---
In this section, you will write an SME2-optimized matrix multiplication routine in C using the intrinsics that the compiler provides.
## What are intrinsics?
*Intrinsics*, also known as *compiler intrinsics* or *intrinsic functions*, are functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to translate the function to a specific instruction, to perform specific optimizations, or both.
You can learn more about intrinsics in this [Wikipedia article on intrinsic functions](https://en.wikipedia.org/wiki/Intrinsic_function).
Using intrinsics allows the programmer to use the specific instructions required to achieve the required performance, while writing all the typically-required standard code, such as loops, in C. This produces performance close to what can be reached with hand-written assembly, whilst being significantly more maintainable and portable.
All Arm-specific intrinsics are specified in the [ACLE](https://github.com/ARM-software/acle), the Arm C Language Extensions.
Note the `__arm_new("za")` and `__arm_locally_streaming` attributes at line 1, which make the compiler save the ZA storage so that it can be used without destroying its content if it was still in use by one of the callers.
`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library.
```
Reference implementation: min time = 101 us, max time = 438 us, avg time = 139.4 us
SME2 implementation *intr*: min time = 1 us, max time = 8 us, avg time = 1.82 us
```
The execution time is reported in microseconds. A wide spread between the minimum and maximum figures can be noted; this is expected, as the benchmarking method is deliberately kept simple. You will, however, note that the intrinsic version of the matrix multiplication brings, on average, a 76x execution-time reduction.
{{% notice Tip %}}
You can override the default values for `M` (125), `K` (25), and `N` (70) and provide your own values on the command line.
{{% /notice %}}
Tracing is disabled by default because it significantly slows down simulation.
## Use debug mode for matrix inspection
It can be helpful when debugging to understand where an element in the tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no longer random, but instead initializes each element with its linear index. This makes it *easier* to find where the matrix elements are loaded in the tile in a Tarmac trace, for example.