Commit 0bc079a

Merge branch 'ArmDeveloperEcosystem:main' into main

2 parents 4f758a9 + a34f4da commit 0bc079a

16 files changed: +1560 −995 lines

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md

Lines changed: 204 additions & 117 deletions
Large diffs are not rendered by default.

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
---
title: Going further
weight: 12

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, you will learn about the many different optimizations that are
available to you.

## Generalize the algorithms

In this Learning Path, you focused on using SME2 for matrix multiplication with
floating-point numbers. In practice, however, any library or framework that
supports matrix multiplication should also handle various integer types.

You can see that the structure of the algorithms for matrix preprocessing and
for multiplication with the outer product does not change at all for other data
types; only the element types and the corresponding instructions need to be
adapted.

This is well suited to languages with [generic
programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++
with templates. You can even make the template handle the case where the value
accumulated during the product uses a larger type than the input matrices; SME2
has instructions to deal efficiently with this common scenario.

This enables the library developer to focus on the algorithm, testing, and
optimizations, while letting the compiler generate the multiple variants.

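The template idea above can be sketched in standard C++. This is an illustrative scalar sketch, not code from the Learning Path sources: the function name and structure are hypothetical, and the point is only how one template can serve several element/accumulator type pairs.

```cpp
#include <cstdint>

// Hypothetical sketch: a generic matrix multiply where the accumulator type
// AccT may be wider than the element type ElemT, mirroring SME2's widening
// outer-product instructions (for example, int8_t inputs accumulated in
// int32_t). The compiler generates one variant per instantiation.
template <typename ElemT, typename AccT>
void matmul_generic(uint64_t M, uint64_t K, uint64_t N,
                    const ElemT *matLeft, const ElemT *matRight,
                    AccT *matResult) {
    for (uint64_t m = 0; m < M; m++) {
        for (uint64_t n = 0; n < N; n++) {
            AccT acc = 0; // accumulate in the wider type
            for (uint64_t k = 0; k < K; k++)
                acc += AccT(matLeft[m * K + k]) * AccT(matRight[k * N + n]);
            matResult[m * N + n] = acc;
        }
    }
}
```

Instantiating `matmul_generic<int8_t, int32_t>` or `matmul_generic<float, float>` then gives the per-type variants from a single source.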
## Unroll further

You might have noticed that ``matmul_intr_impl`` computes only one tile at a
time, for the sake of simplicity.

SME2 does support multi-vector instructions, and some were used in
``preprocess_l_intr``, for example, ``svld1_x2``.

Loading two vectors at a time enables more tiles to be computed simultaneously,
and as the input matrices have been laid out neatly in memory, the consecutive
loading of the data is efficient. Implementing this approach improves the
multiply-accumulate (``macc``) to load ratio.

To check your understanding of SME2, you can try to implement this unrolling
yourself in the intrinsics version (the asm version already has this
optimization). You can check your work by comparing your results to the
expected reference values.

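The effect of unrolling on the ``macc``-to-load ratio can be illustrated with a plain scalar sketch. This is not SME2 code and the names are hypothetical; it only shows why computing two outputs per iteration reuses each loaded value.

```cpp
#include <cstddef>

// Hypothetical scalar analogue of the unrolling idea: a matrix-vector product
// that processes two output rows per iteration. Each loaded b[k] now feeds two
// multiply-accumulates instead of one, improving the macc-to-load ratio, just
// as computing more tiles per loop does with SME2 multi-vector loads.
void matvec_unrolled2(size_t M, size_t K,
                      const float *a, const float *b, float *c) {
    size_t m = 0;
    for (; m + 1 < M; m += 2) {          // two rows at a time
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t k = 0; k < K; k++) {
            float bk = b[k];             // one load ...
            acc0 += a[m * K + k] * bk;   // ... feeds two maccs
            acc1 += a[(m + 1) * K + k] * bk;
        }
        c[m] = acc0;
        c[m + 1] = acc1;
    }
    for (; m < M; m++) {                 // tail row if M is odd
        float acc = 0.0f;
        for (size_t k = 0; k < K; k++)
            acc += a[m * K + k] * b[k];
        c[m] = acc;
    }
}
```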
## Apply strategies

One optimization method is to use strategies that adapt to the matrices'
dimensions. This is especially easy to set up when working in C or C++, rather
than directly in assembly language.

By exploiting the mathematical properties of matrix multiplication and the
outer product, it is possible to minimize data movement as well as reduce the
overall number of operations to perform.

For example, it is common for one of the matrices to actually be a vector,
meaning that it has a single row or column, and then it becomes advantageous to
transpose it. Can you see why?

The answer is that, as the elements are stored contiguously in memory, ``Nx1``
and ``1xN`` matrices have the exact same memory layout. The transposition
becomes a no-op, and the matrix elements stay in the same place in memory.
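The layout claim can be checked directly in a few lines of standard C++ (illustrative only; the function name is not from the Learning Path):

```cpp
#include <cstring>

// In row-major order, an Nx1 column vector and a 1xN row vector occupy
// identical memory: same values, same order, same size. "Transposing" one
// into the other is therefore a no-op.
bool transpose_is_noop_for_vectors() {
    float column[4] = {1.0f, 2.0f, 3.0f, 4.0f}; // 4x1: four rows of one element
    float row[4]    = {1.0f, 2.0f, 3.0f, 4.0f}; // 1x4: one row of four elements
    return std::memcmp(column, row, sizeof(column)) == 0;
}
```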

An even more *degenerate* case that is easy to handle is when one of the
matrices is essentially a scalar, which means that it is a matrix with one row
and one column.

Although our current code handles it correctly from a results point of view, a
different algorithm and choice of instructions might be more efficient. Can you
think of another way?
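One possible answer, sketched with a hypothetical helper name (this is not the Learning Path's code): when one operand is 1x1, the multiplication degenerates to scaling every element of the other matrix, with no accumulation and no outer products at all.

```cpp
#include <cstddef>

// Hypothetical sketch: multiplying by a 1x1 matrix (a scalar s) reduces to
// one multiply per element over a single contiguous pass, which maps well to
// simple vectorized scaling rather than outer-product accumulation.
void matmul_by_scalar(size_t rows, size_t cols, float s,
                      const float *mat, float *result) {
    for (size_t i = 0; i < rows * cols; i++)
        result[i] = s * mat[i];
}
```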

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md

Lines changed: 185 additions & 107 deletions
Large diffs are not rendered by default.

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: Streaming mode
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In real-world, large-scale software, a program moves back and forth between
streaming and non-streaming mode, and some streaming mode routines call other
streaming mode routines, which means that some state, including the ZA storage,
needs to be saved and restored. This is defined in the ACLE and supported by
the compiler: the programmer *just* has to annotate the functions with some
keywords and let the compiler automatically perform the low-level tasks of
managing the streaming mode. This frees the developer from a tedious and
error-prone task. See [Introduction to streaming and non-streaming
mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode)
for further information. The rest of this section references information from
the ACLE.

## About streaming mode

The AArch64 architecture defines a concept called *streaming mode*, controlled
by a processor state bit called `PSTATE.SM`. At any given point in time, the
processor is either in streaming mode (`PSTATE.SM==1`) or in non-streaming mode
(`PSTATE.SM==0`). There is an instruction called `SMSTART` to enter streaming
mode and an instruction called `SMSTOP` to return to non-streaming mode.

Streaming mode has three main effects on C and C++ code:

- It can change the length of SVE vectors and predicates: the length of an SVE
  vector in streaming mode is called the “streaming vector length” (SVL), which
  might be different from the normal non-streaming vector length. See
  [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl)
  for more details.
- Some instructions can only be executed in streaming mode, which means that
  their associated ACLE intrinsics can only be used in streaming mode. These
  intrinsics are called “streaming intrinsics”.
- Some other instructions can only be executed in non-streaming mode, which
  means that their associated ACLE intrinsics can only be used in non-streaming
  mode. These intrinsics are called “non-streaming intrinsics”.

The C and C++ standards define the behavior of programs in terms of an
*abstract machine*. As an extension, the ACLE specification applies the
distinction between streaming mode and non-streaming mode to this abstract
machine: at any given point in time, the abstract machine is either in
streaming mode or in non-streaming mode.

This distinction between processor mode and abstract machine mode is mostly
just a specification detail. However, the usual “as if” rule applies: the
processor's actual mode at runtime can be different from the abstract
machine's mode, provided that this does not alter the behavior of the program.
One practical consequence of this is that C and C++ code does not specify the
exact placement of `SMSTART` and `SMSTOP` instructions; the source code simply
places limits on where such instructions go. For example, when stepping
through a program in a debugger, the processor mode might sometimes be
different from the one implied by the source code.

ACLE provides attributes that specify whether the abstract machine executes statements:

- In non-streaming mode, in which case they are called *non-streaming statements*.
- In streaming mode, in which case they are called *streaming statements*.
- In either mode, in which case they are called *streaming-compatible statements*.
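As an illustration of the annotation style the text describes, function declarations can carry ACLE keywords that fix which of these modes their statements execute in. This is a sketch, not code from the Learning Path: the function names are hypothetical, and compiling such declarations requires an SME-aware toolchain (for example, a recent Clang targeting AArch64 with SME enabled).

```cpp
// Hypothetical declarations using ACLE streaming-mode keywords.

// Its statements are streaming statements; the compiler arranges for the
// processor to be in streaming mode when the body runs.
void process_tile() __arm_streaming;

// Its statements are streaming-compatible statements; the function can be
// called from either mode without a mode switch.
void copy_results() __arm_streaming_compatible;
```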

SME provides an area of storage called ZA, of size `SVL.B` x `SVL.B` bytes. It
also provides a processor state bit called `PSTATE.ZA` to control whether ZA
is enabled.

In C and C++ code, access to ZA is controlled at function granularity: a
function either uses ZA or it does not. Another way to say this is that a
function either “has ZA state” or it does not.

If a function does have ZA state, the function can either share that ZA state
with its caller or create new ZA state “from scratch”. In the latter case, it
is the compiler's responsibility to free up ZA so that the function can use
it; see the description of the lazy saving scheme in
[AAPCS64](https://arm-software.github.io/acle/main/acle.html#AAPCS64) for
details about how the compiler does this.

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-outer-product.md

Lines changed: 0 additions & 108 deletions
This file was deleted.
@@ -1,16 +1,14 @@
---
title: Vanilla matrix multiplication
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, you will learn about an example of standard matrix multiplication in C.

## Vanilla matrix multiplication algorithm

The vanilla matrix multiplication operation takes two input matrices, A [Ar
rows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C
@@ -22,7 +20,6 @@ element in the B column then summing all these products, as Figure 2 shows.

This implies that the A, B, and C matrices have some constraints on their
dimensions:
- A's number of columns must match B's number of rows: Ac == Br.
- C has the dimensions Cr == Ar and Cc == Bc.

@@ -31,22 +28,21 @@ properties and use, by reading this [Wikipedia
article on Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication).

In this Learning Path, you will see the following variable names:
- `matLeft` corresponds to the left-hand side argument of the matrix
  multiplication.
- `matRight` corresponds to the right-hand side of the matrix multiplication.
- `M` is `matLeft`'s number of rows.
- `K` is `matLeft`'s number of columns (and `matRight`'s number of rows).
- `N` is `matRight`'s number of columns.
- `matResult` corresponds to the result of the matrix multiplication, with
  `M` rows and `N` columns.

## C implementation

A literal implementation of the textbook matrix multiplication algorithm, as
described above, can be found in file `matmul_vanilla.c`:

```C { line_numbers="true" }
void matmul(uint64_t M, uint64_t K, uint64_t N,
            const float *restrict matLeft, const float *restrict matRight,
            float *restrict matResult) {
@@ -65,16 +61,16 @@ void matmul(uint64_t M, uint64_t K, uint64_t N,
```

In this Learning Path, the matrices are laid out in memory as contiguous
sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order).
The `matmul` function performs the algorithm described above.

The pointers to `matLeft`, `matRight`, and `matResult` have been annotated
as `restrict`, which informs the compiler that the memory areas designated by
those pointers do not alias. This means that they do not overlap in any way, so
the compiler does not need to insert extra instructions to deal with such
cases. The pointers to `matLeft` and `matRight` are marked as `const`, as
neither of these two matrices is modified by `matmul`.

You now have a reference standard matrix multiplication function. You will use
it later on in this Learning Path to ensure that the assembly version and the
intrinsics version of the multiplication algorithm do not contain errors.
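The diff above elides the loop nest of `matmul`, so here is a hedged reconstruction of the textbook triple loop the text describes. It is illustrative only and may differ in detail from the actual `matmul_vanilla.c`; the function name is changed to make that explicit.

```cpp
#include <cstdint>

// Illustrative reconstruction of the textbook algorithm: each output element
// C[m][n] is the sum over k of A[m][k] * B[k][n], with all matrices stored
// contiguously in row-major order (element (r, c) of an R x C matrix is at
// index r * C + c).
void matmul_reference(uint64_t M, uint64_t K, uint64_t N,
                      const float *matLeft, const float *matRight,
                      float *matResult) {
    for (uint64_t m = 0; m < M; m++) {
        for (uint64_t n = 0; n < N; n++) {
            float acc = 0.0f;
            for (uint64_t k = 0; k < K; k++)
                acc += matLeft[m * K + k] * matRight[k * N + n];
            matResult[m * N + n] = acc;
        }
    }
}
```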
