content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md (+1 -1)
@@ -71,7 +71,7 @@ Amongst other files, it includes:
 {{% notice Note %}}
 From this point, all instructions assume that your current directory is
-``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. So to follow along, ensure that you are in the correct directory before proceeding.
+``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``, so ensure that you are in the correct directory before proceeding.
content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md (+3 -3)
@@ -47,7 +47,7 @@ The following variable names are used throughout the Learning Path to represent
 ## C implementation
 
-The file matmul_vanilla.c contains a reference implementation of the algorithm:
+Here is the full reference implementation from `matmul_vanilla.c`:
 
 ```C { line_numbers="true" }
 void matmul(uint64_t M, uint64_t K, uint64_t N,
@@ -69,8 +69,8 @@ void matmul(uint64_t M, uint64_t K, uint64_t N,
 ## Memory layout and pointer annotations
 
-In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above.
+In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above.
 
 The pointers to `matLeft`, `matRight` and `matResult` have been annotated as `restrict`, which informs the compiler that the memory areas designated by those pointers do not alias. This means that they do not overlap in any way, so that the compiler does not need to insert extra instructions to deal with these cases. The pointers to `matLeft` and `matRight` are marked as `const` as neither of these two matrices is modified by `matmul`.
 
-You now have a working baseline for the matrix multiplication function. You'll use it later on in this Learning Path to ensure that the assembly version and the intrinsics version of the multiplication algorithm do not contain errors.
+This function gives you a working baseline for matrix multiplication. You'll use it later in the Learning Path to verify the correctness of optimized implementations using SME2 intrinsics and assembly.
content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md (+15 -14)
@@ -6,8 +6,12 @@ weight: 7
 layout: learningpathall
 ---
 
+## Overview
+
 In this section, you'll learn how to improve matrix multiplication performance using the SME engine and outer product operations.
 
+This approach increases the number of multiply-accumulate (MACC) operations per memory load, reducing bandwidth pressure and improving overall throughput.
+
 ## Increase MACC efficiency using outer products
 
 In the vanilla implementation, the core multiply-accumulate step looks like this:
@@ -16,14 +20,12 @@ In the vanilla implementation, the core multiply-accumulate step looks like this
 acc += matLeft[m * K + k] * matRight[k * N + n];
 ```
 
-This translates to one multiply-accumulate operation, known as `macc`, for two
-loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2
-ratio of multiply-accumulate operations (MACCs) to memory loads - one multiply-accumulate and two loads per iteration. This ratio limits efficiency, especially in triple-nested loops where memory bandwidth becomes a bottleneck.
+This translates to one multiply-accumulate operation, known as `macc`, for two loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2 ratio of multiply-accumulate operations (MACCs) to memory loads - one `macc` for every two loads per iteration - which is inefficient. This becomes more pronounced in triple-nested loops and when matrices exceed cache capacity.
 
-To make matters worse, large matrices might not fit in cache. To improve matrix multiplication efficiency, the goal is to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load - you can express matrix multiplication as a sum of column-by-row outer products.
+To improve performance, you want to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load. You can do this by expressing matrix multiplication as a sum of column-by-row outer products.
 
-Figure 3 below illustrates how the matrix multiplication of `matLeft` (3 rows, 2
-columns) by `matRight` (2 rows, 3 columns) can be decomposed as the sum of outer
+The diagram below illustrates how the matrix multiplication of `matLeft` (3 rows, 2
+columns) by `matRight` (2 rows, 3 columns) can be decomposed into a sum of column-by-row outer
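The decomposition this hunk describes can be sketched in plain C. This is an illustrative sketch (an assumption, not code from the Learning Path): each iteration over `k` loads one column of `matLeft` and one row of `matRight`, then performs a full rank-1 update, so M + N loads feed M * N multiply-accumulates:

```c
#include <stdint.h>

/* Matrix multiplication as a sum of K column-by-row outer products.
 * matLeft_t holds matLeft column-major (column k is contiguous at
 * matLeft_t + k * M); matRight and matResult are row-major. */
void matmul_outer(uint64_t M, uint64_t K, uint64_t N,
                  const float *restrict matLeft_t,
                  const float *restrict matRight,
                  float *restrict matResult) {
    for (uint64_t i = 0; i < M * N; i++)
        matResult[i] = 0.0f;
    for (uint64_t k = 0; k < K; k++) {
        const float *col = &matLeft_t[k * M]; /* column k of matLeft  */
        const float *row = &matRight[k * N];  /* row k of matRight    */
        /* Rank-1 update: M + N loads feed M * N maccs, a far better
         * macc-to-load ratio than the vanilla 1:2. */
        for (uint64_t m = 0; m < M; m++)
            for (uint64_t n = 0; n < N; n++)
                matResult[m * N + n] += col[m] * row[n];
    }
}
```

This is the scalar analogue of what an SME outer-product instruction does in a single operation on whole vectors.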
@@ -45,17 +47,16 @@ row-major order to column-major order.
45
47
This transformation affects only the memory layout. From a mathematical perspective, `matLeft` is not transposed. It is reorganized for better data locality.
46
48
{{% /notice %}}
47
49
48
-
### Transposition in the real world
50
+
### Transposition in practice
49
51
50
-
Just as trees don't reach the sky, the SME engine has physical implementation limits. It operates on *tiles* - 2D blocks of data stored in the ZA storage. SME has dedicated instructions to load, store, and compute on these tiles efficiently.
instruction takes two vectors as inputs and accumulates all the outer products
54
-
into a 2D tile. The tile in ZA storage allows SME to increase the `macc` to
55
-
`load` ratio by loading all the tile elements to be used with the SME outer
52
+
The SME engine operates on tiles - 2D blocks of data stored in the ZA storage. SME provides dedicated instructions to load, store, and compute on tiles efficiently.
53
+
54
+
For example, the [FMOPA](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as input and accumulates their outer product into a tile. The tile in ZA storage allows SME to increase the `macc` to`load` ratio by loading all the tile elements to be used with the SME outer
56
55
product instructions.
57
56
58
-
But since ZA storage is finite, you need to you need to preprocess `matLeft` to fit tile dimensions - this includes transposing portions of the matrix and padding where needed.
57
+
But since ZA storage is finite, you need to you need to preprocess `matLeft` to match the tile dimensions - this includes transposing portions of the matrix and padding where needed.
58
+
59
+
### Preprocessing with preprocess_l
59
60
60
61
The following function shows how `preprocess_l` transforms the matrix at the algorithmic level:
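The diff cuts off before the body of `preprocess_l`, but the kind of transformation it describes can be sketched. This is a hypothetical illustration (the function name `preprocess_left`, the `tile_height` parameter, and the padding scheme are assumptions, not the Learning Path's actual code): it copies a row-major matrix into a column-major buffer padded with zeros up to a tile boundary, so outer products over the padded region contribute nothing:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical preprocessing sketch: repack a row-major M x K matrix
 * into a column-major buffer whose row count is rounded up to a
 * multiple of tile_height. Padding is zero-filled so accumulating
 * outer products over padded elements is harmless. */
void preprocess_left(uint64_t M, uint64_t K, uint64_t tile_height,
                     const float *restrict src, /* M x K, row-major */
                     float *restrict dst) {     /* K columns of padded_M elements */
    uint64_t padded_M = (M + tile_height - 1) / tile_height * tile_height;
    memset(dst, 0, K * padded_M * sizeof(float)); /* zero the padding */
    for (uint64_t m = 0; m < M; m++)
        for (uint64_t k = 0; k < K; k++)
            dst[k * padded_M + m] = src[m * K + k]; /* transpose copy */
}
```

After this step, each column of the original matrix is a contiguous, tile-aligned run in memory, ready to be loaded as one input vector of an outer-product instruction.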