Commit e63fa6d

Updates

1 parent fd7704f commit e63fa6d
File tree

3 files changed: +19, -18 lines changed

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ Amongst other files, it includes:

{{% notice Note %}}
From this point, all instructions assume that your current directory is
- ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. So to follow along, ensure that you are in the correct directory before proceeding.
+ ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``, so ensure that you are in the correct directory before proceeding.
{{% /notice %}}

## Set up a system with native SME2 support

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md

Lines changed: 3 additions & 3 deletions
@@ -47,7 +47,7 @@ The following variable names are used throughout the Learning Path to represent

## C implementation

- The file matmul_vanilla.c contains a reference implementation of the algorithm:
+ Here is the full reference implementation from `matmul_vanilla.c`:

```C { line_numbers="true" }
void matmul(uint64_t M, uint64_t K, uint64_t N,
@@ -69,8 +69,8 @@ void matmul(uint64_t M, uint64_t K, uint64_t N,

## Memory layout and pointer annotations

- In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above.
+ In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above.

The pointers to `matLeft`, `matRight` and `matResult` have been annotated as `restrict`, which informs the compiler that the memory areas designated by those pointers do not alias. This means that they do not overlap in any way, so the compiler does not need to insert extra instructions to deal with these cases. The pointers to `matLeft` and `matRight` are marked as `const`, as neither of these two matrices is modified by `matmul`.

- You now have a working baseline for the matrix multiplication function. You'll use it later on in this Learning Path to ensure that the assembly version and the intrinsics version of the multiplication algorithm do not contain errors.
+ This function gives you a working baseline for matrix multiplication. You'll use it later in the Learning Path to verify the correctness of optimized implementations using SME2 intrinsics and assembly.
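The diff shows only the signature and a fragment of `matmul`'s body. As a self-contained sketch of the vanilla triple-loop implementation described above (assuming `float` elements, which the diff does not show), the full loop nest could look like this:

```c
#include <stdint.h>

// Reference row-major matrix multiplication: matResult = matLeft * matRight.
// matLeft is M x K, matRight is K x N, matResult is M x N.
// `restrict` promises the compiler the three buffers do not alias.
void matmul(uint64_t M, uint64_t K, uint64_t N,
            const float *restrict matLeft,
            const float *restrict matRight,
            float *restrict matResult) {
    for (uint64_t m = 0; m < M; m++) {
        for (uint64_t n = 0; n < N; n++) {
            float acc = 0.0f;
            for (uint64_t k = 0; k < K; k++) {
                // One macc for two loads per iteration (a 1:2 ratio).
                acc += matLeft[m * K + k] * matRight[k * N + n];
            }
            matResult[m * N + n] = acc;
        }
    }
}
```

The element type and loop order are assumptions based on the surrounding text; the file in the repository remains the authoritative version.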

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md

Lines changed: 15 additions & 14 deletions
@@ -6,8 +6,12 @@ weight: 7
layout: learningpathall
---

+ ## Overview
+
In this section, you'll learn how to improve matrix multiplication performance using the SME engine and outer product operations.

+ This approach increases the number of multiply-accumulate (MACC) operations per memory load, reducing bandwidth pressure and improving overall throughput.
+
## Increase MACC efficiency using outer products

In the vanilla implementation, the core multiply-accumulate step looks like this:
@@ -16,14 +20,12 @@ In the vanilla implementation, the core multiply-accumulate step looks like this
acc += matLeft[m * K + k] * matRight[k * N + n];
```

- This translates to one multiply-accumulate operation, known as `macc`, for two loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2 `macc` to `load` ratio of multiply-accumulate operations (MACCs) to memory loads - one multiply-accumulate and two loads per iteration. This ratio limits efficiency, especially in triple-nested loops where memory bandwidth becomes a bottleneck.
+ This translates to one multiply-accumulate operation (MACC) for two loads (`matLeft[m * K + k]` and `matRight[k * N + n]`), a 1:2 `macc` to `load` ratio, which is inefficient. This becomes more pronounced in triple-nested loops and when matrices exceed cache capacity.

- To make matters worse, large matrices might not fit in cache. To improve matrix multiplication efficiency, the goal is to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load - you can express matrix multiplication as a sum of column-by-row outer products.
+ To improve performance, you want to increase the `macc` to `load` ratio, that is, the number of multiply-accumulate operations per load. One way to do this is to express matrix multiplication as a sum of column-by-row outer products.

- Figure 3 below illustrates how the matrix multiplication of `matLeft` (3 rows, 2 columns) by `matRight` (2 rows, 3 columns) can be decomposed as the sum of outer products:
+ The diagram below illustrates how the matrix multiplication of `matLeft` (3 rows, 2 columns) by `matRight` (2 rows, 3 columns) can be decomposed into a sum of column-by-row outer products:

![example image alt-text#center](outer_product.png "Figure 3: Outer product-based matrix multiplication.")
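The decomposition in Figure 3 can be sketched in plain C to make the load reuse visible. This is a hypothetical scalar illustration (function name `matmul_outer` and `float` elements are assumptions, not part of the Learning Path), not the SME implementation: with `k` as the outermost loop, each element of `matLeft`'s column is loaded once and reused across a whole row of the result.

```c
#include <stdint.h>

// Matrix multiplication as a sum of K column-by-row outer products.
// Each pass over k performs a rank-1 update of the whole M x N result,
// so the loads of matLeft and matRight are amortized over many maccs.
void matmul_outer(uint64_t M, uint64_t K, uint64_t N,
                  const float *restrict matLeft,
                  const float *restrict matRight,
                  float *restrict matResult) {
    // Clear the accumulators once up front.
    for (uint64_t i = 0; i < M * N; i++)
        matResult[i] = 0.0f;

    for (uint64_t k = 0; k < K; k++) {
        for (uint64_t m = 0; m < M; m++) {
            float l = matLeft[m * K + k];  // loaded once...
            for (uint64_t n = 0; n < N; n++) {
                // ...and reused for N maccs, improving the macc:load ratio.
                matResult[m * N + n] += l * matRight[k * N + n];
            }
        }
    }
}
```

SME's outer-product instructions perform the inner two loops in hardware on whole tiles at once; the scalar version above only shows the reordering of the computation.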
@@ -45,17 +47,16 @@ row-major order to column-major order.
This transformation affects only the memory layout. From a mathematical perspective, `matLeft` is not transposed. It is reorganized for better data locality.
{{% /notice %}}

- ### Transposition in the real world
+ ### Transposition in practice

- Just as trees don't reach the sky, the SME engine has physical implementation limits. It operates on *tiles* - 2D blocks of data stored in the ZA storage. SME has dedicated instructions to load, store, and compute on these tiles efficiently. For example, the [fmopa](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as inputs and accumulates all the outer products into a 2D tile. The tile in ZA storage allows SME to increase the `macc` to `load` ratio by loading all the tile elements to be used with the SME outer product instructions.
+ The SME engine operates on tiles - 2D blocks of data stored in the ZA storage. SME provides dedicated instructions to load, store, and compute on tiles efficiently.
+
+ For example, the [FMOPA](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as input and accumulates their outer product into a tile. The tile in ZA storage allows SME to increase the `macc` to `load` ratio by loading all the tile elements to be used with the SME outer product instructions.

- But since ZA storage is finite, you need to you need to preprocess `matLeft` to fit tile dimensions - this includes transposing portions of the matrix and padding where needed.
+ But since ZA storage is finite, you need to preprocess `matLeft` to match the tile dimensions - this includes transposing portions of the matrix and padding where needed.
+
+ ### Preprocessing with preprocess_l

The following function shows how `preprocess_l` transforms the matrix at the algorithmic level:
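The body of `preprocess_l` is elided by the diff. As a rough, hypothetical sketch of the layout transformation it describes (the name `preprocess_l_sketch`, the `SVL` parameter, and `float` elements are all assumptions, and the real function works tile by tile rather than element by element):

```c
#include <stdint.h>
#include <string.h>

// Repack an M x K row-major matrix into column-major order, padding the
// number of rows up to a multiple of SVL (the tile height in elements)
// with zeros so that whole tiles can always be loaded.
void preprocess_l_sketch(uint64_t M, uint64_t K, uint64_t SVL,
                         const float *restrict src,  // M x K, row-major
                         float *restrict dst) {      // K x M_pad, column-major view
    uint64_t M_pad = (M + SVL - 1) / SVL * SVL;  // round M up to tile height
    memset(dst, 0, K * M_pad * sizeof(float));   // zero the padding elements
    for (uint64_t m = 0; m < M; m++)
        for (uint64_t k = 0; k < K; k++)
            dst[k * M_pad + m] = src[m * K + k]; // move element (m, k)
}
```

This only illustrates the transpose-plus-padding idea; refer to the Learning Path's source for the actual `preprocess_l` implementation.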