Commit b575f5f

Clarifying
1 parent 3ddf238 commit b575f5f

1 file changed: +31 -25 lines changed

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md

Lines changed: 31 additions & 25 deletions
@@ -5,31 +5,37 @@ weight: 8
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Overview
 
-In this chapter, you will use an SME2-optimized matrix multiplication written
-directly in assembly.
+In this section, you'll learn how to run an SME2-optimized matrix multiplication implemented directly in assembly.
 
-## About the SME2 assembly implementation
+This implementation is based on the algorithm described in [Arm's SME Programmer's
+Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication) and has been adapted to integrate with the existing C and intrinsics-based code in this Learning Path. It demonstrates how to apply low-level optimizations for matrix multiplication using the SME2 instruction set, with a focus on preprocessing and outer-product accumulation.
+
+You'll explore how the assembly implementation works in practice, how it interfaces with C wrappers, and how to verify or benchmark its performance. Whether you're validating correctness or measuring execution speed, this example provides a clear, modular foundation for working with SME2 features in your own codebase.
 
-### Description
+By mastering this assembly implementation, you'll gain deeper insight into SME2 execution patterns and how to integrate low-level optimizations in high-performance workloads.
+
+## About the SME2 assembly implementation
 
-This Learning Path reuses the assembly version provided in the [SME Programmer's
+This Learning Path reuses the assembly version described in [The SME Programmer's
 Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication)
-where you will find a high-level and an in-depth description of the two steps
-performed.
+where you will find both high-level concepts and in-depth descriptions of the two key steps:
+preprocessing and matrix multiplication.
 
-The assembly versions have been modified so they coexist nicely with
-the intrinsic versions. The modifications include:
-- let the compiler manage the switching back and forth from streaming mode,
-- don't use register `x18` which is used as a platform register.
+The assembly code has been modified to work seamlessly alongside the intrinsic version.
 
-In this Learning Path:
-- the `preprocess` function is named `preprocess_l_asm` and is defined in
+The key changes include:
+* Delegating streaming mode control to the compiler
+* Avoiding register `x18`, which is reserved as a platform register
+
+Here:
+- The `preprocess` function is named `preprocess_l_asm` and is defined in
 `preprocess_l_asm.S`
-- the outer product-based matrix multiplication is named `matmul_asm_impl`and
-is defined in `matmul_asm_impl.S`.
+- The outer product-based matrix multiplication is named `matmul_asm_impl`and
+is defined in `matmul_asm_impl.S`
 
-Those 2 functions are declared in `matmul.h`:
+Both functions are declared in `matmul.h`:
 
 ```C
 // Matrix preprocessing, in assembly.
@@ -43,10 +49,9 @@ void matmul_asm_impl(
 float *restrict matResult) __arm_streaming __arm_inout("za");
 ```
 
-You will note that they have been marked with 2 attributes: `__arm_streaming`
+You can see that they have been marked with two attributes: `__arm_streaming`
 and `__arm_inout("za")`. This instructs the compiler that these functions
-expect the streaming mode to be active, and that they don't new to save /
-restore the ZA storage.
+expect the streaming mode to be active, and that they don't need to save or restore the ZA storage.
 
 These two functions are stitched together in `matmul_asm.c` with the
 same prototype as the reference implementation of matrix multiplication, so that
@@ -63,13 +68,14 @@ __arm_new("za") __arm_locally_streaming void matmul_asm(
 }
 ```
 
-Note that `matmul_asm` has been annotated with 2 attributes:
-`__arm_new("za")` and `__arm_locally_streaming`. This instructs the compiler
-to swith to streaming mode and save the ZA storage (and restore it when the
-function returns).
+You can see that `matmul_asm` has been annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return
+
+## How it integrates with the main function
+
+The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.
+
+## Execution modes
 
-The high-level `matmul_asm` function is called from `main.c`. This file might look a bit complex at first sight, but fear not, here are some explanations:
-- the same `main.c` is used for the assembly- and intrinsic-based versions of the matrix multiplication --- this is parametrized at compilation time with the `IMPL` macro. This avoids code duplication and improves maintenance.
 - on a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (resp. intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
 
 ```C { line_numbers="true" }
