content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md
Lines changed: 31 additions & 25 deletions
@@ -5,31 +5,37 @@ weight: 8
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Overview
 
-In this chapter, you will use an SME2-optimized matrix multiplication written
-directly in assembly.
+In this section, you'll learn how to run an SME2-optimized matrix multiplication implemented directly in assembly.
 
-## About the SME2 assembly implementation
+This implementation is based on the algorithm described in [Arm's SME Programmer's
+Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication) and has been adapted to integrate with the existing C and intrinsics-based code in this Learning Path. It demonstrates how to apply low-level optimizations for matrix multiplication using the SME2 instruction set, with a focus on preprocessing and outer-product accumulation.
+
+You'll explore how the assembly implementation works in practice, how it interfaces with C wrappers, and how to verify or benchmark its performance. Whether you're validating correctness or measuring execution speed, this example provides a clear, modular foundation for working with SME2 features in your own codebase.
 
-### Description
+By mastering this assembly implementation, you'll gain deeper insight into SME2 execution patterns and how to integrate low-level optimizations in high-performance workloads.
+
+## About the SME2 assembly implementation
 
-This Learning Path reuses the assembly version provided in the [SME Programmer's
+This Learning Path reuses the assembly version described in [The SME Programmer's
-Note that `matmul_asm` has been annotated with 2 attributes:
-`__arm_new("za")` and `__arm_locally_streaming`. This instructs the compiler
-to switch to streaming mode and save the ZA storage (and restore it when the
-function returns).
+You can see that `matmul_asm` has been annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return.
+
+## How it integrates with the main function
+
+The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.
+
+## Execution modes
 
-The high-level `matmul_asm` function is called from `main.c`. This file might look a bit complex at first sight, but fear not, here are some explanations:
-- the same `main.c` is used for the assembly- and intrinsic-based versions of the matrix multiplication --- this is parametrized at compilation time with the `IMPL` macro. This avoids code duplication and improves maintenance.
 - on a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (resp. intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
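
The compile-time selection through `IMPL` and the verification mode described in this diff can be sketched in portable C. This is a minimal sketch under stated assumptions, not the Learning Path's actual `main.c`: the names `matmul_ref`, `matmul_alt`, `MATMUL_IMPL`, and `verify_selected_impl` are hypothetical, `matmul_alt` is a plain C stand-in for the optimized routine (it uses no SME2 instructions or attributes), and a row-major layout is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the build may pass -DIMPL=<suffix>; default to the stand-in. */
#ifndef IMPL
#define IMPL alt
#endif

/* Two-level macro so that IMPL is expanded before token pasting. */
#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)
#define MATMUL_IMPL CONCAT(matmul_, IMPL)

/* Reference implementation: row-major C[M x N] = A[M x K] * B[K x N]. */
static void matmul_ref(size_t M, size_t K, size_t N,
                       const float *A, const float *B, float *C) {
    assert(A && B && C);
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}

/* Hypothetical stand-in for an optimized variant: same result, but it
   accumulates one outer-product style update of C per value of k. */
static void matmul_alt(size_t M, size_t K, size_t N,
                       const float *A, const float *B, float *C) {
    assert(A && B && C);
    for (size_t i = 0; i < M * N; i++)
        C[i] = 0.0f;
    for (size_t k = 0; k < K; k++)
        for (size_t i = 0; i < M; i++)
            for (size_t j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

/* Verification mode: run the reference and the selected implementation
   on the same inputs and return the number of mismatching elements. */
static int verify_selected_impl(void) {
    enum { M = 3, K = 4, N = 2 };
    float A[M * K], B[K * N], expected[M * N], actual[M * N];

    /* Small integer values keep every product and sum exact in float. */
    for (int i = 0; i < M * K; i++) A[i] = (float)(i % 5) - 2.0f;
    for (int i = 0; i < K * N; i++) B[i] = (float)(i % 3) + 1.0f;

    matmul_ref(M, K, N, A, B, expected);
    MATMUL_IMPL(M, K, N, A, B, actual); /* expands to matmul_<IMPL>(...) */

    int mismatches = 0;
    for (int i = 0; i < M * N; i++) {
        float diff = expected[i] - actual[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > 1e-6f) mismatches++;
    }
    return mismatches;
}
```

A caller would treat a return value of 0 from `verify_selected_impl()` as a pass. In this sketch, building with `-DIMPL=ref` would route the call to `matmul_ref` itself; a real build would presumably point `IMPL` at the intrinsic or assembly function instead, and a benchmarking mode would time the selected call rather than compare its output.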