Commit 69e0aa8

Tweaks

1 parent f8f643a commit 69e0aa8
5 files changed: +38 −94 lines changed
Lines changed: 15 additions & 33 deletions
@@ -1,12 +1,12 @@
 ---
-title: Beyond this implementation
+title: Going further
 weight: 12

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Going further
+## Beyond this implementation

 There are many different ways that you can extend and optimize the matrix multiplication algorithm beyond the specific SME2 implementation that you've explored in this Learning Path. While the current approach is tuned for performance on a specific hardware target, further improvements can make your code more general, more efficient, and better suited to a wider range of applications.

@@ -22,9 +22,9 @@ Some ideas of improvements that you might like to test out include:

 ## Generalize the algorithm for different data types

 So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well.

-The structure of the algorithm remains consistent across data types. It uses preprocessing with tiling and outer product–based multiplication. To adapt it for other data types, you only need to change how values are:
+The structure of the algorithm (the core logic: preprocessing with tiling, outer product–based multiplication, and accumulation) remains consistent across data types. To adapt it for other data types, you only need to change how values are:

 * Loaded from memory
 * Accumulated (often with widening)
@@ -35,10 +35,10 @@ Languages that support [generic programming](https://en.wikipedia.org/wiki/Gener
 Templates allow you to:

 * Swap data types flexibly
-* Handle accumulation in a wider format (a common requirement)
+* Handle accumulation in a wider format when needed
 * Reuse algorithm logic across multiple matrix types

-By expressing the algorithm generically, you benefit from the compiler generating multiple variants, allowing you the opportunity to focus on:
+By expressing the algorithm generically, you let the compiler generate multiple optimized variants, freeing you to focus on:

 - Designing an efficient algorithm
 - Testing and verification
@@ -55,41 +55,23 @@ svld1_x2(...); // Load two vectors at once
 ```
 Loading two vectors at a time enables more tiles to be computed simultaneously. Since the matrices are already laid out efficiently in memory, consecutive loads are fast. Implementing this approach can improve the ``macc``-to-``load`` ratio.

-In order to check your understanding of SME2, you can try to implement this
-unrolling yourself in the intrinsic version (the assembly version already has this
-optimization). You can check your work by comparing your results to the expected
-reference values.
+To check your understanding of SME2, you can try to implement this unrolling yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values.

-## Apply strategies
+## Optimize for special matrix shapes

-One method for optimization is to use strategies that are flexible depending on
-the matrices' dimensions. This is especially easy to set up when working in C or
-C++, rather than directly in assembly language.
+One optimization method is to choose a strategy based on the matrices' dimensions. This is especially easy to set up when working in C or C++, rather than directly in assembly language.

-By playing with the mathematical properties of matrix multiplication and the
-outer product, it is possible to minimize data movement as well as reduce the
-overall number of operations to perform.
+By exploiting the mathematical properties of matrix multiplication and the outer product, you can minimize data movement and reduce the overall number of operations.

-For example, it is common that one of the matrices is actually a vector, meaning
-that it has a single row or column, and then it becomes advantageous to
-transpose it. Can you see why?
+For example, one of the matrices is often actually a vector, meaning that it has a single row or column, and it then becomes advantageous to transpose it. Can you see why?

-The answer is that as the elements are stored contiguously in memory, an ``Nx1``
-and ``1xN`` matrices have the exact same memory layout. The transposition
-becomes a no-op, and the matrix elements stay in the same place in memory.
+The answer is that, because the elements are stored contiguously in memory, ``Nx1`` and ``1xN`` matrices have exactly the same memory layout. The transposition becomes a no-op, and the matrix elements stay in the same place in memory.

-An even more *degenerated* case that is easy to manage is when one of the
-matrices is essentially a scalar, which means that it is a matrix with one row
-and one column.
+An even more *degenerate* case that is easy to handle is when one of the matrices is essentially a scalar, that is, a matrix with one row and one column.

-Although our current code handles it correctly from a results point of view, a
-different algorithm and use of instructions might be more efficient. Can you
-think of another way?
+Although the current code handles this case correctly from a results point of view, a different algorithm and choice of instructions might be more efficient. Can you think of another way?


-In order to check your understanding of SME2, you can try to implement this
-unrolling yourself in the intrinsic version (the asm version already has this
-optimization). You can check your work by comparing your results to the expected
-reference values.
+To check your understanding of SME2, you can try to implement these optimizations yourself in the intrinsic version (the assembly version already has them). You can check your work by comparing your results to the expected reference values.

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md

Lines changed: 15 additions & 38 deletions
@@ -30,10 +30,8 @@ The key changes include:
 * Avoiding register `x18`, which is reserved as a platform register

 Here:
-- The `preprocess` function is named `preprocess_l_asm` and is defined in
-`preprocess_l_asm.S`
-- The outer product-based matrix multiplication is named `matmul_asm_impl`and
-is defined in `matmul_asm_impl.S`
+- The `preprocess` function is named `preprocess_l_asm` and is defined in `preprocess_l_asm.S`
+- The outer product-based matrix multiplication is named `matmul_asm_impl` and is defined in `matmul_asm_impl.S`

 Both functions are declared in `matmul.h`:
3937

@@ -49,13 +47,9 @@ void matmul_asm_impl(
 float *restrict matResult) __arm_streaming __arm_inout("za");
 ```

-You can see that they have been marked with two attributes: `__arm_streaming`
-and `__arm_inout("za")`. This instructs the compiler that these functions
-expect the streaming mode to be active, and that they don't need to save or restore the ZA storage.
+Both functions are annotated with the `__arm_streaming` and `__arm_inout("za")` attributes. These indicate that the function expects streaming mode to be active and does not need to save or restore the ZA storage.

-These two functions are stitched together in `matmul_asm.c` with the
-same prototype as the reference implementation of matrix multiplication, so that
-a top-level `matmul_asm` can be called from the `main` function:
+These two functions are stitched together in `matmul_asm.c` with the same prototype as the reference implementation of matrix multiplication, so that a top-level `matmul_asm` can be called from the `main` function:

 ```C
 __arm_new("za") __arm_locally_streaming void matmul_asm(
@@ -68,15 +62,15 @@ __arm_new("za") __arm_locally_streaming void matmul_asm(
 }
 ```

-You can see that `matmul_asm` has been annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return
+You can see that `matmul_asm` is annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return.

 ## How it integrates with the main function

 The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.

 ## Execution modes

-- on a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (resp. intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
+- On a baremetal platform, the program runs in *verification mode*, where it compares the results of the assembly-based matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.

 ```C { line_numbers="true" }
 #ifndef __ARM_FEATURE_SME2
@@ -227,8 +221,8 @@ int main(int argc, char **argv) {
 float *matResult_ref = (float *)malloc(M * N * sizeof(float));

 // Initialize matrices. Input matrices are initialized with random values in
-// non debug mode. In debug mode, all matrices are initialized with linear
-// or known values values for easier debugging.
+// non-debug mode. In debug mode, all matrices are initialized with linear
+// or known values for easier debugging.
 #ifdef DEBUG
 initialize_matrix(matLeft, M * K, LINEAR_INIT);
 initialize_matrix(matRight, K * N, LINEAR_INIT);
@@ -327,36 +321,19 @@ int main(int argc, char **argv) {
 }
 ```

-The same `main.c` file is used for the assembly and intrinsic-based versions
-of the matrix multiplication. It first sets the `M`, `K` and `N`
-parameters, to either the arguments supplied on the command line (lines 93-95)
-or uses the default value (lines 73-75). In non-baremetal mode, it also accepts
-(lines 82-89 and lines 98-108), as first parameter, an iteration count `I`
+The same `main.c` file is used for the assembly and intrinsic-based versions of the matrix multiplication. It first sets the `M`, `K`, and `N` parameters, either to the arguments supplied on the command line (lines 93-95) or to the default values (lines 73-75). In non-baremetal mode, it also accepts as its first parameter (lines 82-89 and 98-108) an iteration count `I`
 used for benchmarking.

-Depending on the `M`, `K`, `N` dimension parameters, `main` allocates
-memory for all the matrices and initializes `matLeft` and `matRight` with
-random data. The actual matrix multiplication implementation is provided through
-the `IMPL` macro.
+Depending on the `M`, `K`, `N` dimension parameters, `main` allocates memory for all the matrices and initializes `matLeft` and `matRight` with random data. The actual matrix multiplication implementation is provided through the `IMPL` macro.

-In *verification mode*, it then runs the matrix multiplication from `IMPL`
-(line 167) and computes the reference values for the preprocessed matrix as well
-as the result matrix (lines 170 and 171). It then compares the actual values to
-the reference values and reports errors, if there are any (lines 173-177).
-Finally, all the memory is deallocated (lines 236-243) before exiting the
+In *verification mode*, it runs the matrix multiplication from `IMPL` (line 167) and computes the reference values for the preprocessed matrix as well as the result matrix (lines 170 and 171). It then compares the actual values to the reference values and reports errors, if there are any (lines 173-177). Finally, all the memory is deallocated (lines 236-243) before exiting the
 program with a success or failure return code at line 245.

-In *benchmarking mode*, it will first run the vanilla reference matrix
-multiplication (resp. assembly- or intrinsic-based matrix multiplication) 10
-times without measuring elapsed time to warm-up the CPU. It will then measure
-the elapsed execution time of the vanilla reference matrix multiplication (resp.
-assembly- or intrinsic-based matrix multiplication) `I` times and then compute
+In *benchmarking mode*, it will first run the selected matrix multiplication (vanilla reference, assembly-based, or intrinsic-based) 10 times, without measuring elapsed time, to warm up the CPU. It will then measure the elapsed execution time of the same implementation `I` times and then compute
 and report the minimum, maximum and average execution times.

 {{% notice Note %}}
-Benchmarking and profiling are not simple tasks. The purpose of this Learning Path
-is to provide some basic guidelines on the performance improvement that can be
-obtained with SME2.
+Benchmarking and profiling are not simple tasks. The purpose of this Learning Path is to give a basic indication of the performance improvement that can be obtained with SME2.
 {{% /notice %}}

 ### Compile and run it
@@ -401,7 +378,7 @@ whether the preprocessing and matrix multiplication passed (`PASS`) or failed
 (`FAILED`) the comparison with the vanilla reference implementation.

 {{% notice Tip %}}
-The example above uses the default values for the `M` (125), `K`(25) and `N`(70)
+The example above uses the default values for the `M` (125), `K` (70), and `N` (70)
 parameters. You can override this and provide your own values on the command line:

 {{< tabpane code=true >}}
{{< /tab >}}
415392
{{< /tabpane >}}
416393
417-
Here the values `M=7`, `K=8` and `N=9` are used instead.
394+
In this example, `M=7`, `K=8`, and `N=9` are used.
418395
{{% /notice %}}

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md

Lines changed: 6 additions & 17 deletions
@@ -1,30 +1,22 @@
 ---
-title: SME2 intrinsics matrix multiplication
+title: Matrix multiplication using SME2 intrinsics in C
 weight: 9

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-In this section, you will write an SME2 optimized matrix multiplication in C
-using the intrinsics that the compiler provides.
+In this section, you will write an SME2-optimized matrix multiplication routine in C using the intrinsics that the compiler provides.

-## Matrix multiplication with SME2 intrinsics
+## What are intrinsics?

-*Intrinsics*, also know known as *compiler intrinsics* or *intrinsic functions*,
-are the functions available to application developers that the compiler has an
-intimate knowledge of. This enables the compiler to either translate the
-function to a specific instruction or to perform specific optimizations, or
-both.
+*Intrinsics*, also known as *compiler intrinsics* or *intrinsic functions*, are functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to translate the function to a specific instruction, to perform specific optimizations, or both.

 You can learn more about intrinsics in this [Wikipedia
 article on intrinsic functions](https://en.wikipedia.org/wiki/Intrinsic_function).

 Using intrinsics allows the programmer to use the specific instructions required
-to achieve the required performance while writing in C all the
-typically-required standard code, such as loops. This produces performance close
-to what can be reached with hand-written assembly whilst being significantly
-more maintainable and portable.
+to achieve the desired performance while writing all the standard surrounding code, such as loops, in C. This produces performance close to what can be reached with hand-written assembly while being significantly more maintainable and portable.

 All Arm-specific intrinsics are specified in the
 [ACLE](https://github.com/ARM-software/acle), which is the Arm C Language Extension. ACLE
@@ -51,10 +43,7 @@ Note the `__arm_new("za")` and `__arm_locally_streaming` at line 1 that will
 make the compiler save the ZA storage so we can use it without destroying its
 content if it was still in use by one of the callers.

-`SVL`, the dimension of the ZA storage, is requested from the underlying
-hardware with the `svcntsw()` function call at line 5, and passed down to the
-`preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a
-function provided be the ACLE library.
+`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library.

 ### Matrix preprocessing

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md

Lines changed: 1 addition & 5 deletions
@@ -39,11 +39,7 @@ Reference implementation: min time = 101 us, max time = 438 us, avg time = 139.4
 SME2 implementation *intr*: min time = 1 us, max time = 8 us, avg time = 1.82 us
 ```

-The execution time is reported in microseconds. A wide spread between the
-minimum and maximum figures can be noted and is expected as the way of doing the
-benchmarking is simplified for the purpose of simplicity. You will, however,
-note that the intrinsic version of the matrix multiplication brings on average a
-76x execution time reduction.
+The execution time is reported in microseconds. A wide spread between the minimum and maximum figures is expected, because the benchmarking method is deliberately kept simple. Note, however, that the intrinsic version of the matrix multiplication reduces the average execution time by a factor of about 76.

 {{% notice Tip %}}
 You can override the default values for `M` (125), `K` (25), and `N` (70) and

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md

Lines changed: 1 addition & 1 deletion
@@ -105,5 +105,5 @@ Tracing is disabled by default because it significantly slows down simulation an

 ## Use debug mode for matrix inspection

-It can be helpful when debugging to understand where an element in the Tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no
+It can be helpful when debugging to understand where an element in the tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no
 longer random, but instead initializes each element with its linear index. This makes it *easier* to find where the matrix elements are loaded in the tile in the tarmac trace, for example.

0 commit comments
