There are many different ways that you can extend and optimize the matrix multiplication algorithm beyond the specific SME2 implementation that you've explored in this Learning Path. While the current approach is tuned for performance on a specific hardware target, further improvements can make your code more general, more efficient, and better suited to a wider range of applications.
Some ideas for improvements that you might like to test out include:
## Generalize the algorithm for different data types
So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well.
The core structure of the algorithm remains consistent across data types: preprocessing with tiling, followed by outer product-based multiplication and accumulation. To adapt it for other data types, you only need to change how values are:
* Loaded from memory
* Accumulated (often with widening)
Languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates, make this adaptation easier to express.
Templates allow you to:
* Swap data types flexibly
* Handle accumulation in a wider format when needed
* Reuse algorithm logic across multiple matrix types
By expressing the algorithm generically, you benefit from the compiler generating multiple optimized variants, allowing you to focus on:
- Efficient algorithm design
- Testing and verification
```C
svld1_x2(...); // Load two vectors at once
```
Loading two vectors at a time enables more tiles to be computed simultaneously. Since the matrices are already laid out efficiently in memory, consecutive loads are fast. Implementing this approach improves the ``macc``-to-``load`` ratio.
In order to check your understanding of SME2, you can try to implement this unrolling yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values.
## Optimize for special matrix shapes
One method for optimization is to use strategies that are flexible depending on the matrices' dimensions. This is especially easy to set up when working in C or C++, rather than directly in assembly language.
By exploiting the mathematical properties of matrix multiplication and the outer product, you can minimize data movement as well as reduce the overall number of operations to perform.
For example, it is common that one of the matrices is actually a vector, meaning that it has a single row or column; in that case, it becomes advantageous to transpose it. Can you see why?
The answer is that, because the elements are stored contiguously in memory, ``Nx1`` and ``1xN`` matrices have exactly the same memory layout. The transposition becomes a no-op, and the matrix elements stay in the same place in memory.
An even more *degenerate* case that is easy to handle is when one of the matrices is essentially a scalar, that is, a matrix with one row and one column.
Although the current code handles this case correctly from a results point of view, a different algorithm and choice of instructions might be more efficient. Can you think of another way?
Both functions are annotated with the `__arm_streaming` and `__arm_inout("za")` attributes. These indicate that the functions expect streaming mode to be active and do not need to save or restore the ZA storage.
These two functions are stitched together in `matmul_asm.c` with the same prototype as the reference implementation of matrix multiplication, so that a top-level `matmul_asm` can be called from the `main` function:
You can see that `matmul_asm` is annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage the ZA state on entry and return.
## How it integrates with the main function
The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.
## Execution modes
On a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (or intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
```C { line_numbers="true" }
#ifndef __ARM_FEATURE_SME2
...

int main(int argc, char **argv) {
  ...
  float *matResult_ref = (float *)malloc(M * N * sizeof(float));

  // Initialize matrices. Input matrices are initialized with random values in
  // non-debug mode. In debug mode, all matrices are initialized with linear
  // or known values for easier debugging.
#ifdef DEBUG
  initialize_matrix(matLeft, M * K, LINEAR_INIT);
  initialize_matrix(matRight, K * N, LINEAR_INIT);
  ...
}
```
The same `main.c` file is used for the assembly and intrinsic-based versions of the matrix multiplication. It first sets the `M`, `K`, and `N` parameters, either to the arguments supplied on the command line (lines 93-95) or to the default values (lines 73-75). In non-baremetal mode, it also accepts an iteration count `I` as its first parameter (lines 82-89 and 98-108), used for benchmarking.
Depending on the `M`, `K`, `N` dimension parameters, `main` allocates memory for all the matrices and initializes `matLeft` and `matRight` with random data. The actual matrix multiplication implementation is provided through the `IMPL` macro.
In *verification mode*, it then runs the matrix multiplication from `IMPL` (line 167) and computes the reference values for the preprocessed matrix as well as the result matrix (lines 170 and 171). It then compares the actual values to the reference values and reports errors, if there are any (lines 173-177). Finally, all the memory is deallocated (lines 236-243) before exiting the program with a success or failure return code at line 245.
In *benchmarking mode*, it first runs the matrix multiplication (the vanilla reference, and likewise the assembly- or intrinsic-based version) 10 times without measuring elapsed time, to warm up the CPU. It then measures the elapsed execution time of each implementation over `I` runs, and computes and reports the minimum, maximum, and average execution times.
{{% notice Note %}}
Benchmarking and profiling are not simple tasks. The purpose of this Learning Path is to provide some basic guidelines on the performance improvement that can be obtained with SME2.
{{% /notice %}}
### Compile and run it
The program reports whether the preprocessing and matrix multiplication passed (`PASS`) or failed (`FAILED`) the comparison with the vanilla reference implementation.
{{% notice Tip %}}
The example above uses the default values for the `M` (125), `K` (70), and `N` (70) parameters. You can override this and provide your own values on the command line:
{{< tabpane code=true >}}
{{< /tab >}}
{{< /tabpane >}}
In this example, the values `M=7`, `K=8`, and `N=9` are used instead.
---
title: Matrix multiplication using SME2 intrinsics in C
weight: 9
### FIXED, DO NOT MODIFY
layout: learningpathall
---
In this section, you will write an SME2-optimized matrix multiplication routine in C using the intrinsics that the compiler provides.
## What are intrinsics?
*Intrinsics*, also known as *compiler intrinsics* or *intrinsic functions*, are functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to translate the function to a specific instruction, to perform specific optimizations, or both.
You can learn more about intrinsics in this [Wikipedia article on intrinsic functions](https://en.wikipedia.org/wiki/Intrinsic_function).
Using intrinsics allows the programmer to use the specific instructions required to achieve the required performance, while writing all the typically-required standard code, such as loops, in C. This produces performance close to what can be reached with hand-written assembly, whilst being significantly more maintainable and portable.
All Arm-specific intrinsics are specified in the [ACLE](https://github.com/ARM-software/acle), the Arm C Language Extensions.
Note the `__arm_new("za")` and `__arm_locally_streaming` attributes at line 1, which make the compiler save the ZA storage so that it can be used without destroying its content if it was still in use by one of the callers.
`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library.
```
Reference implementation: min time = 101 us, max time = 438 us, avg time = 139.4 us
SME2 implementation *intr*: min time = 1 us, max time = 8 us, avg time = 1.82 us
```
The execution time is reported in microseconds. A wide spread between the minimum and maximum figures can be noted; this is expected, as the benchmarking method is deliberately kept simple. You will, however, note that the intrinsic version of the matrix multiplication brings, on average, a 76x execution-time reduction.
{{% notice Tip %}}
You can override the default values for `M` (125), `K` (25), and `N` (70) and provide your own values on the command line.
{{% /notice %}}
Tracing is disabled by default because it significantly slows down simulation.
## Use debug mode for matrix inspection
It can be helpful when debugging to understand where an element in the tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no longer random, but instead initializes each element with its linear index. This makes it *easier* to find where the matrix elements are loaded in the tile in a Tarmac trace, for example.