Commit 0bc079a

Merge branch 'ArmDeveloperEcosystem:main' into main

2 parents 4f758a9 + a34f4da commit 0bc079a

16 files changed: +1560 −995 lines

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md

Lines changed: 204 additions & 117 deletions
Large diffs are not rendered by default.

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
---
title: Going further
weight: 12

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, you will learn about the many different optimizations that are
available to you.

## Generalize the algorithms

In this Learning Path, you focused on using SME2 for matrix multiplication with
floating-point numbers. In practice, however, any library or framework that
supports matrix multiplication should also handle various integer types.

You can see that the structure of the algorithms for matrix preprocessing and
for multiplication with the outer product does not change at all for other data
types; only the element types and the corresponding instructions need to be
adapted.

This is well suited to languages with [generic
programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++
with templates. You can even make the template handle the case where the value
accumulated during the product uses a larger type than the input matrices; SME2
has instructions to deal efficiently with this common scenario.

This enables the library developer to focus on the algorithm, testing, and
optimizations, while letting the compiler generate the multiple variants.

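The template idea above can be sketched in standard C++. This is an illustrative scalar sketch, not code from the Learning Path sources: the function name and structure are hypothetical, and the point is only how one template can serve several element/accumulator type pairs.

```cpp
#include <cstdint>

// Hypothetical sketch: a generic matrix multiply where the accumulator type
// AccT may be wider than the element type ElemT, mirroring SME2's widening
// outer-product instructions (for example, int8_t inputs accumulated in
// int32_t). The compiler generates one variant per instantiation.
template <typename ElemT, typename AccT>
void matmul_generic(uint64_t M, uint64_t K, uint64_t N,
                    const ElemT *matLeft, const ElemT *matRight,
                    AccT *matResult) {
    for (uint64_t m = 0; m < M; m++) {
        for (uint64_t n = 0; n < N; n++) {
            AccT acc = 0; // accumulate in the wider type
            for (uint64_t k = 0; k < K; k++)
                acc += AccT(matLeft[m * K + k]) * AccT(matRight[k * N + n]);
            matResult[m * N + n] = acc;
        }
    }
}
```

Instantiating `matmul_generic<int8_t, int32_t>` or `matmul_generic<float, float>` then gives the per-type variants from a single source.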
## Unroll further

You might have noticed that ``matmul_intr_impl`` computes only one tile at a
time, for the sake of simplicity.

SME2 does support multi-vector instructions, and some were used in
``preprocess_l_intr``, for example, ``svld1_x2``.

Loading two vectors at a time enables more tiles to be computed simultaneously,
and as the input matrices have been laid out neatly in memory, the consecutive
loading of the data is efficient. Implementing this approach improves the
multiply-accumulate (``macc``) to load ratio.

To check your understanding of SME2, you can try to implement this unrolling
yourself in the intrinsics version (the asm version already has this
optimization). You can check your work by comparing your results to the
expected reference values.

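The effect of unrolling on the ``macc``-to-load ratio can be illustrated with a plain scalar sketch. This is not SME2 code and the names are hypothetical; it only shows why computing two outputs per iteration reuses each loaded value.

```cpp
#include <cstddef>

// Hypothetical scalar analogue of the unrolling idea: a matrix-vector product
// that processes two output rows per iteration. Each loaded b[k] now feeds two
// multiply-accumulates instead of one, improving the macc-to-load ratio, just
// as computing more tiles per loop does with SME2 multi-vector loads.
void matvec_unrolled2(size_t M, size_t K,
                      const float *a, const float *b, float *c) {
    size_t m = 0;
    for (; m + 1 < M; m += 2) {          // two rows at a time
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t k = 0; k < K; k++) {
            float bk = b[k];             // one load ...
            acc0 += a[m * K + k] * bk;   // ... feeds two maccs
            acc1 += a[(m + 1) * K + k] * bk;
        }
        c[m] = acc0;
        c[m + 1] = acc1;
    }
    for (; m < M; m++) {                 // tail row if M is odd
        float acc = 0.0f;
        for (size_t k = 0; k < K; k++)
            acc += a[m * K + k] * b[k];
        c[m] = acc;
    }
}
```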
## Apply strategies

One optimization method is to use strategies that adapt to the matrices'
dimensions. This is especially easy to set up when working in C or C++, rather
than directly in assembly language.

By exploiting the mathematical properties of matrix multiplication and the
outer product, it is possible to minimize data movement as well as reduce the
overall number of operations to perform.

For example, it is common for one of the matrices to actually be a vector,
meaning that it has a single row or column, and then it becomes advantageous to
transpose it. Can you see why?

The answer is that, as the elements are stored contiguously in memory, ``Nx1``
and ``1xN`` matrices have the exact same memory layout. The transposition
becomes a no-op, and the matrix elements stay in the same place in memory.
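The layout claim can be checked directly in a few lines of standard C++ (illustrative only; the function name is not from the Learning Path):

```cpp
#include <cstring>

// In row-major order, an Nx1 column vector and a 1xN row vector occupy
// identical memory: same values, same order, same size. "Transposing" one
// into the other is therefore a no-op.
bool transpose_is_noop_for_vectors() {
    float column[4] = {1.0f, 2.0f, 3.0f, 4.0f}; // 4x1: four rows of one element
    float row[4]    = {1.0f, 2.0f, 3.0f, 4.0f}; // 1x4: one row of four elements
    return std::memcmp(column, row, sizeof(column)) == 0;
}
```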

An even more *degenerate* case that is easy to handle is when one of the
matrices is essentially a scalar, which means that it is a matrix with one row
and one column.

Although our current code handles it correctly from a results point of view, a
different algorithm and choice of instructions might be more efficient. Can you
think of another way?
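One possible answer, sketched with a hypothetical helper name (this is not the Learning Path's code): when one operand is 1x1, the multiplication degenerates to scaling every element of the other matrix, with no accumulation and no outer products at all.

```cpp
#include <cstddef>

// Hypothetical sketch: multiplying by a 1x1 matrix (a scalar s) reduces to
// one multiply per element over a single contiguous pass, which maps well to
// simple vectorized scaling rather than outer-product accumulation.
void matmul_by_scalar(size_t rows, size_t cols, float s,
                      const float *mat, float *result) {
    for (size_t i = 0; i < rows * cols; i++)
        result[i] = s * mat[i];
}
```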

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md

Lines changed: 185 additions & 107 deletions
Large diffs are not rendered by default.

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: Streaming mode
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In real-world, large-scale software, a program moves back and forth between
streaming and non-streaming mode, and some streaming mode routines call other
streaming mode routines, which means that some state, including the ZA storage,
needs to be saved and restored. This is defined in the ACLE and supported by
the compiler: the programmer *just* has to annotate the functions with some
keywords and let the compiler automatically perform the low-level tasks of
managing the streaming mode. This frees the developer from a tedious and
error-prone task. See [Introduction to streaming and non-streaming
mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode)
for further information. The rest of this section references information from
the ACLE.

## About streaming mode

The AArch64 architecture defines a concept called *streaming mode*, controlled
by a processor state bit called `PSTATE.SM`. At any given point in time, the
processor is either in streaming mode (`PSTATE.SM==1`) or in non-streaming mode
(`PSTATE.SM==0`). There is an instruction called `SMSTART` to enter streaming
mode and an instruction called `SMSTOP` to return to non-streaming mode.

Streaming mode has three main effects on C and C++ code:

- It can change the length of SVE vectors and predicates: the length of an SVE
  vector in streaming mode is called the “streaming vector length” (SVL), which
  might be different from the normal non-streaming vector length. See
  [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl)
  for more details.
- Some instructions can only be executed in streaming mode, which means that
  their associated ACLE intrinsics can only be used in streaming mode. These
  intrinsics are called “streaming intrinsics”.
- Some other instructions can only be executed in non-streaming mode, which
  means that their associated ACLE intrinsics can only be used in non-streaming
  mode. These intrinsics are called “non-streaming intrinsics”.

The C and C++ standards define the behavior of programs in terms of an
*abstract machine*. As an extension, the ACLE specification applies the
distinction between streaming mode and non-streaming mode to this abstract
machine: at any given point in time, the abstract machine is either in
streaming mode or in non-streaming mode.

This distinction between processor mode and abstract machine mode is mostly
just a specification detail. However, the usual “as if” rule applies: the
processor's actual mode at runtime can be different from the abstract
machine's mode, provided that this does not alter the behavior of the program.
One practical consequence of this is that C and C++ code does not specify the
exact placement of `SMSTART` and `SMSTOP` instructions; the source code simply
places limits on where such instructions go. For example, when stepping
through a program in a debugger, the processor mode might sometimes be
different from the one implied by the source code.

ACLE provides attributes that specify whether the abstract machine executes statements:

- In non-streaming mode, in which case they are called *non-streaming statements*.
- In streaming mode, in which case they are called *streaming statements*.
- In either mode, in which case they are called *streaming-compatible statements*.
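As an illustration of the annotation style the text describes, function declarations can carry ACLE keywords that fix which of these modes their statements execute in. This is a sketch, not code from the Learning Path: the function names are hypothetical, and compiling such declarations requires an SME-aware toolchain (for example, a recent Clang targeting AArch64 with SME enabled).

```cpp
// Hypothetical declarations using ACLE streaming-mode keywords.

// Its statements are streaming statements; the compiler arranges for the
// processor to be in streaming mode when the body runs.
void process_tile() __arm_streaming;

// Its statements are streaming-compatible statements; the function can be
// called from either mode without a mode switch.
void copy_results() __arm_streaming_compatible;
```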

SME provides an area of storage called ZA, of size `SVL.B` x `SVL.B` bytes. It
also provides a processor state bit called `PSTATE.ZA` to control whether ZA
is enabled.

In C and C++ code, access to ZA is controlled at function granularity: a
function either uses ZA or it does not. Another way to say this is that a
function either “has ZA state” or it does not.

If a function does have ZA state, the function can either share that ZA state
with its caller or create new ZA state “from scratch”. In the latter case, it
is the compiler's responsibility to free up ZA so that the function can use
it; see the description of the lazy saving scheme in
[AAPCS64](https://arm-software.github.io/acle/main/acle.html#AAPCS64) for
details about how the compiler does this.

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-outer-product.md

Lines changed: 0 additions & 108 deletions
This file was deleted.
@@ -1,16 +1,14 @@
---
title: Vanilla matrix multiplication
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, you will learn about an example of standard matrix multiplication in C.

## Vanilla matrix multiplication algorithm

The vanilla matrix multiplication operation takes two input matrices, A [Ar
rows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C
@@ -22,7 +20,6 @@ element in the B column then summing all these products, as Figure 2 shows.

This implies that the A, B, and C matrices have some constraints on their
dimensions:
- A's number of columns must match B's number of rows: Ac == Br.
- C has the dimensions Cr == Ar and Cc == Bc.

@@ -31,22 +28,21 @@ properties and use, by reading this [Wikipedia
article on Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication).

In this Learning Path, you will see the following variable names:
- `matLeft` corresponds to the left-hand side argument of the matrix
  multiplication.
- `matRight` corresponds to the right-hand side of the matrix multiplication.
- `M` is `matLeft`'s number of rows.
- `K` is `matLeft`'s number of columns (and `matRight`'s number of rows).
- `N` is `matRight`'s number of columns.
- `matResult` corresponds to the result of the matrix multiplication, with
  `M` rows and `N` columns.

## C implementation

A literal implementation of the textbook matrix multiplication algorithm, as
described above, can be found in file `matmul_vanilla.c`:

```C { line_numbers="true" }
void matmul(uint64_t M, uint64_t K, uint64_t N,
            const float *restrict matLeft, const float *restrict matRight,
            float *restrict matResult) {
@@ -65,16 +61,16 @@ void matmul(uint64_t M, uint64_t K, uint64_t N,
```

In this Learning Path, the matrices are laid out in memory as contiguous
sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order).
The `matmul` function performs the algorithm described above.

The pointers to `matLeft`, `matRight`, and `matResult` have been annotated
as `restrict`, which informs the compiler that the memory areas designated by
those pointers do not alias. This means that they do not overlap in any way, so
the compiler does not need to insert extra instructions to deal with such
cases. The pointers to `matLeft` and `matRight` are marked as `const`, as
neither of these two matrices is modified by `matmul`.

You now have a reference standard matrix multiplication function. You will use
it later on in this Learning Path to ensure that the assembly version and the
intrinsics version of the multiplication algorithm do not contain errors.
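The diff above elides the loop nest of `matmul`, so here is a hedged reconstruction of the textbook triple loop the text describes. It is illustrative only and may differ in detail from the actual `matmul_vanilla.c`; the function name is changed to make that explicit.

```cpp
#include <cstdint>

// Illustrative reconstruction of the textbook algorithm: each output element
// C[m][n] is the sum over k of A[m][k] * B[k][n], with all matrices stored
// contiguously in row-major order (element (r, c) of an R x C matrix is at
// index r * C + c).
void matmul_reference(uint64_t M, uint64_t K, uint64_t N,
                      const float *matLeft, const float *matRight,
                      float *matResult) {
    for (uint64_t m = 0; m < M; m++) {
        for (uint64_t n = 0; n < N; n++) {
            float acc = 0.0f;
            for (uint64_t k = 0; k < K; k++)
                acc += matLeft[m * K + k] * matRight[k * N + n];
            matResult[m * N + n] = acc;
        }
    }
}
```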
