Commit 2e16b8d

tweaks
1 parent b4a415d commit 2e16b8d

2 files changed

+33
-88
lines changed

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md

Lines changed: 26 additions & 73 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,7 @@ In this section, you will write an SME2-optimized matrix multiplication routine
1515
You can learn more about intrinsics in this [Wikipedia
1616
Article on Intrinsic Function](https://en.wikipedia.org/wiki/Intrinsic_function).
1717

18-
Using intrinsics allows the programmer to use the specific instructions required
19-
to achieve the required performance while writing in C all the typically-required standard code, such as loops. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.
18+
Using intrinsics allows you to write performance-critical code in C while still using standard constructs like loops. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.
2019

2120
All Arm-specific intrinsics are specified in the
2221
[ACLE](https://github.com/ARM-software/acle), which is the Arm C Language Extension. ACLE
@@ -25,8 +24,7 @@ is supported by the main compilers, most notably [GCC](https://gcc.gnu.org/) and
2524

2625
## Implementation
2726

28-
Here again, a top level function named `matmul_intr` in `matmul_intr.c`
29-
will be used to stitch together the preprocessing and the multiplication:
27+
In this example, a top-level function named `matmul_intr`, defined in `matmul_intr.c`, brings together the preprocessing and matrix multiplication steps:
3028

3129
```C "{ line_numbers = true }"
3230
__arm_new("za") __arm_locally_streaming void matmul_intr(
@@ -39,9 +37,7 @@ __arm_new("za") __arm_locally_streaming void matmul_intr(
3937
}
4038
```
4139
42-
Note the `__arm_new("za")` and `__arm_locally_streaming` at line 1 that will
43-
make the compiler save the ZA storage so we can use it without destroying its
44-
content if it was still in use by one of the callers.
40+
Note the use of `__arm_new("za")` and `__arm_locally_streaming` at line 1. These attributes ensure that the compiler saves the ZA storage, allowing your function to use it safely without destroying its content if it was still in use by one of the callers.
4541
4642
`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library.
4743
@@ -106,50 +102,21 @@ void preprocess_l_intr(
106102

107103
Note that `preprocess_l_intr` has been annotated at line 3 with:
108104

109-
- `__arm_streaming`, because this function is using streaming instructions,
105+
- `__arm_streaming` - because this function is using streaming instructions
110106

111-
- `__arm_inout("za")`, because `preprocess_l_intr` reuses the ZA storage
112-
from its caller.
107+
- `__arm_inout("za")` - because `preprocess_l_intr` reuses the ZA storage
108+
from its caller
113109

114-
The matrix preprocessing is performed in a double nested loop, over the `M`
115-
(line 7) and `K` (line 12) dimensions of the input matrix `a`. Both loops
116-
have an `SVL` step increment, which corresponds to the horizontal and vertical
117-
dimensions of the ZA storage that will be used. The dimensions of `a` may not
118-
be perfect multiples of `SVL` though... which is why the predicates `pMDim`
119-
(line 9) and `pKDim` (line 14) are computed in order to know which rows
120-
(respectively columns) are valid.
110+
The matrix preprocessing is performed in a double-nested loop, over the `M` (line 7) and `K` (line 12) dimensions of the input matrix `a`. Both loops have an `SVL` step increment, which corresponds to the horizontal and vertical dimensions of the ZA storage that will be used. The dimensions of `a` might not be perfect multiples of `SVL` however, which is why the predicates `pMDim`
111+
(line 9) and `pKDim` (line 14) are computed in order to know which rows (respectively columns) are valid.
121112

122113
The core of `preprocess_l_intr` is made of two parts:
123114

124-
- Lines 17 - 37: load matrix tile as rows. In this part, loop unrolling has been
125-
used at 2 different levels. At the lowest level, 4 rows are loaded at a time
126-
(lines 24-27). But this goes much further because as SME2 has multi-vectors
127-
operations (hence the `svld1_x2` intrinsic to load 2 rows in 2 vector
128-
registers), this allows the function to load the consecutive row, which
129-
happens to be the row from the neighboring tile on the right: this means two
130-
tiles are processed at once. At line 29-32, the pairs of vector registers are
131-
rearranged on quads of vector registers so they can be stored horizontally in
132-
the two tiles' ZA storage at lines 33-36 with the `svwrite_hor_za32_f32_vg4`
133-
intrinsic. Of course, as the input matrix may not have dimensions that are
134-
perfect multiples of `SVL`, the `p0`, `p1`, `p2` and `p3` predicates
135-
are computed with the `svpsel_lane_c32` intrinsic (lines 18-21) so that
136-
elements outside of the input matrix are set to 0 when they are loaded at
137-
lines 24-27.
138-
139-
- Lines 39 - 51: read the matrix tile as columns and store them. Now that the 2
140-
tiles have been loaded *horizontally*, they will be read *vertically* with the
141-
`svread_ver_za32_f32_vg4` intrinsic to quad-registers of vectors (`zq0`
142-
and `zq1`) at lines 45-48 and then stored with the `svst1` intrinsic to
143-
the relevant location in the destination matrix `a_mod` (lines 49-50). Note
144-
again the usage of predicates `p0` and `p1` (computed at lines 43-44) to
145-
`svst1` to prevent writing out of the matrix bounds.
146-
147-
Using intrinsics simplifies function development significantly, provided one has
148-
a good understanding of the SME2 instruction set. Predicates, which are
149-
fundamental to SVE and SME, enable a natural expression of algorithms while
150-
handling corner cases efficiently. Notably, there is no explicit condition
151-
checking within the loops to account for rows or columns extending beyond matrix
152-
bounds.
115+
- Lines 17 - 37: load matrix tile as rows. In this part, loop unrolling has been used at two different levels. At the lowest level, 4 rows are loaded at a time (lines 24-27). SME2 also provides multi-vector operations: the `svld1_x2` intrinsic loads 2 consecutive vectors into 2 vector registers, so each load also fetches the matching row of the neighboring tile on the right, which means two tiles are processed at once. At lines 29-32, the pairs of vector registers are rearranged into quads of vector registers so they can be stored horizontally in the two tiles' ZA storage at lines 33-36 with the `svwrite_hor_za32_f32_vg4` intrinsic. Because the input matrix might not have dimensions that are perfect multiples of `SVL`, the `p0`, `p1`, `p2` and `p3` predicates are computed with the `svpsel_lane_c32` intrinsic (lines 18-21) so that elements outside of the input matrix are set to 0 when they are loaded at lines 24-27.
116+
117+
- Lines 39 - 51: read the matrix tile as columns and store them. Now that the two tiles have been loaded *horizontally*, they will be read *vertically* with the `svread_ver_za32_f32_vg4` intrinsic into quads of vector registers (`zq0` and `zq1`) at lines 45-48 and then stored with the `svst1` intrinsic to the relevant location in the destination matrix `a_mod` (lines 49-50). Note again that predicates `p0` and `p1` (computed at lines 43-44) are passed to `svst1` to prevent writing out of the matrix bounds.
118+
119+
Using intrinsics simplifies function development, provided you have a good understanding of the SME2 instruction set. Predicates, which are fundamental to both SVE and SME, allow you to express algorithms cleanly while handling corner cases efficiently. Notably, the loops include no explicit condition checks for rows or columns that extend beyond matrix bounds.
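
To make the role of the predicates more concrete, here is a minimal scalar sketch in plain C. It is not the Learning Path's `preprocess_l_intr`: it uses no SME2 intrinsics, and it writes a plain transpose rather than the `a_mod` layout. The names `TILE`, `make_pred` and `transpose_tiled` are hypothetical, and `TILE` is a fixed stand-in for `SVL`. The sketch only models the idea that a per-lane mask zeroes out-of-bounds elements on load and suppresses out-of-bounds writes on store. The per-lane tests written out here are what the hardware applies implicitly through the predicate.

```c
#include <stdbool.h>
#include <stddef.h>

#define TILE 4 /* hypothetical stand-in for SVL; real code queries it at run time */

/* Scalar model of a predicate: lane i is active while base + i < limit. */
static void make_pred(bool pred[TILE], size_t base, size_t limit) {
    for (size_t i = 0; i < TILE; i++)
        pred[i] = (base + i) < limit;
}

/* Transpose the M x K row-major matrix `a` into the K x M matrix `out`,
   one TILE x TILE block at a time. */
void transpose_tiled(size_t M, size_t K, const float *a, float *out) {
    bool pM[TILE], pK[TILE];
    float tile[TILE][TILE];

    for (size_t m = 0; m < M; m += TILE) {
        make_pred(pM, m, M); /* which rows of this block are inside `a` */
        for (size_t k = 0; k < K; k += TILE) {
            make_pred(pK, k, K); /* which columns of this block are inside `a` */

            /* Load the block as rows: inactive lanes become 0 and are
               never read from memory. */
            for (size_t r = 0; r < TILE; r++)
                for (size_t c = 0; c < TILE; c++)
                    tile[r][c] = (pM[r] && pK[c]) ? a[(m + r) * K + (k + c)] : 0.0f;

            /* Read the block as columns and store it: inactive lanes are
               never written, so `out` is not touched outside its bounds. */
            for (size_t c = 0; c < TILE; c++)
                for (size_t r = 0; r < TILE; r++)
                    if (pM[r] && pK[c])
                        out[(k + c) * M + (m + r)] = tile[r][c];
        }
    }
}
```

Calling `transpose_tiled(3, 5, a, out)` on a 3 x 5 input exercises both partial-tile cases, because neither dimension is a multiple of `TILE`.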
153120

154121
### Outer-product multiplication
155122

@@ -208,33 +175,19 @@ void matmul_intr_impl(
208175
209176
Note again that `matmul_intr_impl` function has been annotated at line 4 with:
210177
211-
- `__arm_streaming`, because the function is using streaming instructions,
212-
213-
- `__arm_inout("za")`, because the function reuses the ZA storage from its caller.
214-
215-
The multiplication with the outer product is performed in a double-nested loop,
216-
over the `M` (line 7) and `N` (line 11) dimensions of the input matrices
217-
`matLeft_mod` and `matRight`. Both loops have an `SVL` step increment,
218-
which corresponds to the horizontal and vertical dimensions of the ZA storage
219-
that will be used as one tile at a time will be processed. The `M` and `N`
220-
dimensions of the inputs may not be perfect multiples of `SVL` so the
221-
predicates `pMDim` (line 9) (respectively `pNDim` at line 13) are computed in order
222-
to know which rows (respectively columns) are valid.
223-
224-
The core of the multiplication is done in 2 parts:
225-
226-
- Outer-product and accumulation at lines 15-25. As `matLeft` has been
227-
laid-out perfectly in memory with `preprocess_l_intr`, this part becomes
228-
straightforward. First, the tile is zeroed with the `svzero_za` intrinsics
229-
at line 16 so the outer products can be accumulated in the tile. The outer
230-
products are computed and accumulation over the `K` common dimension with
231-
the loop at line 19: the column of `matleft_mod` and the row of `matRight`
232-
are loaded with the `svld1` intrinsics at line 20-23 to vector registers
233-
`zL` and `zR`, which are then used at line 24 with the `svmopa_za32_m`
234-
intrinsic to perform the outer product and accumulation (to tile 0). This
235-
is exactly what was shown in Figure 2 earlier in the Learning Path.
236-
Note again the usage of the `pMDim` and `pNDim` predicates to deal
237-
correctly with the rows and columns respectively which are out of bounds.
178+
- `__arm_streaming`, because the function is using streaming instructions
179+
180+
- `__arm_inout("za")`, because the function reuses the ZA storage from its caller
181+
182+
The multiplication with the outer product is performed in a double-nested loop, over the `M` (line 7) and `N` (line 11) dimensions of the input matrices `matLeft_mod` and `matRight`. Both loops have an `SVL` step increment, which corresponds to the horizontal and vertical dimensions of the ZA storage that will be used, because one tile is processed at a time.
183+
184+
The `M` and `N` dimensions of the inputs might not be perfect multiples of `SVL`, so the predicates `pMDim` (line 9) and `pNDim` (line 13) are computed to identify which rows and columns, respectively, are valid.
185+
186+
The core of the multiplication is done in two parts:
187+
188+
- Outer-product and accumulation at lines 15-25. As `matLeft` has been laid out perfectly in memory with `preprocess_l_intr`, this part becomes straightforward. First, the tile is zeroed with the `svzero_za` intrinsics at line 16 so the outer products can be accumulated in the tile. The outer
189+
products are computed and accumulated over the `K` common dimension with the loop at line 19: the column of `matLeft_mod` and the row of `matRight` are loaded with the `svld1` intrinsic at lines 20-23 into vector registers `zL` and `zR`, which are then used at line 24 with the `svmopa_za32_m` intrinsic to perform the outer product and accumulation (to tile 0). This
190+
is exactly what was shown in Figure 2 earlier in the Learning Path. Note again the usage of the `pMDim` and `pNDim` predicates to deal correctly with the rows and columns respectively which are out of bounds.
238191
239192
- Storing of the result matrix at lines 27-46. The previous section computed the
240193
matrix multiplication result for the current tile, which now needs to be
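
The outer-product accumulation described in the first part can be sketched in plain C. This is a scalar illustration only, not the code of `matmul_intr_impl`: it uses no SME2 intrinsics and works on the whole matrix rather than on one ZA tile at a time. It also assumes the left matrix is stored simply transposed (`K` rows of `M` elements), which is a simplification and may differ from the actual `matLeft_mod` layout. The names `matmul_outer` and `a_t` are hypothetical.

```c
#include <stddef.h>

/* Compute c (M x N, row-major) as a sum of K outer products.
   a_t holds the left matrix transposed: row k of a_t is column k of the
   original M x K matrix. b is the K x N right matrix, row-major. */
void matmul_outer(size_t M, size_t K, size_t N,
                  const float *a_t, const float *b, float *c) {
    /* Zero the accumulator, as the tile is zeroed before accumulation. */
    for (size_t i = 0; i < M * N; i++)
        c[i] = 0.0f;

    for (size_t k = 0; k < K; k++) {
        const float *col = &a_t[k * M]; /* column k of the left matrix */
        const float *row = &b[k * N];   /* row k of the right matrix */

        /* Outer product of col and row, accumulated into c. */
        for (size_t m = 0; m < M; m++)
            for (size_t n = 0; n < N; n++)
                c[m * N + n] += col[m] * row[n];
    }
}
```

Because each step over `k` reads one contiguous column of the left matrix and one contiguous row of the right matrix, the sketch also shows why rearranging the left matrix beforehand makes this loop straightforward.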

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md

Lines changed: 7 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -64,19 +64,11 @@ Reference implementation: min time = 101 us, max time = 373 us, avg time = 136.4
6464
SME2 implementation *asm*: min time = 1 us, max time = 8 us, avg time = 1.44 us
6565
```
6666

67-
You can note that, although the vanilla reference matrix multiplication is the
68-
same, there is some variability in the execution time.
69-
70-
You'll also note that the assembly version of the SME2 matrix multiplication is
71-
slightly faster (1.44 us compared to 1.82 us for the intrinsic-based version).
72-
This *must not* convince you that assembly is better though! The comparison done
73-
here is far from being an apples-to-apples comparison:
74-
- Firstly, the assembly version has some requirements on the `K` parameter that
75-
the intrinsic version does not have.
76-
- Second, the assembly version has an optimization that the intrinsic version,
77-
for the sake of readability in this Learning Path, does not have (see the
78-
[Going further
67+
You'll notice that although the vanilla reference matrix multiplication is the same, there is some variability in the execution time.
68+
69+
The assembly version of the SME2 matrix multiplication runs slightly faster (1.44 us compared to 1.82 us for the intrinsic-based version). However, this should not convince you that assembly is inherently better. The comparison here is not apples-to-apples:
70+
- First, the assembly version has specific constraints on the `K` parameter that the intrinsic version does not.
71+
- Second, the assembly version includes an optimization that the intrinsic version, for the sake of readability in this Learning Path, does not have (see the [Going further
7972
section](/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further/)
80-
to know more).
81-
- Last, but not least, the intrinsic version is *easily* readable and
82-
maintainable.
73+
to learn more).
74+
- Most importantly, the intrinsic version is significantly more readable and maintainable. These are qualities that matter in real-world development.
