`content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md`
Some ideas of improvements that you might like to test out include:
So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well.
The structure of the algorithm (the core logic - tiling, outer product, and accumulation) remains consistent across data types. It uses preprocessing with tiling and outer product–based multiplication. To adapt it for other data types, you only need to change how values are:
`content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md`
In this section, you will write an SME2-optimized matrix multiplication routine
You can learn more about intrinsics in this [Wikipedia article on Intrinsic Function](https://en.wikipedia.org/wiki/Intrinsic_function).

Using intrinsics allows you to write performance-critical code in C while still using standard constructs like loops. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.
All Arm-specific intrinsics are specified in the [ACLE](https://github.com/ARM-software/acle), which is the Arm C Language Extension. ACLE is supported by the main compilers, most notably [GCC](https://gcc.gnu.org/) and
25
24
26
25
## Implementation
In this example, a top-level function named `matmul_intr`, defined in `matmul_intr.c`, brings together the preprocessing and matrix multiplication steps:
Note the use of `__arm_new("za")` and `__arm_locally_streaming` at line 1. These attributes ensure that the compiler saves the ZA storage, allowing your function to use it safely without destroying its content if it was still in use by one of the callers.
`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library.
Note that `preprocess_l_intr` has been annotated at line 3 with:
- `__arm_streaming`, because this function is using streaming instructions

- `__arm_inout("za")`, because `preprocess_l_intr` reuses the ZA storage from its caller
The matrix preprocessing is performed in a double-nested loop over the `M` (line 7) and `K` (line 12) dimensions of the input matrix `a`. Both loops have an `SVL` step increment, which corresponds to the horizontal and vertical dimensions of the ZA storage that will be used. The dimensions of `a` might not be perfect multiples of `SVL`, however, which is why the predicates `pMDim` (line 9) and `pKDim` (line 14) are computed in order to know which rows (respectively columns) are valid.
The core of `preprocess_l_intr` is made of two parts:
- Lines 17-37: load the matrix tile as rows. In this part, loop unrolling has been used at two different levels. At the lowest level, 4 rows are loaded at a time (lines 24-27). But it goes further: because SME2 has multi-vector operations (hence the `svld1_x2` intrinsic to load 2 rows into 2 vector registers), the function can also load the consecutive row, which happens to be the row from the neighboring tile on the right, meaning two tiles are processed at once. At lines 29-32, the pairs of vector registers are rearranged into quads of vector registers so they can be stored horizontally in the two tiles' ZA storage at lines 33-36 with the `svwrite_hor_za32_f32_vg4` intrinsic. Of course, as the input matrix might not have dimensions that are perfect multiples of `SVL`, the `p0`, `p1`, `p2`, and `p3` predicates are computed with the `svpsel_lane_c32` intrinsic (lines 18-21) so that elements outside of the input matrix are set to 0 when they are loaded at lines 24-27.

- Lines 39-51: read the matrix tile as columns and store them. Now that the two tiles have been loaded *horizontally*, they are read *vertically* with the `svread_ver_za32_f32_vg4` intrinsic into quad-registers of vectors (`zq0` and `zq1`) at lines 45-48, and then stored with the `svst1` intrinsic to the relevant location in the destination matrix `a_mod` (lines 49-50). Note again the usage of the predicates `p0` and `p1` (computed at lines 43-44) in `svst1` to prevent writing out of the matrix bounds.

Using intrinsics simplifies function development, provided you have a good understanding of the SME2 instruction set. Predicates, which are fundamental to both SVE and SME, allow you to express algorithms cleanly while handling corner cases efficiently. Notably, the loops include no explicit condition checks for rows or columns that extend beyond matrix bounds.
### Outer-product multiplication
Note again that the `matmul_intr_impl` function has been annotated at line 4 with:
- `__arm_streaming`, because the function is using streaming instructions

- `__arm_inout("za")`, because the function reuses the ZA storage from its caller

The multiplication with the outer product is performed in a double-nested loop over the `M` (line 7) and `N` (line 11) dimensions of the input matrices `matLeft_mod` and `matRight`. Both loops have an `SVL` step increment, which corresponds to the horizontal and vertical dimensions of the ZA storage, as one tile at a time is processed.

The `M` and `N` dimensions of the inputs might not be perfect multiples of `SVL`, so the predicates `pMDim` (line 9) and `pNDim` (line 13) are computed in order to know which rows (respectively columns) are valid.

The core of the multiplication is done in two parts:

- Outer-product and accumulation at lines 15-25. As `matLeft` has been laid out perfectly in memory by `preprocess_l_intr`, this part becomes straightforward. First, the tile is zeroed with the `svzero_za` intrinsic at line 16 so the outer products can be accumulated into the tile. The outer products are computed and accumulated over the `K` common dimension with the loop at line 19: the column of `matLeft_mod` and the row of `matRight` are loaded with the `svld1` intrinsics at lines 20-23 into vector registers `zL` and `zR`, which are then used at line 24 with the `svmopa_za32_m` intrinsic to perform the outer product and accumulation (into tile 0). This is exactly what was shown in Figure 2 earlier in the Learning Path. Note again the usage of the `pMDim` and `pNDim` predicates to deal correctly with the rows and columns, respectively, that are out of bounds.
- Storing of the result matrix at lines 27-46. The previous section computed the matrix multiplication result for the current tile, which now needs to be
`content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md`
Reference implementation: min time = 101 us, max time = 373 us, avg time = 136.4
SME2 implementation *asm*: min time = 1 us, max time = 8 us, avg time = 1.44 us
```
You'll notice that although the vanilla reference matrix multiplication is the same, there is some variability in the execution time.

The assembly version of the SME2 matrix multiplication runs slightly faster (1.44 us compared to 1.82 us for the intrinsic-based version). However, this should not convince you that assembly is inherently better. The comparison here is not apples-to-apples:
- First, the assembly version has specific constraints on the `K` parameter that the intrinsics version does not.
- Second, the assembly version includes an optimization that the intrinsic version, for the sake of readability in this Learning Path, does not have (see the [Going further
`content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md`
prerequisites:
- Intermediate proficiency with the C programming language and the Armv9-A assembly language
- A computer running Linux, macOS, or Windows
- Installations of Git and Docker for project setup and emulation
- A platform that supports SME2 - see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices) or an emulator to run code with SME2 instructions
- Compiler support for SME2 instructions (for example, LLVM 17+ with SME2 backend support)