content/learning-paths/cross-platform/sme2/6-SME2-matmul-intr.md
Lines changed: 9 additions & 19 deletions
@@ -25,9 +25,7 @@ is supported by the main compilers, most notably [GCC](https://gcc.gnu.org/) and
25
25
26
26
## Streaming mode
27
27
28
-
In the previous page, the assembly language gave the programmer full access to the processor features. However, this comes at a cost in terms of complexity and
29
-
maintenance, especially when one has to manage large code bases with deeply-nested function calls. The assembly version is very low level, and does not deal
30
-
fully with the SME state.
28
+
On the previous page, assembly language provided the programmer with full access to processor features. However, this comes at the cost of increased complexity and maintenance, particularly when managing large codebases with deeply nested function calls. Additionally, the assembly version operates at a very low level and does not fully handle the SME state.
31
29
32
30
In real-world large-scale software, the program moves back and forth from streaming mode, and some streaming mode routines call other streaming mode routines, which means that some state needs to be saved and restored. This includes the ZA storage. This is defined in the ACLE and
33
31
supported by the compiler: the programmer *just* has to annotate the function
@@ -216,12 +214,9 @@ The core of ``preprocess_l_intr`` is made of two parts:
216
214
again the usage of predicates ``p0`` and ``p1`` (computed at lines 43-44) to
217
215
``svst1`` to prevent writing out of the matrix bounds.
218
216
219
-
As you can see, the usage of intrinsics greatly simplifies the writing of a
220
-
function once one has a good understanding of the available instructions in the
221
-
SME2 instruction set. The usage of predicates, which are at the core of SVE and
222
-
SME and allows to express an algorithm almost naturally and deal elegantly with
223
-
the corner cases (you will note that there is no explicit testing in the loops
224
-
for the cases where the rows or columns are outside of the matrix bounds).
217
+
Using intrinsics greatly simplifies writing a function, provided you have a good understanding of the SME2 instruction set.
218
+
Predicates, which are fundamental to SVE and SME, enable a natural expression of algorithms while handling corner cases efficiently.
219
+
Notably, there is no explicit condition checking within the loops to account for rows or columns extending beyond matrix bounds.
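The role of predicates described above can be illustrated without SME hardware. The following is a minimal sketch in portable C that emulates the idea, not the actual ``svst1`` or ``svld1`` intrinsics: ``LANES``, ``make_predicate``, ``predicated_store`` and ``copy_predicated`` are hypothetical names introduced for illustration only. The ``if (pred[i])`` tests stand in for what the hardware does per lane, so the caller's loop needs no separate tail handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 4 /* hypothetical vector length, in elements (assumption) */

/* Emulates a "while less-than" predicate: lane i is active when
   base + i is still inside the valid range [0, limit). */
static void make_predicate(bool pred[LANES], size_t base, size_t limit) {
    for (size_t i = 0; i < LANES; i++)
        pred[i] = (base + i) < limit;
}

/* Emulates a predicated store: inactive lanes leave memory untouched. */
static void predicated_store(float *dst, const float src[LANES],
                             const bool pred[LANES]) {
    for (size_t i = 0; i < LANES; i++)
        if (pred[i])
            dst[i] = src[i];
}

/* Copies n floats, LANES at a time. The last, partial block is handled
   by the predicate rather than by a separate scalar tail loop. */
static void copy_predicated(float *dst, const float *src, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t base = 0; base < n; base += LANES) {
        bool pred[LANES];
        float v[LANES] = {0};
        make_predicate(pred, base, n);
        /* Emulated predicated load: inactive lanes stay zero. */
        for (size_t i = 0; i < LANES; i++)
            if (pred[i])
                v[i] = src[base + i];
        predicated_store(dst + base, v, pred);
    }
}
```

Calling ``copy_predicated(dst, src, 6)`` with ``LANES`` set to 4 runs two iterations; in the second one only the first two lanes are active, so nothing is read or written past element 5.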
225
220
226
221
### Outer-product multiplication
227
222
@@ -304,12 +299,11 @@ The core of the multiplication is done in 2 parts:
304
299
are loaded with the ``svld1`` intrinsics at line 20-23 to vector registers
305
300
``zL`` and ``zR``, which are then used at line 24 with the ``svmopa_za32_m``
306
301
intrinsic to perform the outer product and accumulation (to tile 0). This
307
-
corresponds exactly to what you saw in figure 2 earlier in the learning path.
302
+
is exactly what was shown in Figure 2 earlier in the Learning Path.
308
303
Note again the usage of the ``pMDim`` and ``pNDim`` predicates to deal
309
304
correctly with the rows and columns respectively which are out of bounds.
310
305
311
-
- Storing of the result matrix at lines 27-46. The previous part has computed
312
-
the result of the matrix multiplication for the current tile, which now needs
306
+
- Storing of the result matrix at lines 27-46. The previous section computed the matrix multiplication result for the current tile, which now needs
313
307
to be written back to memory. This is done with the loop at line 29 which will
314
308
iterate over all rows of the tile: the ``svst1_hor_za32`` intrinsic at lines
315
309
35-46 stores directly from the tile to memory. Note that the loop has been
@@ -318,13 +312,9 @@ The core of the multiplication is done in 2 parts:
318
312
gracefully with the parts of the tile which are out-of-bound for the
319
313
destination matrix ``matResult``.
320
314
321
-
Once again you will note that the usage of the intrinsics made it easy to take
322
-
advantage of the full power of SME2 --- once there is a good understanding of the
323
-
available SME2 instructions. The predicates deal elegantly with the corner
324
-
cases. And most importantly, our code will deal with different SVL from different
325
-
hardware implementations without having to be recompiled. It's the important
326
-
concept of *compile-once*/*run-everywhere*, plus the implementations that have
327
-
larger SVL will perform the computation faster (for the same binary).
315
+
Once again, intrinsics make it easy to take full advantage of SME2, provided you have a solid understanding of its available instructions.
316
+
Predicates handle corner cases elegantly, ensuring robust execution. Most importantly, the code adapts to different SVL values across various hardware implementations without requiring recompilation.
317
+
This follows the key principle of *compile-once*/*run-everywhere*: the same binary runs on every implementation, and implementations with a larger SVL perform the computation faster.
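To make the vector-length-agnostic idea concrete, here is a rough sketch in portable C rather than SME2 intrinsics. It accumulates the product as a sum of outer products over square tiles whose side ``svl`` is a run-time parameter standing in for the hardware's streaming vector length in elements. The function name ``matmul_outer_product`` and the row-major layout of all three matrices are assumptions made for this illustration; the bounds test plays the role of the ``pMDim`` and ``pNDim`` predicates.

```c
#include <assert.h>
#include <stddef.h>

/* C[M x N] += A[M x K] * B[K x N], all row-major (assumption).
   The result is built tile by tile; for each tile, every k contributes
   one outer product of a column slice of A and a row slice of B.
   `svl` is only known at run time, as on real hardware. */
static void matmul_outer_product(size_t M, size_t K, size_t N,
                                 const float *A, const float *B,
                                 float *C, size_t svl) {
    assert(svl > 0);
    for (size_t r0 = 0; r0 < M; r0 += svl)          /* tile row origin */
        for (size_t c0 = 0; c0 < N; c0 += svl)      /* tile column origin */
            for (size_t k = 0; k < K; k++)          /* one outer product per k */
                for (size_t i = 0; i < svl; i++)
                    for (size_t j = 0; j < svl; j++) {
                        size_t r = r0 + i, c = c0 + j;
                        /* Emulated predicate: skip lanes outside the matrix. */
                        if (r < M && c < N)
                            C[r * N + c] += A[r * K + k] * B[k * N + c];
                    }
}
```

Running the same function with ``svl`` set to 2 and then to 4 on a 3x3 result produces identical output; only the number of tiles changes. On real hardware the larger tile would be processed by fewer instructions, which is where the speed-up for the same binary comes from.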