Commit dcef42f

Further improvements.
1 parent 1270cea commit dcef42f

1 file changed: +9 -19 lines changed

content/learning-paths/cross-platform/sme2/6-SME2-matmul-intr.md

Lines changed: 9 additions & 19 deletions
@@ -25,9 +25,7 @@ is supported by the main compilers, most notably [GCC](https://gcc.gnu.org/) and
 
 ## Streaming mode
 
-In the previous page, the assembly language gave the programmer full access to the processor features. However, this comes at a cost in terms of complexity and
-maintenance, especially when one has to manage large code bases with deeply-nested function calls. The assembly version is very low level, and does not deal
-fully with the SME state.
+On the previous page, assembly language provided the programmer with full access to processor features. However, this comes at the cost of increased complexity and maintenance, particularly when managing large codebases with deeply nested function calls. Additionally, the assembly version operates at a very low level and does not fully handle the SME state.
 
 In real-world large-scale software, the program moves back and forth from streaming mode, and some streaming mode routines call other streaming mode routines, which means that some state needs to be saved and restored. This includes the ZA storage. This is defined in the ACLE and
 supported by the compiler: the programmer *just* has to annotate the function
@@ -216,12 +214,9 @@ The core of ``preprocess_l_intr`` is made of two parts:
 again the usage of predicates ``p0`` and ``p1`` (computed at lines 43-44) to
 ``svst1`` to prevent writing out of the matrix bounds.
 
-As you can see, the usage of intrinsics greatly simplifies the writing of a
-function once one has a good understanding of the available instructions in the
-SME2 instruction set. The usage of predicates, which are at the core of SVE and
-SME and allows to express an algorithm almost naturally and deal elegantly with
-the corner cases (you will note that there is no explicit testing in the loops
-for the cases where the rows or columns are outside of the matrix bounds).
+Using intrinsics simplifies function development significantly, provided one has a good understanding of the SME2 instruction set.
+Predicates, which are fundamental to SVE and SME, enable a natural expression of algorithms while handling corner cases efficiently.
+Notably, there is no explicit condition checking within the loops to account for rows or columns extending beyond matrix bounds.
 
 ### Outer-product multiplication
 
@@ -304,12 +299,11 @@ The core of the multiplication is done in 2 parts:
 are loaded with the ``svld1`` intrinsics at line 20-23 to vector registers
 ``zL`` and ``zR``, which are then used at line 24 with the ``svmopa_za32_m``
 intrinsic to perform the outer product and accumulation (to tile 0). This
-corresponds exactly to what you saw in figure 2 earlier in the learning path.
+is exactly what was shown in Figure 2 earlier in the Learning Path.
 Note again the usage of the ``pMDim`` and ``pNDim`` predicates to deal
 correctly with the rows and columns respectively which are out of bounds.
 
-- Storing of the result matrix at lines 27-46. The previous part has computed
-the result of the matrix multiplication for the current tile, which now needs
+- Storing of the result matrix at lines 27-46. The previous section computed the matrix multiplication result for the current tile, which now needs
 to be written back to memory. This is done with the loop at line 29 which will
 iterate over all rows of the tile: the ``svst1_hor_za32`` intrinsic at lines
 35-46 stores directly from the tile to memory. Note that the loop has been
@@ -318,13 +312,9 @@ The core of the multiplication is done in 2 parts:
 gracefully with the parts of the tile which are out-of-bound for the
 destination matrix ``matResult``.
 
-Once again you will note that the usage of the intrinsics made it easy to take
-advantage of the full power of SME2 --- once there is a good understanding of the
-available SME2 instructions. The predicates deal elegantly with the corner
-cases. And most importantly, our code will deal with different SVL from different
-hardware implementations without having to be recompiled. It's the important
-concept of *compile-once*/*run-everywhere*, plus the implementations that have
-larger SVL will perform the computation faster (for the same binary).
+Once again, intrinsics makes it easy to fully leverage SME2, provided you have a solid understanding of its available instructions.
+Predicates handle corner cases elegantly, ensuring robust execution. Most importantly, the code adapts to different SVL values across various hardware implementations without requiring recompilation.
+This follows the key principle of compile-once, run-everywhere, allowing systems with larger SVL to execute computations more efficiently while using the same binary.
 
 ### Compile and run

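
The changed paragraphs describe two ideas that a small example may make more concrete: predicates that mask out-of-bounds rows and columns, and code that gives the same result whatever the streaming vector length is. The following is a minimal sketch in plain scalar C, not the intrinsic code from the learning path. The function name `matmul_model`, the `lanes` parameter (standing in for the number of 32-bit elements per streaming vector), the `MAX_LANES` bound, and the `matLeft`/`matRight` names are assumptions made for illustration. The sketch also skips the left-matrix preprocessing step and reads plain row-major inputs. The `if (pRow && pCol)` test models the per-lane masking that the `pMDim` and `pNDim` predicates perform in hardware; the intrinsic version needs no such branch.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the modelled lane count; an assumption of this sketch only. */
#define MAX_LANES 16

/*
 * Scalar model of a tiled, predicated outer-product matrix multiplication.
 * `lanes` stands in for the number of 32-bit elements per streaming vector.
 * matLeft is M x K, matRight is K x N, matResult is M x N, all row-major.
 */
static void matmul_model(size_t M, size_t K, size_t N, size_t lanes,
                         const float *matLeft, const float *matRight,
                         float *matResult)
{
    assert(lanes > 0 && lanes <= MAX_LANES);

    for (size_t row0 = 0; row0 < M; row0 += lanes) {
        for (size_t col0 = 0; col0 < N; col0 += lanes) {
            /* Models one zero-initialised ZA tile. */
            float tile[MAX_LANES][MAX_LANES] = {{0.0f}};

            /* One outer product per k: column k of matLeft times row k of matRight. */
            for (size_t k = 0; k < K; k++) {
                for (size_t r = 0; r < lanes; r++) {
                    const int pRow = (row0 + r) < M;      /* models one pMDim lane */
                    for (size_t c = 0; c < lanes; c++) {
                        const int pCol = (col0 + c) < N;  /* models one pNDim lane */
                        if (pRow && pCol) {
                            tile[r][c] += matLeft[(row0 + r) * K + k] *
                                          matRight[k * N + (col0 + c)];
                        }
                    }
                }
            }

            /* Models the predicated store: only in-bounds lanes reach matResult. */
            for (size_t r = 0; r < lanes && row0 + r < M; r++) {
                for (size_t c = 0; c < lanes && col0 + c < N; c++) {
                    matResult[(row0 + r) * N + (col0 + c)] = tile[r][c];
                }
            }
        }
    }
}
```

Calling `matmul_model` on the same inputs with `lanes` set to 4 and then to 8 should produce identical result matrices, with nothing written past the `M x N` bounds. In this model a larger `lanes` value only reduces the number of tile iterations, which loosely mirrors the same-binary, different-SVL behaviour described in the diff.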