Made corrections based on Tamar's feedback

markos · markos · commit 895c3ec82d4d · 2024-01-14T19:46:19.000+02:00
diff --git a/content/learning-paths/cross-platform/loop-reflowing/_review.md b/content/learning-paths/cross-platform/loop-reflowing/_review.md
@@ -32,7 +32,7 @@ review:
             - To evaluate a sum of products of 4 x 8-bit signed/unsigned integers in each 32-bit element in the input vectors.
         correct_answer: 3
         explanation: >
-            For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums.
+            For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums. For SVE, the `SDOT`/`UDOT` instructions also work on 16-bit signed/unsigned integers.
 
 
 # ================================================================================
diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md
@@ -78,7 +78,7 @@ addvec:
         ret
  ```
 
-As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function!
+As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` in such a trivial loop, as it will be autovectorized anyway once the right optimization level is added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, using `restrict` simplifies the code and generates SIMD code similar to the hand-written version in `addvec_neon.c`.
 
 This is just a trivial example though and not all loops can be autovectorized that easily by the compiler. 
 
diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md
@@ -64,7 +64,7 @@ For example if, `f()`, `g()` are functions that take `float` arguments, this loo
 
 There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is progress underway to enable these functions to be autovectorized, as the compiler will be able to use their vectorized counterparts in `mathvec` library (`libmvec`).
 
-So for example, something like the following *will be autovectorized* in the future for Arm.
+So for example, something like the following is actually *already autovectorized* in the current gcc trunk for Arm (note that you have to add `-Ofast` to the compilation flags to enable such autovectorization):
 
 ```C
 void addfunc(float *restrict C, float *A, float *B, size_t N) {
@@ -81,6 +81,7 @@ We will expand on autovectorization of conditionals in the next section.
 * In general, no branches in the loop, no if/else/switch
 
 This is not universally true, there are cases where branches can actually be vectorized, we will expand this in the next section.
+In the case of SVE/SVE2 on Arm, predicates make this easier and remove or minimize these limitations, at least in some cases. Work is currently in progress on the compiler side to enable the use of predicates in such loops. We will probably return with a new LP to explain SVE/SVE2 autovectorization and predicates in more depth.
 
 * Only inner-most loops will be vectorized.
 
@@ -94,7 +95,8 @@ To clarify, consider the following nested loop:
     }
 ```
 
-In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+In fact, there are some cases where outer loops are also autovectorized, but these are outside the scope of this LP.
 
 * No data inter-dependency between iterations
 
diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md
@@ -183,7 +183,11 @@ int main() {
 }
 ```
 
-You need to recompile the code with `gcc -O3 -Wall -g -fno-inline -march=armv8-a+dotprod` in order to hint to the compiler that it has the new instructions at its disposal.
+You need to add `-march=armv8-a+dotprod` to the compilation flags to hint to the compiler that it has the new instructions at its disposal, that is:
+
+```bash
+gcc -O3 -Wall -g -fno-inline -march=armv8-a+dotprod
+```
 
 The assembly output will be quite larger as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. Then the compiler will unroll the loop to use ASIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
 You could eliminate those extra tail instructions by converting `N -= N % 4` to 8 or even 16: