ArmDeveloperEcosystem
diff --git a/content/learning-paths/cross-platform/loop-reflowing/_index.md b/content/learning-paths/cross-platform/loop-reflowing/_index.md
Lines changed: 7 additions & 8 deletions
diff --git a/content/learning-paths/cross-platform/loop-reflowing/_review.md b/content/learning-paths/cross-platform/loop-reflowing/_review.md
Lines changed: 1 addition & 1 deletion
diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md
Lines changed: 38 additions & 15 deletions
diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md
Lines changed: 19 additions & 10 deletions
diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md
Lines changed: 64 additions & 55 deletions
@@ -3,23 +3,22 @@ title: Loop Reflowing/Autovectorization
 
 minutes_to_complete: 45
 
-who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers
+who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers.
 
 learning_objectives: 
-    - Learn how to modify loops in order to take advantage of autovectorization in compilers
+    - Modify loops to take advantage of autovectorization in compilers
 
 prerequisites:
-    - An Arm computer running Linux OS and a recent version of compiler (Clang or GCC) installed
+    - An Arm computer running Linux and a recent version of Clang or the GNU compiler (gcc) installed.
 
 author_primary: Konstantinos Margaritis
 
 ### Tags
 skilllevels: Advanced
-subjects: Programming
+subjects: Performance and Architecture
 armips:
-    - Aarch64
-    - Armv8-a
-    - Armv9-a
+    - Neoverse
+    - Cortex-A
 tools_software_languages:
     - GCC
     - Clang
@@ -28,8 +27,8 @@ operatingsystems:
     - Linux
 shared_path: true
 shared_between:
-    - laptops-and-desktops
     - servers-and-cloud-computing
+    - laptops-and-desktops
     - smartphones-and-mobile
 
 
 
@@ -4,7 +4,7 @@ review:
         question: >
             Autovectorization is:
         answers:
-            - The automatic generation of 3D vectors so that 3D applications/games run faster.
+            - The automatic generation of 3D vectors so that 3D games run faster.
             - Converting an array of numbers in C to an STL C++ vector object.
             - The process where an algorithm is automatically vectorized by the compiler to use SIMD instructions.
         correct_answer: 3
 
@@ -1,26 +1,31 @@
 ---
-title: Autovectorization and restrict
+title: Autovectorization using the restrict keyword
 weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Autovectorization and restrict keyword
+You may have already experienced some form of autovectorization by reading [Understand the restrict keyword in C99](/learning-paths/cross-platform/restrict-keyword-c99/).
 
-You have already experienced some form of autovectorization by learning about the [`restrict` keyword in a previous Learning Path](https://learn.arm.com/learning-paths/cross-platform/restrict-keyword-c99/).
-Our example is a classic textbook example that the compiler will autovectorize simply by using `restrict`:
+The example in the previous section is a classic textbook example that the compiler will autovectorize by using `restrict`.
 
-Try the previously saved files, compile them both and compare the assembly output:
+Compile the previously saved files:
 
 ```bash
 gcc -O2 addvec.c -o addvec
 gcc -O2 addvec_neon.c -o addvec_neon
 ```
 
-Let's look at the assembly output of `addvec`:
+Generate the assembly output using:
 
-```as
+```bash
+objdump -D addvec 
+```
+
+The assembly output of the `addvec()` function is shown below:
+
+```output
 addvec:
         mov     x3, 0
 .L2:
@@ -34,9 +39,15 @@ addvec:
         ret
 ```
 
-Similarly, for the `addvec_neon` executable:
+Generate the assembly output for `addvec_neon` using:
+
+```bash
+objdump -D addvec_neon
+```
+
+The assembly output for the `addvec()` function from the `addvec_neon` executable is shown below:
 
-```as
+```output
 addvec:
         mov     x3, 0
 .L6:
@@ -50,9 +61,9 @@ addvec:
         ret
  ```
 
-The latter uses Advanced SIMD/Neon instructions `fadd` with operands `v0.4s`, `v1.4s` to perform calculations in 4 x 32-bit floating-point elements.
+The second example uses the Advanced SIMD/Neon instruction `fadd` with operands `v0.4s`, `v1.4s` to perform calculations in 4 x 32-bit floating-point elements.
 
-Let's try to add `restrict` to the output argument `C` in the first `addvec` function:
+Add the `restrict` keyword to the output argument `C` in the `addvec()` function in `addvec.c`:
 
 ```C
 void addvec(float *restrict C, float *A, float *B) {
@@ -63,8 +74,14 @@ void addvec(float *restrict C, float *A, float *B) {
 ```
 
 Recompile and check the assembly output again:
+```bash
+gcc -O2 addvec.c -o addvec
+objdump -D addvec
+```
+
+The assembly output for the `addvec` function is now: 
 
-```as
+```output
 addvec:
         mov     x3, 0
 .L2:
@@ -78,10 +95,16 @@ addvec:
         ret
  ```
 
-As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` in such a trivial loop as it will be autovectorized anyway when certain optimization levels are added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, the use of restrict simplifies the code and generates SIMD code similar to the hand written version in `addvec_neon.c`.
+As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function.
+
+Strictly speaking, you don't even need `restrict` in such a trivial loop, as it will be autovectorized anyway when certain optimization levels are added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, the use of `restrict` simplifies the code and generates SIMD code similar to the hand-written version in `addvec_neon.c`.
+
+The reason for this is related to how each compiler decides whether to use autovectorization or not. 
+
+For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags. 
 
-The reason for this is because of the way each compiler decides whether to use autovectorization or not. For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags. This cost model will estimate whether the autovectorized code grows in size and if the performance gains are enough to outweigh this increase in code size. Based on this estimation, the compiler will decide to use this vectorized code or fall back to a more 'safe' scalar implementation. This decision however is something that is not set in stone and is constantly reevaluated during compiler development.
+The cost model estimates how much the autovectorized code grows in size and whether the performance gains are enough to outweigh that increase. Based on this estimate, the compiler decides either to use the vectorized code or to fall back to a safer scalar implementation. This decision is not fixed, however, and is constantly reevaluated during compiler development.
 
-This analysis goes beyond the scope of this LP, this was just one trivial example to demonstrate how the autovectorization can be triggered by a flag.
+Compiler cost model analysis is beyond the scope of this Learning Path, but the example demonstrates how autovectorization can be triggered by a single keyword.
 
 You will see some more advanced examples in the next sections.
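The effect of `restrict` described in this file can be reproduced with a minimal sketch like the one below. The function names `addvec_plain` and `addvec_restrict` and the explicit length parameter `n` are assumptions made for illustration; they are not part of `addvec.c` from the Learning Path. Without `restrict`, the compiler has to assume that `C` may overlap `A` or `B`, so it must either keep the loop scalar or add a run-time overlap check. With `restrict`, the programmer promises there is no overlap.

```c
#include <stddef.h>

/* Hypothetical variant of addvec() with an explicit length n.
 * No restrict: the compiler must assume that writes to C[] may
 * change later reads from A[] or B[]. */
void addvec_plain(float *C, const float *A, const float *B, size_t n) {
    for (size_t i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Same loop, but every pointer is restrict-qualified: the caller
 * promises the three arrays do not overlap, so iterations are
 * independent and the loop is a straightforward SIMD candidate. */
void addvec_restrict(float *restrict C, const float *restrict A,
                     const float *restrict B, size_t n) {
    for (size_t i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}
```

Both functions compute the same result when the arrays do not overlap. To see which loops a compiler actually vectorized, GCC can print a report with `-fopt-info-vec` and Clang with `-Rpass=loop-vectorize`; the exact messages depend on the compiler version.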
@@ -6,11 +6,13 @@ weight: 5
 layout: learningpathall
 ---
 
-## Autovectorization and conditionals 
+In the previous section, you learned that compilers generally cannot autovectorize loops with branches.
 
-Previously we mentioned that compilers cannot autovectorize loops with branches. In this section, you will see that in more detail, when it is possible to enable the vectorizer in the compiler by adapting the loop and when it is required to modify the algorithm or write manually optimized code.
+In this section, you will see more examples of loops with branches.
 
-### If/else/switch in loops
+You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.
+
+### Loops with if/else/switch statements
 
 Consider the following function, a modified form of the previous function that uses weighted coefficients for `A[i]`.
 
@@ -26,7 +28,9 @@ void addvecweight(float *restrict C, float *A, float *B,
 }
 ```
 
-You might be tempted to think that this loop cannot be vectorized. Such loops are not that uncommon and compilers have a difficult time understanding the pattern and transforming them to vectorizeable forms, when it is possible. However, this is actually a vectorizable loop, as the conditional can actually be moved out of the loop, as this is a loop-invariant conditional. Essentially the compiler would transform -internally- the loop in something like the following:
+You might think that this loop cannot be vectorized. Such loops are not uncommon, and compilers have a difficult time understanding the pattern and transforming them into vectorizable forms. However, this loop is vectorizable because the conditional is loop-invariant and can be moved out of the loop.
+
+The compiler will internally transform the loop into something similar to the code below: 
 
 ```C
 void addvecweight(float *restrict C, float *A, float *B, size_t N) {
@@ -42,9 +46,11 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) {
 }
 ```
 
-which is in essence, two different loops and we know that the compiler can vectorize them. Both gcc and llvm can actually autovectorize this loop, but the output is slightly different, performance may actually vary depending on the flags used and the exact nature of the loop.
+These are two different loops that the compiler can vectorize. 
+
+Both GCC and Clang can autovectorize this loop, but the output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.
 
-However, the following loop is not yet autovectorized by all compilers (llvm/clang autovectorizes this loop, but not gcc):
+However, the loop below is autovectorized by Clang but not by GCC.
 
 ```C
 void addvecweight2(float *restrict C, float *A, float *B,
@@ -58,8 +64,9 @@ void addvecweight2(float *restrict C, float *A, float *B,
 }
 ```
 
-Similarly with `switch` statements, if the condition expression in loop-invariant, that is if it does not depend on the loop variable or the elements involved in each iteration.
-For this reason we know that this loop is actually autovectorized:
+The situation is similar with `switch` statements. If the condition expression is loop-invariant, that is, if it does not depend on the loop variable or the elements involved in each iteration, the loop can be autovectorized.
+
+This example is autovectorized:
 
 ```C
 void addvecweight(float *restrict C, float *A, float *B,
@@ -79,7 +86,7 @@ void addvecweight(float *restrict C, float *A, float *B,
 }
 ```
 
-But this one is not:
+This example is not autovectorized: 
 
 ```C
 #define sign(x) (x > 0) ? 1 : ((x < 0) ? -1 : 0)
@@ -102,4 +109,6 @@ void addvecweight(float *restrict C, float *A, float *B,
 }
 ```
 
-The cases you have seen so far are generic, they will work in other architectures besides Arm. In the next section, you will see Arm-specific usecases for autovectorization.
+The cases you have seen so far are generic and work the same way on any architecture.
+
+In the next section, you will see Arm-specific cases for autovectorization.
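The loop-invariant case discussed in this file can be written out by hand as a rough sketch. The names `scale_add_branchy` and `scale_add_hoisted`, the flag `w`, and the factor `2.0f` are hypothetical and chosen only for illustration; they do not come from the Learning Path sources. The first function tests `w` on every iteration, and the second hoists the test out of the loop, which is roughly the form a compiler may produce internally before vectorizing.

```c
#include <stddef.h>

/* Branch inside the loop. The flag w never changes while the loop
 * runs, so the condition is loop-invariant. */
void scale_add_branchy(float *restrict C, const float *A, const float *B,
                       size_t n, int w) {
    for (size_t i = 0; i < n; i++) {
        if (w == 1) {
            C[i] = 2.0f * A[i] + B[i];
        } else {
            C[i] = A[i] + B[i];
        }
    }
}

/* The same computation with the test hoisted out of the loop:
 * two branch-free loops, each a simple vectorization candidate. */
void scale_add_hoisted(float *restrict C, const float *A, const float *B,
                       size_t n, int w) {
    if (w == 1) {
        for (size_t i = 0; i < n; i++) {
            C[i] = 2.0f * A[i] + B[i];
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }
}
```

Writing the hoisted form by hand is usually unnecessary when the compiler already performs this transformation, but it can be a reasonable fallback when a vectorization report shows the branchy loop was left scalar.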
@@ -6,65 +6,70 @@ weight: 4
 layout: learningpathall
 ---
 
-## Autovectorization limits
+Autovectorization is not as easy as adding a keyword like `restrict` to the argument list.
 
-Autovectorization is not as easy as adding a flag like `restrict` in the arguments list. There are some requirements for autovectorization to be enabled, namely:
+There are some requirements for autovectorization to be enabled. Some of these requirements are shown below with examples.
 
-* The loops have to be countable
+#### Countable loops
 
-This means that the following can be vectorized:
+A countable loop is a loop where the number of iterations is known before the loop begins executing.
+
+This means the following loop can be vectorized:
 
 ```C
-    for (size_t i=0; i < N; i++) {
-        C[i] = A[i] + B[i];
-    }
+for (size_t i=0; i < N; i++) {
+    C[i] = A[i] + B[i];
+}
 ```
 
-but this one cannot be vectorized:
+This loop is not countable and cannot be vectorized:
 
 ```C
-    i = 0;
-    while(true) {
-        C[i] = A[i] + B[i];
-        i++;
-        if (condition) break;
-    }
+i = 0;
+while(1) {
+    C[i] = A[i] + B[i];
+    i++;
+    if (condition) break;
+}
 ```
 
-Having said that, if condition is such that the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable. For example, this loop will *actually be vectorized*:
+If the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable. 
+
+For example, this loop is vectorizable:
 
 ```C
-    i = 0;
-    while(1) {
-        C[i] = A[i] + B[i];
-        i++;
-        if (i >= N) break;
-    }
+i = 0;
+while(1) {
+    C[i] = A[i] + B[i];
+    i++;
+    if (i >= N) break;
+}
 ```
-but this one will not be vectorizable:
+
+This loop is not vectorizable:
 
 ```C
-    i = 0;
-    while(1) {
-        C[i] = A[i] + B[i];
-        i++;
-        if (C[i] > 0) break;
-    }
+i = 0;
+while(1) {
+    C[i] = A[i] + B[i];
+    i++;
+    if (C[i] > 0) break;
+}
 ```
 
-* No function calls inside the loop
+#### No function calls inside the loop
 
-For example if, `f()`, `g()` are functions that take `float` arguments, this loop cannot be autovectorized:
+If `f()` and `g()` are functions that take `float` arguments, this loop cannot be autovectorized:
 
 ```C
-    for (size_t i=0; i < N; i++) {
-        C[i] = f(A[i]) + g(B[i]);
-    }
+for (size_t i=0; i < N; i++) {
+    C[i] = f(A[i]) + g(B[i]);
+}
 ```
 
-There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is progress underway to enable these functions to be autovectorized, as the compiler will be able to use their vectorized counterparts in `mathvec` library (`libmvec`).
+The math library trigonometric and transcendental functions (such as `sin`, `cos`, and `exp`) are a special case. There is work underway to enable loops that call these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).
 
-So for example, something like the following is actually *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to compilation flags to enable such autovectorization):
+The loop below is *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to the compilation flags to enable autovectorization):
 
 ```C
 void addfunc(float *restrict C, float *A, float *B, size_t N) {
@@ -74,38 +79,42 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) {
 }
 ```
 
-This will be in gcc 14 and require a new glibc as well (2.39). Until these are released, if you are using a released compiler as part of a distribution (gcc 13.2 at the time of writing), you will have to manually vectorize such code for performance.
+This feature will be in gcc 14 and also requires a new glibc (version 2.39). Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
 
-We will expand on autovectorization of conditionals in the next section.
+There is more about autovectorization of conditionals in the next section.
 
-* In general, no branches in the loop, no if/else/switch
+#### No branches in the loop and no if/else/switch statements
 
-This is not universally true, there are cases where branches can actually be vectorized, we will expand this in the next section.
-And in the case of SVE/SVE2 on Arm, predicates will actually make this easier and remove or minimize these limitations at least in some cases. There is currently work in progress on the compiler front to enable the use of predicates in such loops. We will probably return with a new LP to explain SVE/SVE2 autovectorization and predicates in more depth.
+This is not universally true; there are cases where branches can actually be vectorized.
 
-* Only inner-most loops will be vectorized.
+In the case of SVE/SVE2 on Arm, predicates will actually make this easier and remove or minimize these limitations at least in some cases. There is currently work in progress to enable the use of predicates in such loops. SVE/SVE2 autovectorization and predicates is a good topic for a future Learning Path. 
 
-To clarify, consider the following nested loop:
+There is more information on this in the next section.
+
+#### Only innermost loops are vectorized
+
+Consider the following nested loop:
 
 ```C
-    for (size_t i=0; i < N; i++) {
-        for (size_t j=0; j < M; j++) {
-           C[i][j] = A[i][j] + B[i][j];
-        }
+for (size_t i=0; i < N; i++) {
+    for (size_t j=0; j < M; j++) {
+       C[i][j] = A[i][j] + B[i][j];
     }
+}
 ```
 
-In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable). 
-In fact, there are some cases where outer loop types are also autovectorized, but these are outside the scope of this LP.
+In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+
+There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path.
+
+#### No data inter-dependency between iterations
 
-* No data inter-dependency between iterations
+A data inter-dependency means that an iteration depends on the result of a previous iteration. Such a loop is difficult, but not impossible, to autovectorize.
 
-This means that each iteration depends on the result of the previous iteration. Such a problem is difficult -but not impossible- to autovectorize. Consider the following example:
+The loop below cannot be autovectorized as it is. 
 
 ```C
-    for (size_t i=1; i < N; i++) {
-        C[i] = A[i] + B[i] + C[i-1];
-    }
+for (size_t i=1; i < N; i++) {
+    C[i] = A[i] + B[i] + C[i-1];
+}
 ```
-
-This example cannot be autovectorized as it is.
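One possible way to make part of the dependent loop above vectorizable is to split it in two (loop fission). The sketch below is only an illustration under stated assumptions: the function names `running_sum_original` and `running_sum_split` and the length parameter `n` are hypothetical, and the arrays are assumed not to overlap. The first loop of the split version has no dependency between iterations and is a vectorization candidate; the second loop still carries the dependency on `C[i-1]` and stays serial.

```c
#include <stddef.h>

/* The dependent loop as written in the text: every iteration
 * needs C[i-1] from the previous iteration. */
void running_sum_original(float *restrict C, const float *A,
                          const float *B, size_t n) {
    for (size_t i = 1; i < n; i++) {
        C[i] = A[i] + B[i] + C[i - 1];
    }
}

/* Loop fission: same arithmetic in the same order, split in two. */
void running_sum_split(float *restrict C, const float *A,
                       const float *B, size_t n) {
    /* Independent part: no iteration reads another iteration's result. */
    for (size_t i = 1; i < n; i++) {
        C[i] = A[i] + B[i];
    }
    /* Dependent part: the recurrence on C[i-1] remains serial. */
    for (size_t i = 1; i < n; i++) {
        C[i] = C[i] + C[i - 1];
    }
}
```

Both versions leave `C[0]` untouched and use it as the starting value. Whether the split is faster depends on the array size and the target, because it adds a second pass over `C`.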