ArmDeveloperEcosystem
diff --git a/‎content/learning-paths/cross-platform/loop-reflowing/_index.md‎
Lines changed: 41 additions & 0 deletions b/‎content/learning-paths/cross-platform/loop-reflowing/_index.md‎
diff --git a/‎content/learning-paths/cross-platform/loop-reflowing/_next-steps.md‎
Lines changed: 31 additions & 0 deletions b/‎content/learning-paths/cross-platform/loop-reflowing/_next-steps.md‎
diff --git a/‎content/learning-paths/cross-platform/loop-reflowing/_review.md‎
Lines changed: 44 additions & 0 deletions b/‎content/learning-paths/cross-platform/loop-reflowing/_review.md‎
diff --git a/‎content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md‎
Lines changed: 87 additions & 0 deletions b/‎content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md‎
diff --git a/‎content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md‎
Lines changed: 105 additions & 0 deletions b/‎content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md‎
@@ -0,0 +1,41 @@
+---
+title: Loop Reflowing/Autovectorization
+
+minutes_to_complete: 45
+
+who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers
+
+learning_objectives: 
+    - Learn how to modify loops in order to take advantage of autovectorization in compilers
+
+prerequisites:
+    - An Arm computer running Linux with a recent version of a compiler (Clang or GCC) installed
+
+author_primary: Konstantinos Margaritis
+
+### Tags
+skilllevels: Advanced
+subjects: Programming
+armips:
+    - Aarch64
+    - Armv8-a
+    - Armv9-a
+tools_software_languages:
+    - GCC
+    - Clang
+    - Coding
+operatingsystems:
+    - Linux
+shared_path: true
+shared_between:
+    - laptops-and-desktops
+    - servers-and-cloud-computing
+    - smartphones-and-mobile
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
@@ -0,0 +1,31 @@
+---
+next_step_guidance: You now have a good understanding of Autovectorization, when to use it and how.
+
+recommended_path: /learning-paths/servers-and-cloud-computing/top-down-n1/
+
+further_reading:
+    - resource:
+        title: An update on GNU performance
+        link: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/update-on-gnu-performance
+        type: blog
+    - resource:
+        title: Auto-Vectorization in LLVM
+        link: https://llvm.org/docs/Vectorizers.html
+        type: website
+    - resource:
+        title: GCC Autovectorization
+        link: https://hpac.cs.umu.se/teaching/sem-accg-16/slides/08.Schmitz-GGC_Autovec.pdf
+        type: documentation
+    - resource:
+        title: Auto-vectorization in GCC
+        link: https://gcc.gnu.org/projects/tree-ssa/vectorization.html
+        type: website
+
+
+# ================================================================================
+#       FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 21                  # set to always be larger than the content in this path, and one more than 'review'
+title: "Next Steps"         # Always the same
+layout: "learningpathall"   # All files under learning paths have this same wrapper
+---
@@ -0,0 +1,44 @@
+---
+review:
+    - questions:
+        question: >
+            Autovectorization is:
+        answers:
+            - The automatic generation of 3D vectors so that 3D applications/games run faster.
+            - Converting an array of numbers in C to an STL C++ vector object.
+            - The process where an algorithm is automatically vectorized by the compiler to use SIMD instructions.
+        correct_answer: 3
+        explanation: >
+            Vectorization converts a loop to use SIMD instructions and is traditionally done by hand. Autovectorization is when the compiler performs this conversion automatically, by detecting specific patterns in the loop that allow it to use SIMD instructions to increase performance.
+
+    - questions:
+        question: >
+            Can the compiler autovectorize all kinds of loops?
+        answers:
+            - No, only countable loops.
+            - All loops except loops with function calls.
+            - Yes, all of them.
+            - No, only a few kinds of loops are vectorizable based on specific conditions.
+        correct_answer: 4                   
+        explanation: >
+            Several requirements must be met before the compiler detects a loop as vectorizable. In particular, the loop has to be countable, mostly free of branches, free of function calls, and free of data dependencies between iterations.
+               
+    - questions:
+        question: >
+            The purpose of the `SDOT`/`UDOT` instructions on Arm is:
+        answers:
+            - To evaluate a dot product between 4 x 32-bit float elements in a vector.
+            - To change the position of the decimal point ('dot') in a floating-point number.
+            - To evaluate a sum of products of 4 x 8-bit signed/unsigned integers in each 32-bit element in the input vectors.
+        correct_answer: 3
+        explanation: >
+            For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums. For SVE, the `SDOT`/`UDOT` instructions also work on 16-bit signed/unsigned integers.
+
+
+# ================================================================================
+#       FIXED, DO NOT MODIFY
+# ================================================================================
+title: "Review"                 # Always the same title
+weight: 20                      # Set to always be larger than the content in this path
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+---
@@ -0,0 +1,87 @@
+---
+title: Autovectorization and restrict
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Autovectorization and restrict keyword
+
+You have already experienced some form of autovectorization by learning about the [`restrict` keyword in a previous Learning Path](https://learn.arm.com/learning-paths/cross-platform/restrict-keyword-c99/).
+The example used there is a classic textbook case that the compiler autovectorizes once `restrict` is added.
+
+Take the files you saved previously, compile them both, and compare the assembly output:
+
+```bash
+gcc -O2 addvec.c -o addvec
+gcc -O2 addvec_neon.c -o addvec_neon
+```
+
+Let's look at the assembly output of `addvec`:
+
+```as
+addvec:
+        mov     x3, 0
+.L2:
+        ldr     s0, [x1, x3, lsl 2]
+        ldr     s1, [x2, x3, lsl 2]
+        fadd    s0, s0, s1
+        str     s0, [x0, x3, lsl 2]
+        add     x3, x3, 1
+        cmp     x3, 100
+        bne     .L2
+        ret
+```
+
+Similarly, for the `addvec_neon` executable:
+
+```as
+addvec:
+        mov     x3, 0
+.L6:
+        ldr     q0, [x1, x3]
+        ldr     q1, [x2, x3]
+        fadd    v0.4s, v0.4s, v1.4s
+        str     q0, [x0, x3]
+        add     x3, x3, 16
+        cmp     x3, 400
+        bne     .L6
+        ret
+```
+
+The latter uses the Advanced SIMD/Neon form of `fadd`, with operands `v0.4s` and `v1.4s`, to perform the addition on 4 x 32-bit floating-point elements at a time.
+
+Let's try to add `restrict` to the output argument `C` in the first `addvec` function:
+
+```C
+void addvec(float *restrict C, float *A, float *B) {
+    for (size_t i=0; i < N; i++) {
+        C[i] = A[i] + B[i];
+    }
+}
+```
+
+Recompile and check the assembly output again:
+
+```as
+addvec:
+        mov     x3, 0
+.L2:
+        ldr     q0, [x1, x3]
+        ldr     q1, [x2, x3]
+        fadd    v0.4s, v0.4s, v1.4s
+        str     q0, [x0, x3]
+        add     x3, x3, 16
+        cmp     x3, 400
+        bne     .L2
+        ret
+```
+
+As you can see, the compiler has autovectorized this algorithm, and the output is identical to the hand-written function. Strictly speaking, you don't even need `restrict` for such a trivial loop, because it is autovectorized anyway once the optimization level in the compilation flags is high enough (`-O2` for clang, `-O3` for gcc). However, using `restrict` keeps the code simple and generates SIMD code similar to the hand-written version in `addvec_neon.c`.
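
Why `restrict` matters can be seen with a small experiment. Without it, the compiler has to assume that `C` may overlap `A` or `B`, in which case each iteration could depend on the previous one. The sketch below is illustrative only: `addvec_plain` and `overlap_demo` are hypothetical names that do not appear in the Learning Path sources, and the length is passed as a parameter instead of using `N`.

```c
#include <stddef.h>

/* Same loop as addvec, but without restrict and with an explicit length
 * (both are assumptions made for this illustration). */
void addvec_plain(float *C, const float *A, const float *B, size_t n) {
    for (size_t i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Calls addvec_plain with an output that overlaps the first input:
 * C points one element past A, so iteration i reads the value that
 * iteration i-1 has just written. buf must hold at least 5 floats. */
void overlap_demo(float *buf) {
    const float ones[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    for (size_t i = 0; i < 5; i++) {
        buf[i] = 1.0f;
    }
    addvec_plain(buf + 1, buf, ones, 4);
    /* buf is now {1, 2, 3, 4, 5}: a running sum, not four independent adds. */
}
```

A compiler that simply loaded four elements of `A` at once would produce `{1, 2, 2, 2, 2}` here, so for the unqualified version it needs either a runtime overlap check or a scalar fallback. `restrict` is the promise that such overlap never happens; calling a `restrict`-qualified function with overlapping buffers like these would be undefined behavior.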
+
+The reason lies in the way each compiler decides whether to autovectorize. For each candidate loop, the compiler weighs the possible performance gain against a cost model, which is affected by many parameters, including the optimization level given in the compilation flags. The cost model estimates how much the autovectorized code grows in size and whether the performance gain is enough to outweigh that growth. Based on this estimate, the compiler either uses the vectorized code or falls back to a 'safer' scalar implementation. This decision is not set in stone, however, and is constantly re-evaluated during compiler development.
+
+This analysis goes beyond the scope of this Learning Path. This was just one trivial example to demonstrate how autovectorization can be triggered by a flag.
+
+You will see some more advanced examples in the next sections.
@@ -0,0 +1,105 @@
+---
+title: Autovectorization and conditionals
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Autovectorization and conditionals 
+
+Previously we mentioned that compilers cannot autovectorize loops with branches. In this section, you will look at this in more detail: when the vectorizer can be enabled by adapting the loop, and when you have to modify the algorithm or write manually optimized code.
+
+### If/else/switch in loops
+
+Consider the following function, a modified form of the previous function that uses weighted coefficients for `A[i]`.
+
+```C
+void addvecweight(float *restrict C, float *A, float *B,
+                    size_t N, float weight) {
+    for (size_t i=0; i < N; i++) {
+        if (weight < 0.5f)
+            C[i] = A[i] + B[i];
+        else
+            C[i] = 1.5f*A[i] + 0.5f * B[i];
+    }
+}
+```
+
+You might be tempted to think that this loop cannot be vectorized. Such loops are not uncommon, and compilers have a difficult time recognizing the pattern and transforming them into vectorizable forms, when that is possible at all. However, this loop is vectorizable: the conditional is loop-invariant, so it can be moved out of the loop. Internally, the compiler transforms the loop into something like the following:
+
+```C
+void addvecweight(float *restrict C, float *A, float *B,
+                    size_t N, float weight) {
+    if (weight < 0.5f) {
+        for (size_t i=0; i < N; i++) {
+            C[i] = A[i] + B[i];
+        }
+    } else {
+        for (size_t i=0; i < N; i++) {
+            C[i] = 1.5f*A[i] + 0.5f * B[i];
+        }
+    }
+}
+```
+
+In essence, these are two different loops, and we know that the compiler can vectorize each of them. Both gcc and llvm can autovectorize this loop, but their output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.
+
+However, the following loop is not yet autovectorized by all compilers (llvm/clang autovectorizes this loop, but not gcc):
+
+```C
+void addvecweight2(float *restrict C, float *A, float *B,
+                    size_t N, float weight) {
+    for (size_t i=0; i < N; i++) {
+        if (A[i] < 0.5f)
+            C[i] = A[i] + B[i];
+        else
+            C[i] = 1.5f*A[i] + 0.5f * B[i];
+    }
+}
+```
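
One way to adapt such a loop, when a compiler does not vectorize the branchy form, is to compute both candidate results and pick one with a conditional expression, which maps naturally onto a SIMD compare-and-select. This is a minimal sketch rather than a guaranteed fix: `addvecweight2_select` is a hypothetical name, the unused `weight` parameter is dropped, and whether a given compiler version vectorizes it still depends on its cost model and flags.

```c
#include <stddef.h>

/* Branch-free variant of addvecweight2: both results are computed for
 * every element and the comparison only selects which one is stored. */
void addvecweight2_select(float *restrict C, const float *A, const float *B,
                          size_t N) {
    for (size_t i = 0; i < N; i++) {
        float plain    = A[i] + B[i];
        float weighted = 1.5f * A[i] + 0.5f * B[i];
        C[i] = (A[i] < 0.5f) ? plain : weighted;
    }
}
```

The trade-off is that every element now pays for both computations, so this only helps when the SIMD speedup outweighs the extra arithmetic. To confirm what the compiler actually did, inspect the assembly as in the earlier sections, or ask for a vectorization report (for example GCC's `-fopt-info-vec-missed` or Clang's `-Rpass=loop-vectorize` and `-Rpass-missed=loop-vectorize`).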
+
+The same applies to `switch` statements: the loop can be vectorized if the condition expression is loop-invariant, that is, if it does not depend on the loop variable or on the elements involved in each iteration.
+For this reason, we know that this loop is autovectorized:
+
+```C
+void addvecweight(float *restrict C, float *A, float *B,
+                    size_t N, int w) {
+    for (size_t i=0; i < N; i++) {
+        switch (w) {
+        case 1:
+            C[i] = A[i] + B[i];
+            break;
+        case 2:
+            C[i] = 1.5f*A[i] + 0.5f * B[i];
+            break;
+        default:
+            break;
+        }
+    }
+}
+```
+
+But this one is not:
+
+```C
+#define sign(x) (((x) > 0) ? 1 : (((x) < 0) ? -1 : 0))
+
+void addvecweight(float *restrict C, float *A, float *B,
+                    size_t N, int w) {
+    for (size_t i=0; i < N; i++) {
+        switch (sign(A[i])) {
+        case 1:
+            C[i] = 0.5f * A[i] + 1.5f * B[i];
+            break;
+        case -1:
+            C[i] = 1.5f * A[i] + 0.5f * B[i];
+            break;
+        default:
+            C[i] = A[i] + B[i];
+            break;
+        }
+    }
+}
+```
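
The three cases in this loop differ only in the two weights applied to `A[i]` and `B[i]`, so one possible rewrite selects the weights first and then performs a single multiply-add. The sketch below is an illustration under that assumption, not a guaranteed route to vectorization: `addvecweight_sign` is a hypothetical name, and the unused `w` parameter is dropped.

```c
#include <stddef.h>

/* Data-dependent switch rewritten as weight selection followed by one
 * arithmetic expression, so the loop body contains no branches. */
void addvecweight_sign(float *restrict C, const float *A, const float *B,
                       size_t N) {
    for (size_t i = 0; i < N; i++) {
        float a = A[i];
        /* Weights per case: positive -> (0.5, 1.5), negative -> (1.5, 0.5),
         * zero (the default case) -> (1.0, 1.0). */
        float wa = (a > 0.0f) ? 0.5f : ((a < 0.0f) ? 1.5f : 1.0f);
        float wb = (a > 0.0f) ? 1.5f : ((a < 0.0f) ? 0.5f : 1.0f);
        C[i] = wa * a + wb * B[i];
    }
}
```

Multiplying by `1.0f` leaves a value unchanged, so the default case still yields `A[i] + B[i]`. As with any such rewrite, check the generated assembly to see whether your compiler and flags actually produce SIMD code for it.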
+
+The cases you have seen so far are generic: they also work on architectures other than Arm. In the next section, you will see Arm-specific use cases for autovectorization.