
Commit ed5e75d

Merge pull request #697 from lizwar/Autovectorization
Autovectorization_editorial review complete_KB to sign off
2 parents 9f07067 + 6246d9a commit ed5e75d

File tree

8 files changed: +32 −32 lines changed


content/learning-paths/cross-platform/loop-reflowing/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loop Reflowing/Autovectorization
+title: Learn about Autovectorization
 
 draft: true
 
content/learning-paths/cross-platform/loop-reflowing/_next-steps.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ further_reading:
 link: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/update-on-gnu-performance
 type: blog
 - resource:
-title: Auto-Vectorization in LLVM
+title: Auto-Vectorization in LLVM
 link: https://llvm.org/docs/Vectorizers.html
 type: website
 - resource:

content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md

Lines changed: 3 additions & 3 deletions
@@ -103,8 +103,8 @@ The reason for this is related to how each compiler decides whether to use autov
 
 For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags.
 
-The cost model estimates whether the autovectorized code grows in size and if the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a more 'safe' scalar implementation. This decision however is fluid and is constantly reevaluated during compiler development.
+The cost model estimates whether the autovectorized code grows in size and if the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a more 'safe' scalar implementation. This decision, however, is fluid and is constantly reevaluated during compiler development.
 
-Compiler cost model analysis is beyond the scope of this Learning Path, but the example demonstrates how autovectorization can be triggered by a flag.
+Compiler cost model analysis is beyond the scope of this Learning Path but the above example demonstrates how autovectorization can be triggered by a flag.
 
-You will see some more advanced examples in the next sections.
+You will see some more advanced examples in the next sections.
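For context, the kind of loop this cost-model discussion applies to can be sketched as follows. This is a hypothetical illustration, not code from the commit: without `restrict` the compiler must assume the output may alias an input, so the cost model has to add a runtime overlap check or fall back to scalar code; with `restrict` it can vectorize unconditionally.

```c
#include <stddef.h>

/* Hypothetical sketch (not from the commit): restrict promises that C
 * does not alias A or B, which simplifies the vectorizer's cost model. */
void addvec_noalias(float *restrict C, const float *A, const float *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];  /* vectorized without an aliasing check */
}

/* Same loop without restrict: the compiler must guard against overlap. */
void addvec_mayalias(float *C, const float *A, const float *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```

Comparing the `-O3` assembly of the two functions is a quick way to see the cost model's aliasing decision in practice.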

content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md

Lines changed: 4 additions & 4 deletions
@@ -10,7 +10,7 @@ In the previous section, you learned that compilers cannot autovectorize loops w
 
 In this section, you will see more examples of loops with branches.
 
-You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.
+You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop and when you are required to modify the algorithm or write manually optimized code.
 
 ### Loops with if/else/switch statements
 
@@ -48,9 +48,9 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) {
 
 These are two different loops that the compiler can vectorize.
 
-Both GCC and Clang can autovectorize this loop, but the output is slightly different, performance may vary depending on the flags used and the exact nature of the loop.
+Both GCC and Clang can autovectorize this loop but the output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.
 
-However, the loop below is autovectorized by Clang but it is not autovectorized by GCC.
+The loop below is autovectorized by Clang but it is not autovectorized by GCC.
 
 ```C
 void addvecweight2(float *restrict C, float *A, float *B,
 
@@ -111,4 +111,4 @@ void addvecweight(float *restrict C, float *A, float *B,
 
 The cases you have seen so far are generic, they work the same for any architecture.
 
-In the next section, you will see Arm-specific cases for autovectorization.
+In the next section, you will see Arm-specific cases for autovectorization.
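The commit only shows the signature of `addvecweight()`, but the pattern it refers to, a loop whose `if/else` can be lowered to a vector select rather than a real branch, can be sketched like this. The weights (0.8/0.2) are invented for illustration and are not from the source file:

```c
#include <stddef.h>

/* Hypothetical reconstruction (body and weights are assumptions, not
 * from the commit): a branchy loop whose if/else the compiler can turn
 * into a per-lane select, making it autovectorizable. */
void addvecweight(float *restrict C, const float *A, const float *B, size_t N) {
    for (size_t i = 0; i < N; i++) {
        if (A[i] > B[i]) {
            C[i] = A[i] * 0.8f + B[i] * 0.2f;  /* taken lanes */
        } else {
            C[i] = A[i] * 0.2f + B[i] * 0.8f;  /* not-taken lanes */
        }
    }
}
```

Both sides of the branch are computed for every lane and the comparison mask picks the result, which is why this shape vectorizes while loops with side effects in the branches do not.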

content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md

Lines changed: 8 additions & 8 deletions
@@ -22,7 +22,7 @@ for (size_t i=0; i < N; i++) {
 }
 ```
 
-This loop is not countable and cannot be vectorized:
+But this loop is not countable and cannot be vectorized:
 
 ```C
 i = 0;
 
@@ -46,7 +46,7 @@ while(1) {
 }
 ```
 
-This loop is not vectorizable:
+But this loop is not vectorizable:
 
 ```C
 i = 0;
 
@@ -59,17 +59,17 @@ while(1) {
 
 #### No function calls inside the loop
 
-If `f()` and `g()` are functions that take `float` arguments this loop cannot be autovectorized:
+If `f()` and `g()` are functions that take `float` arguments, the loop cannot be autovectorized:
 
 ```C
 for (size_t i=0; i < N; i++) {
 C[i] = f(A[i]) + g(B[i]);
 }
 ```
 
-There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).
+There is a special case with the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).
 
-The loop below is *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to the compilation flags to enable autovectorization):
+The loop below is *already autovectorized* in current gcc trunk for Arm (note, you have to add `-Ofast` to the compilation flags to enable autovectorization):
 
 ```C
 void addfunc(float *restrict C, float *A, float *B, size_t N) {
 
@@ -79,7 +79,7 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) {
 }
 ```
 
-This feature will be in gcc 14 and require a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
+This feature will be in gcc 14 and requires a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
 
 There is more about autovectorization of conditionals in the next section.
 
@@ -105,11 +105,11 @@ for (size_t i=0; i < N; i++) {
 
 In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
 
-There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path.
+There are some cases where outer loop types are autovectorized but these are not covered in this Learning Path.
 
 #### No data inter-dependency between iterations
 
-This means that each iteration depends on the result of the previous iteration. This example is difficult, but not impossible to autovectorize.
+This means that each iteration depends on the result of the previous iteration. This example is difficult but not impossible to autovectorize.
 
 The loop below cannot be autovectorized as it is.

content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md

Lines changed: 5 additions & 5 deletions
@@ -72,7 +72,7 @@ dotprod:
 ret
 ```
 
-You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
+You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
 
 Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:
 
@@ -135,7 +135,7 @@ dotprod:
 b .L3
 ```
 
-The code is larger, but you can see that some autovectorization has taken place.
+The code is larger but you can see that some autovectorization has taken place.
 
 The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.
 
@@ -145,7 +145,7 @@ With the new code, you can expect a performance gain of about 4x.
 
 You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.
 
-The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
+The answer is *yes* but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
 
 Modify the `dotprod()` function to add the multiples of 4 hint as shown below:
 
@@ -195,7 +195,7 @@ Is there anything else the compiler can do?
 
 Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.
 
-For example, the `dotprod()` function operates on `int32_t` elements, what if you could limit the range to 8-bit?
+For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?
 
 There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
 
@@ -237,7 +237,7 @@ gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod
 
 You need to compile with the architecture flag to use the dot product instructions.
 
-The assembly output will be quite larger as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
+The assembly output will be quite large as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8 and byte-handling instructions if the size is smaller.
 
 You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
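The `dotprod()` function and the `N -= N % 4` hint these hunks discuss can be sketched as follows. The commit shows neither the full body nor the exact types, so this is a hypothetical reconstruction under the assumption of `int32_t` elements, as the surrounding prose states:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of dotprod() with the multiples-of-4
 * hint: truncating N to a multiple of 4 tells gcc the vectorized main
 * loop covers every iteration, so no scalar tail loop is emitted. */
int32_t dotprod(const int32_t *A, const int32_t *B, size_t N) {
    int32_t sum = 0;
    N -= N % 4;  /* the hint described in the text */
    for (size_t i = 0; i < N; i++) {
        sum += A[i] * B[i];  /* lowered to mla (or SDOT for 8-bit data) */
    }
    return sum;
}
```

Changing `N % 4` to `N % 16` widens the guarantee so the `SDOT` main loop (16 bytes per iteration) also needs no tail handling.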

content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md

Lines changed: 6 additions & 6 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
 
 The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.
 
-While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example:
+While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.
 
 Below is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs and used in calculating differences between video frames.
 
@@ -37,9 +37,9 @@ int main() {
 }
 ```
 
-A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is only for demonstration purposes*.
+A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.
 
-Save the code above to a file named `sadtest.c` and compile it:
+Save the above code to a file named `sadtest.c` and compile it:
 
 ```bash
 gcc -O3 -fno-inline sadtest.c -o sadtest
 
@@ -71,11 +71,11 @@ sad8:
 ret
 ```
 
-You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
+You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
 
 The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation.
 
-This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
+This would mean that 4x items at a time would be accumulated but, with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
 
 For completeness the SVE2 version will be provided, which does not depend on size being a multiple of 16.
 
@@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s
 
 Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.
 
-It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
+It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
 
 As with most tools, the better you know how to use it, the better the results will be.
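The `sad8()` kernel these hunks revolve around can be sketched like this. The commit shows only the assembly label and surrounding prose, so the body, types, and hint placement are assumptions consistent with the description (8-bit pixels, 32-bit accumulator, size a multiple of 16):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the sad8() Sum of Absolute Differences
 * kernel: 8-bit inputs widened into a 32-bit accumulator, the pattern
 * gcc lowers to SABDL2/SABAL/SADALP on Arm. */
uint32_t sad8(const uint8_t *a, const uint8_t *b, size_t n) {
    n -= n % 16;  /* demonstration-only hint: size is a multiple of 16 */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* absolute difference, computed branch-free on unsigned bytes */
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    }
    return sum;
}
```

The widening accumulation is the interesting part: it is exactly what `SADALP` (add and accumulate long pairwise) performs in one instruction.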

content/learning-paths/cross-platform/loop-reflowing/introduction-to-autovectorization.md

Lines changed: 4 additions & 4 deletions
@@ -8,15 +8,15 @@ layout: learningpathall
 
 ## Before you begin
 
-You should have an Arm Linux system with gcc installed. Refer to the [GNU compiler](/install-guides/gcc/native/) install guide for instructions. The examples use gcc as the compiler, but you can also use Clang.
+You should have an Arm Linux system with gcc installed. Refer to the [GNU compiler](/install-guides/gcc/native/) install guide for instructions. The examples use gcc as the compiler but you can also use Clang.
 
 ## Introduction to autovectorization
 
 CPU time is often spent executing code inside loops. Software that performs time-consuming calculations in image/video processing, games, scientific software, and AI, often revolves around a few loops doing most of the calculations.
 
 With the advent of single instruction, multiple data (SIMD) processing and vector engines in modern CPUs (like Neon and SVE), specialized instructions are available to improve the performance and efficiency of loops. However, the loops themselves need to be adapted to use SIMD instructions. The adaptation process is called *__vectorization__* and is synonymous with SIMD optimization.
 
-Depending on the actual loop and the operations involved, vectorization is possible or impossible and the loop is labeled as vectorizable or non-vectorizable.
+Depending on the actual loop and the operations involved, vectorization is either possible or not and the loop is labeled as vectorizable or non-vectorizable.
 
 Consider the following simple loop which adds 2 vectors:
 
@@ -41,7 +41,7 @@
 
 Use a text editor to copy the code above and save it as `addvec.c`.
 
-This is the most referred-to example with regards to vectorization, because it is easy to explain.
+This is the most referred-to example with regards to vectorization because it is easy to explain.
 
 For Advanced SIMD/Neon, the vectorized form is the following:
 
@@ -76,7 +76,7 @@ For many developers, vectorizing is a daunting task. Automating the process is o
 
 Autovectorization in compilers has been in development for the past 20 years. However, recent advances in both major compilers (Clang and GCC) have started to render autovectorization a viable alternative to hand-written SIMD code for more than just the basic loops. Some loop types are still not detected as autovectorizable, and it is not directly obvious which kinds of loops are autovectorizable and which are not.
 
-As a constantly advancing field, it is not easy to keep track of compiler support for autovectorization. It is an advanced Computer Science topic that involves the subjects of graph theory, compilers, and deep understanding of each architecture and the respective SIMD engines. The number of experts in the field is extremely small.
+As a constantly advancing field, it is not easy to keep track of compiler support for autovectorization. It is an advanced Computer Science topic that involves the subjects of graph theory, compilers, and a deep understanding of each architecture and the respective SIMD engines. The number of experts in the field is extremely small.
 
 In this Learning Path, you will learn about autovectorization through examples and identify how to adapt some loops to enable autovectorization.
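The "simple loop which adds 2 vectors" that `addvec.c` contains is only partially visible in the diff; a plausible reconstruction of the kernel (the canonical autovectorization example) looks like this:

```c
#include <stddef.h>

/* Plausible reconstruction of the addvec.c kernel described above (the
 * commit shows only fragments of the file): a scalar vector-add loop
 * that both gcc and Clang autovectorize at higher optimization levels. */
void addvec(float *C, float *A, float *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];  /* one element per iteration in scalar form */
}
```

At `-O3` the compiler replaces the scalar body with SIMD loads, a vector add, and a vector store, processing 4 floats per iteration on Neon.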
