
Commit ed5e75d

Merge pull request #697 from lizwar/Autovectorization
Autovectorization_editorial review complete_KB to sign off
2 parents 9f07067 + 6246d9a commit ed5e75d

File tree

8 files changed: +32 −32 lines changed


content/learning-paths/cross-platform/loop-reflowing/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loop Reflowing/Autovectorization
+title: Learn about Autovectorization
 
 draft: true
 
content/learning-paths/cross-platform/loop-reflowing/_next-steps.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ further_reading:
 link: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/update-on-gnu-performance
 type: blog
 - resource:
-title: Auto-Vectorization in LLVM
+title: Auto-Vectorization in LLVM
 link: https://llvm.org/docs/Vectorizers.html
 type: website
 - resource:

content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md

Lines changed: 3 additions & 3 deletions
@@ -103,8 +103,8 @@ The reason for this is related to how each compiler decides whether to use autov
 
 For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags.
 
-The cost model estimates whether the autovectorized code grows in size and if the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a more 'safe' scalar implementation. This decision however is fluid and is constantly reevaluated during compiler development.
+The cost model estimates whether the autovectorized code grows in size and if the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a more 'safe' scalar implementation. This decision, however, is fluid and is constantly reevaluated during compiler development.
 
-Compiler cost model analysis is beyond the scope of this Learning Path, but the example demonstrates how autovectorization can be triggered by a flag.
+Compiler cost model analysis is beyond the scope of this Learning Path but the above example demonstrates how autovectorization can be triggered by a flag.
 
-You will see some more advanced examples in the next sections.
+You will see some more advanced examples in the next sections.
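For context, the kind of loop this cost-model discussion applies to can be sketched as follows. This is a hypothetical illustration, not code from the commit: without `restrict` the compiler must assume the output may alias an input, so the cost model has to add a runtime overlap check or fall back to scalar code; with `restrict` it can vectorize unconditionally.

```c
#include <stddef.h>

/* Hypothetical sketch (not from the commit): restrict promises that C
 * does not alias A or B, which simplifies the vectorizer's cost model. */
void addvec_noalias(float *restrict C, const float *A, const float *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];  /* vectorized without an aliasing check */
}

/* Same loop without restrict: the compiler must guard against overlap. */
void addvec_mayalias(float *C, const float *A, const float *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```

Comparing the `-O3` assembly of the two functions is a quick way to see the cost model's aliasing decision in practice.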

content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md

Lines changed: 4 additions & 4 deletions
@@ -10,7 +10,7 @@ In the previous section, you learned that compilers cannot autovectorize loops w
 
 In this section, you will see more examples of loops with branches.
 
-You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.
+You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop and when you are required to modify the algorithm or write manually optimized code.
 
 ### Loops with if/else/switch statements
 
@@ -48,9 +48,9 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) {
 
 These are two different loops that the compiler can vectorize.
 
-Both GCC and Clang can autovectorize this loop, but the output is slightly different, performance may vary depending on the flags used and the exact nature of the loop.
+Both GCC and Clang can autovectorize this loop but the output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.
 
-However, the loop below is autovectorized by Clang but it is not autovectorized by GCC.
+The loop below is autovectorized by Clang but it is not autovectorized by GCC.
 
 ```C
 void addvecweight2(float *restrict C, float *A, float *B,
 
@@ -111,4 +111,4 @@ void addvecweight(float *restrict C, float *A, float *B,
 
 The cases you have seen so far are generic, they work the same for any architecture.
 
-In the next section, you will see Arm-specific cases for autovectorization.
+In the next section, you will see Arm-specific cases for autovectorization.
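The commit only shows the signature of `addvecweight()`, but the pattern it refers to, a loop whose `if/else` can be lowered to a vector select rather than a real branch, can be sketched like this. The weights (0.8/0.2) are invented for illustration and are not from the source file:

```c
#include <stddef.h>

/* Hypothetical reconstruction (body and weights are assumptions, not
 * from the commit): a branchy loop whose if/else the compiler can turn
 * into a per-lane select, making it autovectorizable. */
void addvecweight(float *restrict C, const float *A, const float *B, size_t N) {
    for (size_t i = 0; i < N; i++) {
        if (A[i] > B[i]) {
            C[i] = A[i] * 0.8f + B[i] * 0.2f;  /* taken lanes */
        } else {
            C[i] = A[i] * 0.2f + B[i] * 0.8f;  /* not-taken lanes */
        }
    }
}
```

Both sides of the branch are computed for every lane and the comparison mask picks the result, which is why this shape vectorizes while loops with side effects in the branches do not.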

content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md

Lines changed: 8 additions & 8 deletions
@@ -22,7 +22,7 @@ for (size_t i=0; i < N; i++) {
 }
 ```
 
-This loop is not countable and cannot be vectorized:
+But this loop is not countable and cannot be vectorized:
 
 ```C
 i = 0;
 
@@ -46,7 +46,7 @@ while(1) {
 }
 ```
 
-This loop is not vectorizable:
+But this loop is not vectorizable:
 
 ```C
 i = 0;
 
@@ -59,17 +59,17 @@ while(1) {
 
 #### No function calls inside the loop
 
-If `f()` and `g()` are functions that take `float` arguments this loop cannot be autovectorized:
+If `f()` and `g()` are functions that take `float` arguments, the loop cannot be autovectorized:
 
 ```C
 for (size_t i=0; i < N; i++) {
 C[i] = f(A[i]) + g(B[i]);
 }
 ```
 
-There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).
+There is a special case with the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).
 
-The loop below is *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to the compilation flags to enable autovectorization):
+The loop below is *already autovectorized* in current gcc trunk for Arm (note, you have to add `-Ofast` to the compilation flags to enable autovectorization):
 
 ```C
 void addfunc(float *restrict C, float *A, float *B, size_t N) {
 
@@ -79,7 +79,7 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) {
 }
 ```
 
-This feature will be in gcc 14 and require a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
+This feature will be in gcc 14 and requires a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
 
 There is more about autovectorization of conditionals in the next section.
 
@@ -105,11 +105,11 @@ for (size_t i=0; i < N; i++) {
 
 In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
 
-There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path.
+There are some cases where outer loop types are autovectorized but these are not covered in this Learning Path.
 
 #### No data inter-dependency between iterations
 
-This means that each iteration depends on the result of the previous iteration. This example is difficult, but not impossible to autovectorize.
+This means that each iteration depends on the result of the previous iteration. This example is difficult but not impossible to autovectorize.
 
 The loop below cannot be autovectorized as it is.

content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md

Lines changed: 5 additions & 5 deletions
@@ -72,7 +72,7 @@ dotprod:
 ret
 ```
 
-You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
+You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
 
 Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:
 
@@ -135,7 +135,7 @@ dotprod:
 b .L3
 ```
 
-The code is larger, but you can see that some autovectorization has taken place.
+The code is larger but you can see that some autovectorization has taken place.
 
 The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.
 
@@ -145,7 +145,7 @@ With the new code, you can expect a performance gain of about 4x.
 
 You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.
 
-The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
+The answer is *yes* but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
 
 Modify the `dotprod()` function to add the multiples of 4 hint as shown below:
 
@@ -195,7 +195,7 @@ Is there anything else the compiler can do?
 
 Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.
 
-For example, the `dotprod()` function operates on `int32_t` elements, what if you could limit the range to 8-bit?
+For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?
 
 There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
 
@@ -237,7 +237,7 @@ gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod
 
 You need to compile with the architecture flag to use the dot product instructions.
 
-The assembly output will be quite larger as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
+The assembly output will be quite large as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8 and byte-handling instructions if the size is smaller.
 
 You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
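The `dotprod()` function and the `N -= N % 4` hint these hunks discuss can be sketched as follows. The commit shows neither the full body nor the exact types, so this is a hypothetical reconstruction under the assumption of `int32_t` elements, as the surrounding prose states:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of dotprod() with the multiples-of-4
 * hint: truncating N to a multiple of 4 tells gcc the vectorized main
 * loop covers every iteration, so no scalar tail loop is emitted. */
int32_t dotprod(const int32_t *A, const int32_t *B, size_t N) {
    int32_t sum = 0;
    N -= N % 4;  /* the hint described in the text */
    for (size_t i = 0; i < N; i++) {
        sum += A[i] * B[i];  /* lowered to mla (or SDOT for 8-bit data) */
    }
    return sum;
}
```

Changing `N % 4` to `N % 16` widens the guarantee so the `SDOT` main loop (16 bytes per iteration) also needs no tail handling.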

content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md

Lines changed: 6 additions & 6 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
 
 The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.
 
-While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example:
+While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.
 
 Below is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs and used in calculating differences between video frames.
 
@@ -37,9 +37,9 @@ int main() {
 }
 ```
 
-A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is only for demonstration purposes*.
+A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.
 
-Save the code above to a file named `sadtest.c` and compile it:
+Save the above code to a file named `sadtest.c` and compile it:
 
 ```bash
 gcc -O3 -fno-inline sadtest.c -o sadtest
 
@@ -71,11 +71,11 @@ sad8:
 ret
 ```
 
-You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
+You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
 
 The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation.
 
-This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
+This would mean that 4x items at a time would be accumulated but, with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
 
 For completeness the SVE2 version will be provided, which does not depend on size being a multiple of 16.
 
@@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s
 
 Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.
 
-It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
+It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
 
 As with most tools, the better you know how to use it, the better the results will be.
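The `sad8()` kernel these hunks revolve around can be sketched like this. The commit shows only the assembly label and surrounding prose, so the body, types, and hint placement are assumptions consistent with the description (8-bit pixels, 32-bit accumulator, size a multiple of 16):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the sad8() Sum of Absolute Differences
 * kernel: 8-bit inputs widened into a 32-bit accumulator, the pattern
 * gcc lowers to SABDL2/SABAL/SADALP on Arm. */
uint32_t sad8(const uint8_t *a, const uint8_t *b, size_t n) {
    n -= n % 16;  /* demonstration-only hint: size is a multiple of 16 */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* absolute difference, computed branch-free on unsigned bytes */
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    }
    return sum;
}
```

The widening accumulation is the interesting part: it is exactly what `SADALP` (add and accumulate long pairwise) performs in one instruction.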

content/learning-paths/cross-platform/loop-reflowing/introduction-to-autovectorization.md

Lines changed: 4 additions & 4 deletions
@@ -8,15 +8,15 @@ layout: learningpathall
 
 ## Before you begin
 
-You should have an Arm Linux system with gcc installed. Refer to the [GNU compiler](/install-guides/gcc/native/) install guide for instructions. The examples use gcc as the compiler, but you can also use Clang.
+You should have an Arm Linux system with gcc installed. Refer to the [GNU compiler](/install-guides/gcc/native/) install guide for instructions. The examples use gcc as the compiler but you can also use Clang.
 
 ## Introduction to autovectorization
 
 CPU time is often spent executing code inside loops. Software that performs time-consuming calculations in image/video processing, games, scientific software, and AI, often revolves around a few loops doing most of the calculations.
 
 With the advent of single instruction, multiple data (SIMD) processing and vector engines in modern CPUs (like Neon and SVE), specialized instructions are available to improve the performance and efficiency of loops. However, the loops themselves need to be adapted to use SIMD instructions. The adaptation process is called *__vectorization__* and is synonymous with SIMD optimization.
 
-Depending on the actual loop and the operations involved, vectorization is possible or impossible and the loop is labeled as vectorizable or non-vectorizable.
+Depending on the actual loop and the operations involved, vectorization is either possible or not and the loop is labeled as vectorizable or non-vectorizable.
 
 Consider the following simple loop which adds 2 vectors:
 
@@ -41,7 +41,7 @@
 
 Use a text editor to copy the code above and save it as `addvec.c`.
 
-This is the most referred-to example with regards to vectorization, because it is easy to explain.
+This is the most referred-to example with regards to vectorization because it is easy to explain.
 
 For Advanced SIMD/Neon, the vectorized form is the following:
 
@@ -76,7 +76,7 @@ For many developers, vectorizing is a daunting task. Automating the process is o
 
 Autovectorization in compilers has been in development for the past 20 years. However, recent advances in both major compilers (Clang and GCC) have started to render autovectorization a viable alternative to hand-written SIMD code for more than just the basic loops. Some loop types are still not detected as autovectorizable, and it is not directly obvious which kinds of loops are autovectorizable and which are not.
 
-As a constantly advancing field, it is not easy to keep track of compiler support for autovectorization. It is an advanced Computer Science topic that involves the subjects of graph theory, compilers, and deep understanding of each architecture and the respective SIMD engines. The number of experts in the field is extremely small.
+As a constantly advancing field, it is not easy to keep track of compiler support for autovectorization. It is an advanced Computer Science topic that involves the subjects of graph theory, compilers, and a deep understanding of each architecture and the respective SIMD engines. The number of experts in the field is extremely small.
 
 In this Learning Path, you will learn about autovectorization through examples and identify how to adapt some loops to enable autovectorization.
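The "simple loop which adds 2 vectors" that `addvec.c` contains is only partially visible in the diff; a plausible reconstruction of the kernel (the canonical autovectorization example) looks like this:

```c
#include <stddef.h>

/* Plausible reconstruction of the addvec.c kernel described above (the
 * commit shows only fragments of the file): a scalar vector-add loop
 * that both gcc and Clang autovectorize at higher optimization levels. */
void addvec(float *C, float *A, float *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];  /* one element per iteration in scalar form */
}
```

At `-O3` the compiler replaces the scalar body with SIMD loads, a vector add, and a vector store, processing 4 floats per iteration on Neon.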
