Commit 313a5a7

review autovectorization learning path
1 parent 1570f12 commit 313a5a7

8 files changed

+240
-136
lines changed

content/learning-paths/cross-platform/loop-reflowing/_index.md

Lines changed: 7 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -3,23 +3,22 @@ title: Loop Reflowing/Autovectorization
33

44
minutes_to_complete: 45
55

6-
who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers
6+
who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers.
77

88
learning_objectives:
9-
- Learn how to modify loops in order to take advantage of autovectorization in compilers
9+
- Modify loops to take advantage of autovectorization in compilers
1010

1111
prerequisites:
12-
- An Arm computer running Linux OS and a recent version of compiler (Clang or GCC) installed
12+
- An Arm computer running Linux and a recent version of Clang or the GNU compiler (gcc) installed.
1313

1414
author_primary: Konstantinos Margaritis
1515

1616
### Tags
1717
skilllevels: Advanced
18-
subjects: Programming
18+
subjects: Performance and Architecture
1919
armips:
20-
- Aarch64
21-
- Armv8-a
22-
- Armv9-a
20+
- Neoverse
21+
- Cortex-A
2322
tools_software_languages:
2423
- GCC
2524
- Clang
@@ -28,8 +27,8 @@ operatingsystems:
2827
- Linux
2928
shared_path: true
3029
shared_between:
31-
- laptops-and-desktops
3230
- servers-and-cloud-computing
31+
- laptops-and-desktops
3332
- smartphones-and-mobile
3433

3534

content/learning-paths/cross-platform/loop-reflowing/_review.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ review:
44
question: >
55
Autovectorization is:
66
answers:
7-
- The automatic generation of 3D vectors so that 3D applications/games run faster.
7+
- The automatic generation of 3D vectors so that 3D games run faster.
88
- Converting an array of numbers in C to an STL C++ vector object.
99
- The process where an algorithm is automatically vectorized by the compiler to use SIMD instructions.
1010
correct_answer: 3

content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md

Lines changed: 38 additions & 15 deletions
@@ -1,26 +1,31 @@
11
---
2-
title: Autovectorization and restrict
2+
title: Autovectorization using the restrict keyword
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Autovectorization and restrict keyword
9+
You may have already experienced some form of autovectorization by reading [Understand the restrict keyword in C99](/learning-paths/cross-platform/restrict-keyword-c99/).
1010

11-
You have already experienced some form of autovectorization by learning about the [`restrict` keyword in a previous Learning Path](https://learn.arm.com/learning-paths/cross-platform/restrict-keyword-c99/).
12-
Our example is a classic textbook example that the compiler will autovectorize simply by using `restrict`:
11+
The example in the previous section is a classic textbook case that the compiler autovectorizes once `restrict` is added.
1312

14-
Try the previously saved files, compile them both and compare the assembly output:
13+
Compile the previously saved files:
1514

1615
```bash
1716
gcc -O2 addvec.c -o addvec
1817
gcc -O2 addvec_neon.c -o addvec_neon
1918
```
2019

21-
Let's look at the assembly output of `addvec`:
20+
Generate the assembly output using:
2221

23-
```as
22+
```bash
23+
objdump -D addvec
24+
```
25+
26+
The assembly output of the `addvec()` function is shown below:
27+
28+
```output
2429
addvec:
2530
mov x3, 0
2631
.L2:
@@ -34,9 +39,15 @@ addvec:
3439
ret
3540
```
3641

37-
Similarly, for the `addvec_neon` executable:
42+
Generate the assembly output for `addvec_neon` using:
43+
44+
```bash
45+
objdump -D addvec_neon
46+
```
47+
48+
The assembly output for the `addvec()` function from the `addvec_neon` executable is shown below:
3849

39-
```as
50+
```output
4051
addvec:
4152
mov x3, 0
4253
.L6:
@@ -50,9 +61,9 @@ addvec:
5061
ret
5162
```
5263

53-
The latter uses Advanced SIMD/Neon instructions `fadd` with operands `v0.4s`, `v1.4s` to perform calculations in 4 x 32-bit floating-point elements.
64+
The second example uses the Advanced SIMD/Neon instruction `fadd` with operands `v0.4s`, `v1.4s` to perform the calculation on four 32-bit floating-point elements at a time.
5465

55-
Let's try to add `restrict` to the output argument `C` in the first `addvec` function:
66+
Add the `restrict` keyword to the output argument `C` in the `addvec()` function in `addvec.c`:
5667

5768
```C
5869
void addvec(float *restrict C, float *A, float *B) {
@@ -63,8 +74,14 @@ void addvec(float *restrict C, float *A, float *B) {
6374
```
6475
6576
Recompile and check the assembly output again:
77+
```bash
78+
gcc -O2 addvec.c -o addvec
79+
objdump -D addvec
80+
```
81+
82+
The assembly output for the `addvec` function is now:
6683

67-
```as
84+
```output
6885
addvec:
6986
mov x3, 0
7087
.L2:
@@ -78,10 +95,16 @@ addvec:
7895
ret
7996
```
8097

81-
As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` in such a trivial loop as it will be autovectorized anyway when certain optimization levels are added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, the use of restrict simplifies the code and generates SIMD code similar to the hand written version in `addvec_neon.c`.
98+
As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function.
99+
100+
Strictly speaking, you don't even need `restrict` in such a trivial loop, as it will be autovectorized anyway when certain optimization levels are added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, the use of `restrict` simplifies the code and generates SIMD code similar to the hand-written version in `addvec_neon.c`.
101+
102+
The reason for this is related to how each compiler decides whether to use autovectorization or not.
103+
104+
For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags.
82105

83-
The reason for this is because of the way each compiler decides whether to use autovectorization or not. For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags. This cost model will estimate whether the autovectorized code grows in size and if the performance gains are enough to outweigh this increase in code size. Based on this estimation, the compiler will decide to use this vectorized code or fall back to a more 'safe' scalar implementation. This decision however is something that is not set in stone and is constantly reevaluated during compiler development.
106+
The cost model estimates how much the autovectorized code grows in size and whether the performance gains outweigh that increase. Based on this estimate, the compiler decides to use the vectorized code or to fall back to a more 'safe' scalar implementation. This decision is not fixed, however, and is constantly reevaluated during compiler development.
84107

85-
This analysis goes beyond the scope of this LP, this was just one trivial example to demonstrate how the autovectorization can be triggered by a flag.
108+
Compiler cost model analysis is beyond the scope of this Learning Path, but the example demonstrates how autovectorization can be triggered by a flag.
86109

87110
You will see some more advanced examples in the next sections.
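
The comparison in this section can also be reproduced from a single source file. The following is a minimal sketch, not the original `addvec.c`: the fixed length `N` and the function names `addvec_plain` and `addvec_restrict` are assumptions made for illustration.

```C
#include <stddef.h>

#define N 1024  /* hypothetical fixed length; the original file may differ */

/* Without restrict, the compiler must assume that C may overlap A or B. */
void addvec_plain(float *C, float *A, float *B) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

/* With restrict, the caller promises that C is not accessed through A or B. */
void addvec_restrict(float *restrict C, float *A, float *B) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

Compiling this file with `gcc -O2 -fopt-info-vec-optimized -c` or `clang -O2 -Rpass=loop-vectorize -c` asks the compiler to report which loops it vectorized, which can be quicker than reading the disassembly. The report depends on the compiler version and flags, so treat it as a starting point rather than a guarantee.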

content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md

Lines changed: 19 additions & 10 deletions
@@ -6,11 +6,13 @@ weight: 5
66
layout: learningpathall
77
---
88

9-
## Autovectorization and conditionals
9+
In the previous section, you learned that compilers generally cannot autovectorize loops with branches.
1010

11-
Previously we mentioned that compilers cannot autovectorize loops with branches. In this section, you will see that in more detail, when it is possible to enable the vectorizer in the compiler by adapting the loop and when it is required to modify the algorithm or write manually optimized code.
11+
In this section, you will see more examples of loops with branches.
1212

13-
### If/else/switch in loops
13+
You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.
14+
15+
### Loops with if/else/switch statements
1416

1517
Consider the following function, a modified form of the previous function that uses weighted coefficients for `A[i]`.
1618

@@ -26,7 +28,9 @@ void addvecweight(float *restrict C, float *A, float *B,
2628
}
2729
```
2830
29-
You might be tempted to think that this loop cannot be vectorized. Such loops are not that uncommon and compilers have a difficult time understanding the pattern and transforming them to vectorizeable forms, when it is possible. However, this is actually a vectorizable loop, as the conditional can actually be moved out of the loop, as this is a loop-invariant conditional. Essentially the compiler would transform -internally- the loop in something like the following:
31+
You might think that this loop cannot be vectorized. Such loops are not uncommon, and compilers have a difficult time understanding the pattern and transforming them to vectorizable forms. However, this loop is vectorizable because the conditional is loop-invariant and can be moved out of the loop.
32+
33+
The compiler will internally transform the loop into something similar to the code below:
3034
3135
```C
3236
void addvecweight(float *restrict C, float *A, float *B, size_t N) {
@@ -42,9 +46,11 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) {
4246
}
4347
```
4448

45-
which is in essence, two different loops and we know that the compiler can vectorize them. Both gcc and llvm can actually autovectorize this loop, but the output is slightly different, performance may actually vary depending on the flags used and the exact nature of the loop.
49+
These are two different loops that the compiler can vectorize.
50+
51+
Both GCC and Clang can autovectorize this loop, but the output is slightly different. Performance may vary depending on the flags used and the exact nature of the loop.
4652

47-
However, the following loop is not yet autovectorized by all compilers (llvm/clang autovectorizes this loop, but not gcc):
53+
However, the loop below is autovectorized by Clang but it is not autovectorized by GCC.
4854

4955
```C
5056
void addvecweight2(float *restrict C, float *A, float *B,
@@ -58,8 +64,9 @@ void addvecweight2(float *restrict C, float *A, float *B,
5864
}
5965
```
6066
61-
Similarly with `switch` statements, if the condition expression in loop-invariant, that is if it does not depend on the loop variable or the elements involved in each iteration.
62-
For this reason we know that this loop is actually autovectorized:
67+
The situation is similar with `switch` statements. If the condition expression is loop-invariant, that is, if it does not depend on the loop variable or the elements involved in each iteration, the loop can be autovectorized.
68+
69+
This example is autovectorized:
6370
6471
```C
6572
void addvecweight(float *restrict C, float *A, float *B,
@@ -79,7 +86,7 @@ void addvecweight(float *restrict C, float *A, float *B,
7986
}
8087
```
8188

82-
But this one is not:
89+
This example is not autovectorized:
8390

8491
```C
8592
#define sign(x) (((x) > 0) ? 1 : (((x) < 0) ? -1 : 0))
@@ -102,4 +109,6 @@ void addvecweight(float *restrict C, float *A, float *B,
102109
}
103110
```
104111
105-
The cases you have seen so far are generic, they will work in other architectures besides Arm. In the next section, you will see Arm-specific usecases for autovectorization.
112+
The cases you have seen so far are generic; they work the same way on any architecture.
113+
114+
In the next section, you will see Arm-specific cases for autovectorization.
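
The transformation of a loop-invariant conditional described in this section is commonly called loop unswitching. The sketch below writes both forms by hand so they can be compared. The function bodies are not fully shown in this section, so the `weight` parameter, the factor `2.0f`, and the function names are hypothetical.

```C
#include <stddef.h>

/* Branch inside the loop: the test on `weight` is repeated in every
   iteration, but its outcome never changes while the loop runs. */
void addvec_branch(float *restrict C, float *A, float *B,
                   size_t N, int weight) {
    for (size_t i = 0; i < N; i++) {
        if (weight > 0)
            C[i] = 2.0f * A[i] + B[i];
        else
            C[i] = A[i] + B[i];
    }
}

/* Hand-unswitched form: the branch is tested once, which leaves two
   branch-free loops. */
void addvec_unswitched(float *restrict C, float *A, float *B,
                       size_t N, int weight) {
    if (weight > 0) {
        for (size_t i = 0; i < N; i++) {
            C[i] = 2.0f * A[i] + B[i];
        }
    } else {
        for (size_t i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
    }
}
```

Both functions compute the same result. Writing the second form by hand is usually only worthwhile when the compiler report shows that the first form was not vectorized.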

content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md

Lines changed: 64 additions & 55 deletions
@@ -6,65 +6,70 @@ weight: 4
66
layout: learningpathall
77
---
88

9-
## Autovectorization limits
9+
Autovectorization is not always as easy as adding a keyword like `restrict` to the arguments list.
1010

11-
Autovectorization is not as easy as adding a flag like `restrict` in the arguments list. There are some requirements for autovectorization to be enabled, namely:
11+
There are some requirements for autovectorization to be enabled. Some of the requirements with examples are shown below.
1212

13-
* The loops have to be countable
13+
#### Countable loops
1414

15-
This means that the following can be vectorized:
15+
A countable loop is a loop where the number of iterations is known before the loop begins executing.
16+
17+
The following loop is countable, so it can be vectorized:
1618

1719
```C
18-
for (size_t i=0; i < N; i++) {
19-
C[i] = A[i] + B[i];
20-
}
20+
for (size_t i=0; i < N; i++) {
21+
C[i] = A[i] + B[i];
22+
}
2123
```
2224

23-
but this one cannot be vectorized:
25+
This loop is not countable and cannot be vectorized:
2426

2527
```C
26-
i = 0;
27-
while(true) {
28-
C[i] = A[i] + B[i];
29-
i++;
30-
if (condition) break;
31-
}
28+
i = 0;
29+
while(1) {
30+
C[i] = A[i] + B[i];
31+
i++;
32+
if (condition) break;
33+
}
3234
```
3335

34-
Having said that, if condition is such that the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable. For example, this loop will *actually be vectorized*:
36+
If the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable.
37+
38+
For example, this loop is vectorizable:
3539

3640
```C
37-
i = 0;
38-
while(1) {
39-
C[i] = A[i] + B[i];
40-
i++;
41-
if (i >= N) break;
42-
}
41+
i = 0;
42+
while(1) {
43+
C[i] = A[i] + B[i];
44+
i++;
45+
if (i >= N) break;
46+
}
4347
```
44-
but this one will not be vectorizable:
48+
49+
This loop is not vectorizable:
4550

4651
```C
47-
i = 0;
48-
while(1) {
49-
C[i] = A[i] + B[i];
50-
i++;
51-
if (C[i] > 0) break;
52-
}
52+
i = 0;
53+
while(1) {
54+
C[i] = A[i] + B[i];
55+
i++;
56+
if (C[i] > 0) break;
57+
}
5358
```
5459

55-
* No function calls inside the loop
60+
#### No function calls inside the loop
5661

57-
For example if, `f()`, `g()` are functions that take `float` arguments, this loop cannot be autovectorized:
62+
If `f()` and `g()` are functions that take `float` arguments, this loop cannot be autovectorized:
5863

5964
```C
60-
for (size_t i=0; i < N; i++) {
61-
C[i] = f(A[i]) + g(B[i]);
62-
}
65+
for (size_t i=0; i < N; i++) {
66+
C[i] = f(A[i]) + g(B[i]);
67+
}
6368
```
6469

65-
There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is progress underway to enable these functions to be autovectorized, as the compiler will be able to use their vectorized counterparts in `mathvec` library (`libmvec`).
70+
The math library trigonometric and transcendental functions (such as `sin`, `cos`, and `exp`) are a special case. Work is underway to enable loops that call these functions to be autovectorized, with the compiler using their vectorized counterparts in the `mathvec` library (`libmvec`).
6671

67-
So for example, something like the following is actually *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to compilation flags to enable such autovectorization):
72+
The loop below is *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to the compilation flags to enable autovectorization):
6873

6974
```C
7075
void addfunc(float *restrict C, float *A, float *B, size_t N) {
@@ -74,38 +79,42 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) {
7479
}
7580
```
7681
77-
This will be in gcc 14 and require a new glibc as well (2.39). Until these are released, if you are using a released compiler as part of a distribution (gcc 13.2 at the time of writing), you will have to manually vectorize such code for performance.
82+
This feature will be in gcc 14 and will also require glibc version 2.39. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
7883
79-
We will expand on autovectorization of conditionals in the next section.
84+
There is more about autovectorization of conditionals in the next section.
8085
81-
* In general, no branches in the loop, no if/else/switch
86+
#### No branches in the loop and no if/else/switch statements
8287
83-
This is not universally true, there are cases where branches can actually be vectorized, we will expand this in the next section.
84-
And in the case of SVE/SVE2 on Arm, predicates will actually make this easier and remove or minimize these limitations at least in some cases. There is currently work in progress on the compiler front to enable the use of predicates in such loops. We will probably return with a new LP to explain SVE/SVE2 autovectorization and predicates in more depth.
88+
This is not universally true; there are cases where branches can be vectorized.
8589
86-
* Only inner-most loops will be vectorized.
90+
In the case of SVE/SVE2 on Arm, predicates will actually make this easier and remove or minimize these limitations at least in some cases. There is currently work in progress to enable the use of predicates in such loops. SVE/SVE2 autovectorization and predicates is a good topic for a future Learning Path.
8791
88-
To clarify, consider the following nested loop:
92+
There is more information on this in the next section.
93+
94+
#### Only inner-most loops will be vectorized
95+
96+
Consider the following nested loop:
8997
9098
```C
91-
for (size_t i=0; i < N; i++) {
92-
for (size_t j=0; j < M; j++) {
93-
C[i][j] = A[i][j] + B[i][j];
94-
}
99+
for (size_t i=0; i < N; i++) {
100+
for (size_t j=0; j < M; j++) {
101+
C[i][j] = A[i][j] + B[i][j];
95102
}
103+
}
96104
```
97105

98-
In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
99-
In fact, there are some cases where outer loop types are also autovectorized, but these are outside the scope of this LP.
106+
In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
107+
108+
There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path.
109+
110+
#### No data inter-dependency between iterations
100111

101-
* No data inter-dependency between iterations
112+
A data inter-dependency means that each iteration depends on the result of a previous iteration. Such loops are difficult, but not impossible, to autovectorize.
102113

103-
This means that each iteration depends on the result of the previous iteration. Such a problem is difficult -but not impossible- to autovectorize. Consider the following example:
114+
The loop below cannot be autovectorized as it is.
104115

105116
```C
106-
for (size_t i=1; i < N; i++) {
107-
C[i] = A[i] + B[i] + C[i-1];
108-
}
117+
for (size_t i=1; i < N; i++) {
118+
C[i] = A[i] + B[i] + C[i-1];
119+
}
109120
```
110-
111-
This example cannot be autovectorized as it is.
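
One possible way to restructure a loop with this kind of dependency is to split it in two, so that only the part that really needs the previous result stays serial. This is a sketch under stated assumptions rather than a method taken from the Learning Path: the function names `accumulate_serial` and `accumulate_split` are hypothetical, and `C[0]` is assumed to be initialized by the caller.

```C
#include <stddef.h>

/* Original form: every iteration reads C[i-1], which the previous
   iteration has just written. */
void accumulate_serial(float *restrict C, float *A, float *B, size_t N) {
    for (size_t i = 1; i < N; i++) {
        C[i] = A[i] + B[i] + C[i-1];
    }
}

/* Split form: the first loop has independent iterations, the second
   loop keeps the dependency on C[i-1]. */
void accumulate_split(float *restrict C, float *A, float *B, size_t N) {
    for (size_t i = 1; i < N; i++) {
        C[i] = A[i] + B[i];
    }
    for (size_t i = 1; i < N; i++) {
        C[i] = C[i] + C[i-1];
    }
}
```

Only the first loop of the split form is a candidate for autovectorization; the second loop still runs one element at a time. Whether the split is faster depends on the compiler, the target, and the array size, so it is worth measuring before adopting it.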
