content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md
38 additions & 15 deletions
---
title: Autovectorization using the restrict keyword
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

You may have already experienced some form of autovectorization by reading [Understand the restrict keyword in C99](/learning-paths/cross-platform/restrict-keyword-c99/).

The example in the previous section is a classic textbook example that the compiler will autovectorize by using `restrict`.

Compile the previously saved files:

```bash
gcc -O2 addvec.c -o addvec
gcc -O2 addvec_neon.c -o addvec_neon
```

Generate the assembly output using:

```bash
objdump -D addvec
```

The assembly output of the `addvec()` function is shown below:

```output
addvec:
mov x3, 0
.L2:
...
ret
```

Generate the assembly output for `addvec_neon` using:

```bash
objdump -D addvec_neon
```

The assembly output for the `addvec()` function from the `addvec_neon` executable is shown below:

```output
addvec:
mov x3, 0
.L6:
...
ret
```

The second example uses the Advanced SIMD/Neon instruction `fadd` with operands `v0.4s` and `v1.4s` to perform calculations on 4 x 32-bit floating-point elements at a time.

Add the `restrict` keyword to the output argument `C` in the `addvec()` function in `addvec.c`:
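A minimal sketch of the change is shown below, assuming `addvec()` takes three `float` pointers and an element count as in the earlier example; the exact source comes from your `addvec.c`, so treat this only as an illustration of where the qualifier goes:

```C
#include <stddef.h>

// Sketch only: the signature is assumed. The point is the restrict
// qualifier on the output pointer C, which tells the compiler that C
// does not alias A or B.
void addvec(float *restrict C, float *A, float *B, size_t N) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

Recompile with the same `gcc -O2` command and run `objdump -D` on the result to regenerate the disassembly.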
The assembly output for the `addvec()` function is now:

```output
addvec:
mov x3, 0
.L2:
...
ret
```

As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function.

Strictly speaking, you don't even need `restrict` in such a trivial loop, as it will be autovectorized anyway at certain optimization levels (`-O2` for Clang, `-O3` for GCC). However, the use of `restrict` simplifies the code and generates SIMD code similar to the hand-written version in `addvec_neon.c`.
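As a quick check of that claim, you can rebuild the version of `addvec.c` that does not use `restrict` at a higher optimization level and inspect the disassembly again; the output file name below is arbitrary:

```bash
# Rebuild the non-restrict version at -O3 and look for SIMD (v-register)
# instructions in the addvec function.
gcc -O3 addvec.c -o addvec_o3
objdump -D addvec_o3 | grep -A 20 "<addvec>:"
```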
The reason for this is related to how each compiler decides whether or not to use autovectorization.

For each candidate loop, the compiler estimates the possible performance gain against a cost model, which is affected by many parameters and, of course, by the optimization level in the compilation flags.

The cost model estimates whether the autovectorized code grows in size and whether the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler decides to use the vectorized code or to fall back to a 'safe' scalar implementation. This decision, however, is fluid and is constantly reevaluated during compiler development.
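If you want to see how the compiler reached its decision for a particular loop, both GCC and Clang can print vectorization reports. The commands below are a minimal sketch using the standard report options; the exact messages vary by compiler version:

```bash
# GCC: report loops that were vectorized and loops that were not.
gcc -O3 -fopt-info-vec -fopt-info-vec-missed addvec.c -o addvec

# Clang: report the loop vectorizer's decisions and its cost-model analysis.
clang -O2 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize addvec.c -o addvec
```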
Compiler cost model analysis is beyond the scope of this Learning Path, but the example demonstrates how autovectorization can be triggered by the `restrict` keyword.

You will see some more advanced examples in the next sections.
content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md
19 additions & 10 deletions
layout: learningpathall
---

In the previous section, you learned that compilers cannot autovectorize loops with branches.

In this section, you will see more examples of loops with branches.

You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.

### Loops with if/else/switch statements

Consider the following function, a modified form of the previous function that uses weighted coefficients for `A[i]`.
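A minimal sketch consistent with that description is shown below; the function and parameter names (`addvec_weighted`, `use_weight`) and the weight value are illustrative assumptions:

```C
#include <stddef.h>

// Illustrative sketch: the weight applied to A[i] is selected by a flag
// that does not change inside the loop (a loop-invariant condition).
void addvec_weighted(float *restrict C, float *A, float *B,
                     size_t N, int use_weight) {
    for (size_t i = 0; i < N; i++) {
        if (use_weight)
            C[i] = 2.0f * A[i] + B[i];
        else
            C[i] = A[i] + B[i];
    }
}
```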
You might think that this loop cannot be vectorized. Such loops are not uncommon, and compilers have a difficult time understanding the pattern and transforming them into vectorizable forms. However, this loop is vectorizable, because the conditional is loop-invariant and can be moved out of the loop.

The compiler will internally transform the loop into something similar to the code below:
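Using the illustrative function sketched above (the names remain assumptions), the transformation is roughly:

```C
// Rough sketch of the internal transformation: the invariant condition is
// tested once, and each branch contains a simple, branch-free loop.
if (use_weight) {
    for (size_t i = 0; i < N; i++)
        C[i] = 2.0f * A[i] + B[i];
} else {
    for (size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```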
These are, in essence, two different loops that the compiler can vectorize.

Both GCC and Clang can autovectorize this loop, but their output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.

However, the loop below is autovectorized by Clang but it is not autovectorized by GCC.

The situation is similar with `switch` statements. If the condition expression is loop-invariant, that is, if it does not depend on the loop variable or on the elements involved in each iteration, the loop can be autovectorized.
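As an illustration, here is a sketch of a loop with a loop-invariant `switch`; the selector `op` is assumed to be set before the loop and not modified inside it:

```C
// Illustrative sketch: 'op' does not depend on i or on the array elements,
// so the switch is loop-invariant and can be hoisted by the compiler.
for (size_t i = 0; i < N; i++) {
    switch (op) {
    case 0:  C[i] = A[i] + B[i]; break;
    case 1:  C[i] = A[i] - B[i]; break;
    default: C[i] = A[i];        break;
    }
}
```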
The cases you have seen so far are generic; they work the same for any architecture.

In the next section, you will see Arm-specific cases for autovectorization.
content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md
64 additions & 55 deletions
layout: learningpathall
---

Autovectorization is not as easy as adding a keyword like `restrict` to the argument list.

There are some requirements for autovectorization to be enabled. The main requirements, with examples, are shown below.
#### Countable loops

A countable loop is a loop where the number of iterations is known before the loop begins executing.

For example, the following countable loop can be vectorized:

```C
for (size_t i=0; i < N; i++) {
    C[i] = A[i] + B[i];
}
```

This loop is not countable and cannot be vectorized:

```C
i = 0;
while(1) {
    C[i] = A[i] + B[i];
    i++;
    if (condition) break;
}
```

If the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable.

For example, this loop is vectorizable:

```C
i = 0;
while(1) {
    C[i] = A[i] + B[i];
    i++;
    if (i >= N) break;
}
```

This loop is not vectorizable:

```C
i = 0;
while(1) {
    C[i] = A[i] + B[i];
    i++;
    if (C[i] > 0) break;
}
```

In the last loop, the exit condition depends on values computed inside the loop, so the number of iterations cannot be known in advance.
#### No function calls inside the loop

If `f()` and `g()` are functions that take `float` arguments, this loop cannot be autovectorized:

```C
for (size_t i=0; i < N; i++) {
    C[i] = f(A[i]) + g(B[i]);
}
```

The math library trigonometry and transcendental functions (such as `sin`, `cos`, and `exp`) are a special case. There is work underway to enable loops that call these functions to be autovectorized, as the compiler can use their vectorized counterparts in the `mathvec` library (`libmvec`).

The loop below is *already autovectorized* in current gcc trunk for Arm (note that you have to add `-Ofast` to the compilation flags to enable this autovectorization):
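As an illustration of the kind of loop meant, consider a loop that calls a libm function on each element; the choice of `sinf` here is an assumption:

```C
// Illustrative sketch: one libm call per element. With -Ofast, a compiler
// and glibc that support libmvec can replace sinf() with a vector version.
for (size_t i = 0; i < N; i++) {
    C[i] = sinf(A[i]);
}
```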
This feature will be in gcc 14 and also requires glibc version 2.39. Until these are released, if you are using a released compiler from a Linux distribution (such as gcc 13.2 at the time of writing), you will need to manually vectorize such code for performance.

There is more about autovectorization of conditionals in the next section.

#### No branches in the loop and no if/else/switch statements

This is not universally true; there are cases where branches can actually be vectorized.

In the case of SVE/SVE2 on Arm, predicates make this easier and remove or minimize these limitations, at least in some cases. There is currently work in progress to enable the use of predicates in such loops. SVE/SVE2 autovectorization and predicates is a good topic for a future Learning Path.

There is more information on this in the next section.

#### Only inner-most loops will be vectorized

Consider the following nested loop:

```C
for (size_t i=0; i < N; i++) {
    for (size_t j=0; j < M; j++) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
```

In this case, only the inner loop will be vectorized, provided all the other conditions also apply (no branches, and the inner loop is countable).

There are some cases where outer loops are also autovectorized, but these are not covered in this Learning Path.

#### No data inter-dependency between iterations

A data inter-dependency means that an iteration depends on the result of a previous iteration. Such loops are difficult, but not impossible, to autovectorize.
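As an illustration, here is a sketch of a loop with such a dependency, a running sum where each element needs the previous result:

```C
// Illustrative sketch: C[i] depends on C[i - 1], so iterations cannot
// simply be executed in independent SIMD lanes.
C[0] = A[0];
for (size_t i = 1; i < N; i++) {
    C[i] = C[i - 1] + A[i];
}
```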