Commit 1570f12

Merge pull request #679 from VectorCamp/main
Loop Reflowing/Autovectorization LP
2 parents c989db5 + 01f0471 commit 1570f12

9 files changed

+852
-0
lines changed

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
---
2+
title: Loop Reflowing/Autovectorization
3+
4+
minutes_to_complete: 45
5+
6+
who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers
7+
8+
learning_objectives:
9+
- Learn how to modify loops in order to take advantage of autovectorization in compilers
10+
11+
prerequisites:
12+
- An Arm computer running Linux with a recent version of a compiler (Clang or GCC) installed
13+
14+
author_primary: Konstantinos Margaritis
15+
16+
### Tags
17+
skilllevels: Advanced
18+
subjects: Programming
19+
armips:
20+
- Aarch64
21+
- Armv8-a
22+
- Armv9-a
23+
tools_software_languages:
24+
- GCC
25+
- Clang
26+
- Coding
27+
operatingsystems:
28+
- Linux
29+
shared_path: true
30+
shared_between:
31+
- laptops-and-desktops
32+
- servers-and-cloud-computing
33+
- smartphones-and-mobile
34+
35+
36+
### FIXED, DO NOT MODIFY
37+
# ================================================================================
38+
weight: 1 # _index.md always has weight of 1 to order correctly
39+
layout: "learningpathall" # All files under learning paths have this same wrapper
40+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
41+
---
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
---
2+
next_step_guidance: You now have a good understanding of Autovectorization, when to use it and how.
3+
4+
recommended_path: /learning-paths/servers-and-cloud-computing/top-down-n1/
5+
6+
further_reading:
7+
- resource:
8+
title: An update on GNU performance
9+
link: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/update-on-gnu-performance
10+
type: blog
11+
- resource:
12+
title: Auto-Vectorization in LLVM
13+
link: https://llvm.org/docs/Vectorizers.html
14+
type: website
15+
- resource:
16+
title: GCC Autovectorization
17+
link: https://hpac.cs.umu.se/teaching/sem-accg-16/slides/08.Schmitz-GGC_Autovec.pdf
18+
type: documentation
19+
- resource:
20+
title: Auto-vectorization in GCC
21+
link: https://gcc.gnu.org/projects/tree-ssa/vectorization.html
22+
type: website
23+
24+
25+
# ================================================================================
26+
# FIXED, DO NOT MODIFY
27+
# ================================================================================
28+
weight: 21 # set to always be larger than the content in this path, and one more than 'review'
29+
title: "Next Steps" # Always the same
30+
layout: "learningpathall" # All files under learning paths have this same wrapper
31+
---
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
---
2+
review:
3+
- questions:
4+
question: >
5+
Autovectorization is:
6+
answers:
7+
- The automatic generation of 3D vectors so that 3D applications/games run faster.
8+
- Converting an array of numbers in C to an STL C++ vector object.
9+
- The process where an algorithm is automatically vectorized by the compiler to use SIMD instructions.
10+
correct_answer: 3
11+
explanation: >
12+
Vectorization is the process of converting a loop to use SIMD instructions, and is usually done by hand. Autovectorization is when the compiler performs this conversion automatically, by detecting specific patterns in the loop that allow it to use SIMD instructions to increase performance.
13+
14+
- questions:
15+
question: >
16+
Can the compiler autovectorize all kinds of loops?
17+
answers:
18+
- No, only countable loops.
19+
- All loops except loops with function calls.
20+
- Yes, all of them.
21+
- No, only a few kinds of loops are vectorizable based on specific conditions.
22+
correct_answer: 4
23+
explanation: >
24+
A loop has to meet quite a few requirements before the compiler can detect it as vectorizable. In particular, it has to be countable, mostly free of branches, and free of function calls and data dependencies between iterations.
25+
26+
- questions:
27+
question: >
28+
The purpose of the `SDOT`/`UDOT` instructions on Arm is:
29+
answers:
30+
- To evaluate a dot product between 4 x 32-bit float elements in a vector.
31+
- To change the position of the decimal point ('dot') in a floating-point number.
32+
- To evaluate a sum of products of 4 x 8-bit signed/unsigned integers in each 32-bit element in the input vectors.
33+
correct_answer: 3
34+
explanation: >
35+
For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sum. For SVE, the `SDOT`/`UDOT` instructions also work on 16-bit signed/unsigned integers.
36+
37+
38+
# ================================================================================
39+
# FIXED, DO NOT MODIFY
40+
# ================================================================================
41+
title: "Review" # Always the same title
42+
weight: 20 # Set to always be larger than the content in this path
43+
layout: "learningpathall" # All files under learning paths have this same wrapper
44+
---
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
1+
---
2+
title: Autovectorization and restrict
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Autovectorization and restrict keyword
10+
11+
You have already experienced some form of autovectorization by learning about the [`restrict` keyword in a previous Learning Path](https://learn.arm.com/learning-paths/cross-platform/restrict-keyword-c99/).
12+
The example used there is a classic textbook case that the compiler autovectorizes simply because `restrict` is added.
13+
14+
Using the files you saved previously, compile both and compare the assembly output:
15+
16+
```bash
17+
gcc -O2 addvec.c -o addvec
18+
gcc -O2 addvec_neon.c -o addvec_neon
19+
```
20+
21+
Let's look at the assembly output of `addvec`:
22+
23+
```as
24+
addvec:
25+
mov x3, 0
26+
.L2:
27+
ldr s0, [x1, x3, lsl 2]
28+
ldr s1, [x2, x3, lsl 2]
29+
fadd s0, s0, s1
30+
str s0, [x0, x3, lsl 2]
31+
add x3, x3, 1
32+
cmp x3, 100
33+
bne .L2
34+
ret
35+
```
36+
37+
Similarly, for the `addvec_neon` executable:
38+
39+
```as
40+
addvec:
41+
mov x3, 0
42+
.L6:
43+
ldr q0, [x1, x3]
44+
ldr q1, [x2, x3]
45+
fadd v0.4s, v0.4s, v1.4s
46+
str q0, [x0, x3]
47+
add x3, x3, 16
48+
cmp x3, 400
49+
bne .L6
50+
ret
51+
```
52+
53+
The latter uses the Advanced SIMD/Neon `fadd` instruction with operands `v0.4s` and `v1.4s` to operate on 4 x 32-bit floating-point elements at once.
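The contents of `addvec_neon.c` are not reproduced here. As a rough sketch only, a hand-written version might look like the code below. The function name `addvec_by_hand`, the helper `check_addvec_by_hand`, and the value `N = 100` (inferred from the `cmp x3, 100` in the scalar listing) are assumptions for illustration. The scalar fallback is there so that the file also builds on machines without Neon.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define N 100 /* assumed, based on the loop bound in the scalar listing */

/* Hypothetical hand-written version; the real addvec_neon.c may differ. */
void addvec_by_hand(float *C, const float *A, const float *B) {
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process four floats per iteration in 128-bit Neon registers. */
    for (; i + 4 <= N; i += 4) {
        float32x4_t va = vld1q_f32(&A[i]);   /* load 4 floats from A */
        float32x4_t vb = vld1q_f32(&B[i]);   /* load 4 floats from B */
        vst1q_f32(&C[i], vaddq_f32(va, vb)); /* add and store 4 results */
    }
#endif
    /* Scalar tail; this is the whole loop when Neon is not available. */
    for (; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Self-check: every element must equal the scalar sum. */
void check_addvec_by_hand(void) {
    float A[N], B[N], C[N];
    for (size_t i = 0; i < N; i++) {
        A[i] = (float)i;        /* small integers keep the sums exact */
        B[i] = 2.0f * (float)i;
    }
    addvec_by_hand(C, A, B);
    for (size_t i = 0; i < N; i++) {
        assert(C[i] == 3.0f * (float)i);
    }
}
```

Calling `check_addvec_by_hand()` from a `main` function exercises the sketch on whichever path the target supports.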
54+
55+
Let's try to add `restrict` to the output argument `C` in the first `addvec` function:
56+
57+
```C
58+
void addvec(float *restrict C, float *A, float *B) {
59+
for (size_t i=0; i < N; i++) {
60+
C[i] = A[i] + B[i];
61+
}
62+
}
63+
```
64+
65+
Recompile and check the assembly output again:
66+
67+
```as
68+
addvec:
69+
mov x3, 0
70+
.L2:
71+
ldr q0, [x1, x3]
72+
ldr q1, [x2, x3]
73+
fadd v0.4s, v0.4s, v1.4s
74+
str q0, [x0, x3]
75+
add x3, x3, 16
76+
cmp x3, 400
77+
bne .L2
78+
ret
79+
```
80+
81+
As you can see, the compiler has autovectorized this loop and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` for such a trivial loop, as it is autovectorized anyway at certain optimization levels (`-O2` for clang, `-O3` for gcc). However, using `restrict` simplifies the generated code, because the compiler no longer has to account for overlapping arrays, and produces SIMD code similar to the hand-written version in `addvec_neon.c`.
82+
83+
The reason for this difference is the way each compiler decides whether to autovectorize. For each candidate loop, the compiler estimates the possible performance gain using a cost model, which is affected by many parameters, including the optimization level in the compilation flags. The cost model estimates how much the autovectorized code grows in size and whether the performance gain is enough to outweigh that increase. Based on this estimate, the compiler either uses the vectorized code or falls back to a 'safer' scalar implementation. This decision is not set in stone and is constantly re-evaluated during compiler development.
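One way to observe these decisions, without studying the cost model itself, is to ask the compiler to report which loops it vectorized and which it skipped. The sketch below puts the two `addvec` variants in one file. The file name `report.c` and the function names are hypothetical, and the wording of the reports differs between compiler versions, so no output is shown.

```c
/*
 * Hypothetical file name: report.c
 *
 * Ask each compiler to report its vectorization decisions:
 *   gcc   -O3 -fopt-info-vec-optimized -fopt-info-vec-missed -c report.c
 *   clang -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c report.c
 */
#include <assert.h>
#include <stddef.h>

#define N 100 /* assumed loop bound, matching the listings above */

/* Without restrict, the compiler must allow for C overlapping A or B. */
void addvec_plain(float *C, float *A, float *B) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

/* With restrict, the programmer promises that C does not overlap A or B. */
void addvec_restrict(float *restrict C, float *A, float *B) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Self-check: for non-overlapping arrays both variants give the same result. */
void check_report_example(void) {
    float A[N], B[N], C1[N], C2[N];
    for (size_t i = 0; i < N; i++) {
        A[i] = (float)i;
        B[i] = 1.0f;
    }
    addvec_plain(C1, A, B);
    addvec_restrict(C2, A, B);
    for (size_t i = 0; i < N; i++) {
        assert(C1[i] == C2[i]);
    }
}
```

Compiling the same file at different optimization levels and comparing the reports is a simple way to see where each compiler's cost model draws the line.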
84+
85+
This analysis is beyond the scope of this Learning Path. This was just one trivial example to demonstrate how autovectorization can be triggered by a keyword or a compiler flag.
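To see why the no-overlap promise made by `restrict` matters, consider what happens when the output array overlaps an input. The sketch below is illustrative only: `addvec_blocks_of_4` emulates a naive 4-wide vector loop in plain C and is not what any compiler actually emits. Without `restrict`, a compiler has to preserve the scalar result, typically by checking for overlap at run time or by staying scalar.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar semantics: one element at a time, in order. Overlap is legal here. */
void addvec_scalar(float *C, const float *A, const float *B, size_t n) {
    for (size_t i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Emulates a naive 4-wide vector loop: load four, add, then store four.
   Assumes n is a multiple of 4 to keep the sketch short. */
void addvec_blocks_of_4(float *C, const float *A, const float *B, size_t n) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        float t[4];
        for (size_t j = 0; j < 4; j++) t[j] = A[i + j] + B[i + j];
        for (size_t j = 0; j < 4; j++) C[i + j] = t[j];
    }
}

/* Self-check: with C = A + 1 (overlapping), the two orders disagree. */
void check_overlap(void) {
    float ones[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    float s[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
    float v[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};

    addvec_scalar(s + 1, s, ones, 8);      /* each step reads the previous write */
    addvec_blocks_of_4(v + 1, v, ones, 8); /* each block reads stale values */

    assert(s[8] == 9.0f); /* running sum: 1, 2, 3, ..., 9 */
    assert(v[8] == 2.0f); /* blocked order gives a different answer */
}
```

With `restrict` on `C`, calling the function with overlapping arrays like this would be undefined behavior, which is exactly what frees the compiler to use the blocked order.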
86+
87+
You will see some more advanced examples in the next sections.
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
---
2+
title: Autovectorization and conditionals
3+
weight: 5
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Autovectorization and conditionals
10+
11+
Previously we mentioned that compilers cannot autovectorize loops with branches. In this section, you will look at this in more detail: when the vectorizer can be enabled by adapting the loop, and when you need to modify the algorithm or write manually optimized code.
12+
13+
### If/else/switch in loops
14+
15+
Consider the following function, a modified form of the previous function that uses weighted coefficients for `A[i]`.
16+
17+
```C
18+
void addvecweight(float *restrict C, float *A, float *B,
19+
size_t N, float weight) {
20+
for (size_t i=0; i < N; i++) {
21+
if (weight < 0.5f)
22+
C[i] = A[i] + B[i];
23+
else
24+
C[i] = 1.5f*A[i] + 0.5f * B[i];
25+
}
26+
}
27+
```
28+
29+
You might be tempted to think that this loop cannot be vectorized. Such loops are not uncommon, and compilers have a difficult time recognizing the pattern and transforming them into vectorizable forms, when that is possible at all. However, this loop is vectorizable: the conditional is loop-invariant, so it can be moved out of the loop. Internally, the compiler essentially transforms the loop into something like the following:
30+
31+
```C
32+
void addvecweight(float *restrict C, float *A, float *B, size_t N, float weight) {
33+
if (weight < 0.5f) {
34+
for (size_t i=0; i < N; i++) {
35+
C[i] = A[i] + B[i];
36+
}
37+
} else {
38+
for (size_t i=0; i < N; i++) {
39+
C[i] = 1.5f*A[i] + 0.5f * B[i];
40+
}
41+
}
42+
}
43+
```
44+
45+
This is, in essence, two different loops, and you already know that the compiler can vectorize them. Both gcc and llvm can autovectorize this loop, but the generated code differs slightly, and performance may vary depending on the flags used and the exact nature of the loop.
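Moving a loop-invariant test out of a loop in this way is commonly called loop unswitching. As a small sanity check that the transformation preserves results, the sketch below compares the two forms. The function names `addvecweight_in_loop`, `addvecweight_hoisted`, and `check_unswitching` are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Branch tested on every iteration, as in the original function. */
void addvecweight_in_loop(float *restrict C, const float *A, const float *B,
                          size_t N, float weight) {
    for (size_t i = 0; i < N; i++) {
        if (weight < 0.5f)
            C[i] = A[i] + B[i];
        else
            C[i] = 1.5f * A[i] + 0.5f * B[i];
    }
}

/* Same computation with the loop-invariant test hoisted out by hand. */
void addvecweight_hoisted(float *restrict C, const float *A, const float *B,
                          size_t N, float weight) {
    if (weight < 0.5f) {
        for (size_t i = 0; i < N; i++) C[i] = A[i] + B[i];
    } else {
        for (size_t i = 0; i < N; i++) C[i] = 1.5f * A[i] + 0.5f * B[i];
    }
}

/* Self-check: both forms agree on either side of the 0.5f threshold. */
void check_unswitching(void) {
    enum { LEN = 16 };
    const float weights[2] = {0.25f, 0.75f};
    float A[LEN], B[LEN], C1[LEN], C2[LEN];
    for (size_t i = 0; i < LEN; i++) {
        A[i] = (float)i;        /* small integers keep the arithmetic exact */
        B[i] = 2.0f * (float)i;
    }
    for (size_t w = 0; w < 2; w++) {
        addvecweight_in_loop(C1, A, B, LEN, weights[w]);
        addvecweight_hoisted(C2, A, B, LEN, weights[w]);
        for (size_t i = 0; i < LEN; i++) assert(C1[i] == C2[i]);
    }
}
```

Writing the hoisted form by hand is rarely necessary when the compiler does it for you, but it can be a useful fallback if a vectorization report shows that the in-loop form was not transformed.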
46+
47+
However, the following loop is not yet autovectorized by all compilers (llvm/clang autovectorizes this loop, but not gcc):
48+
49+
```C
50+
void addvecweight2(float *restrict C, float *A, float *B,
51+
size_t N, float weight) {
52+
for (size_t i=0; i < N; i++) {
53+
if (A[i] < 0.5f)
54+
C[i] = A[i] + B[i];
55+
else
56+
C[i] = 1.5f*A[i] + 0.5f * B[i];
57+
}
58+
}
59+
```
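When the condition depends on the data, as it does here, one common way to adapt the loop is to compute both results and select between them, so that the loop body contains no control flow. The sketch below shows this idea. The name `addvecweight2_select` is hypothetical, and whether a particular compiler version vectorizes this form is an assumption you should verify with its vectorization reports.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the data-dependent branch from the example above. */
void addvecweight2_branch(float *restrict C, const float *A, const float *B,
                          size_t N) {
    for (size_t i = 0; i < N; i++) {
        if (A[i] < 0.5f)
            C[i] = A[i] + B[i];
        else
            C[i] = 1.5f * A[i] + 0.5f * B[i];
    }
}

/* Hypothetical rewrite: compute both candidates, then select one.
   Both expressions are side-effect free, so evaluating both is safe. */
void addvecweight2_select(float *restrict C, const float *A, const float *B,
                          size_t N) {
    for (size_t i = 0; i < N; i++) {
        float a = A[i];
        float b = B[i];
        float sum = a + b;
        float weighted = 1.5f * a + 0.5f * b;
        C[i] = (a < 0.5f) ? sum : weighted;
    }
}

/* Self-check: the two forms agree on values below and above the threshold. */
void check_select(void) {
    enum { LEN = 8 };
    float A[LEN], B[LEN], C1[LEN], C2[LEN];
    for (size_t i = 0; i < LEN; i++) {
        A[i] = 0.25f * (float)i; /* 0, 0.25, 0.5, ... exactly representable */
        B[i] = (float)i;
    }
    addvecweight2_branch(C1, A, B, LEN);
    addvecweight2_select(C2, A, B, LEN);
    for (size_t i = 0; i < LEN; i++) assert(C1[i] == C2[i]);
}
```

The trade-off is that every element now pays for both computations, which is usually acceptable when the expressions are as cheap as these.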
60+
61+
The same applies to `switch` statements: the loop can be vectorized if the controlling expression is loop-invariant, that is, if it does not depend on the loop variable or the elements involved in each iteration.
62+
For this reason, the following loop is autovectorized:
63+
64+
```C
65+
void addvecweight(float *restrict C, float *A, float *B,
66+
size_t N, int w) {
67+
for (size_t i=0; i < N; i++) {
68+
switch (w) {
69+
case 1:
70+
C[i] = A[i] + B[i];
71+
break;
72+
case 2:
73+
C[i] = 1.5f*A[i] + 0.5f * B[i];
74+
break;
75+
default:
76+
break;
77+
}
78+
}
79+
}
80+
```
81+
82+
But this one is not:
83+
84+
```C
85+
#define sign(x) (((x) > 0) ? 1 : (((x) < 0) ? -1 : 0))
86+
87+
void addvecweight(float *restrict C, float *A, float *B,
88+
size_t N, int w) {
89+
for (size_t i=0; i < N; i++) {
90+
switch (sign(A[i])) {
91+
case 1:
92+
C[i] = 0.5f * A[i] + 1.5f * B[i];
93+
break;
94+
case -1:
95+
C[i] = 1.5f * A[i] + 0.5f * B[i];
96+
break;
97+
default:
98+
C[i] = A[i] + B[i];
99+
break;
100+
}
101+
}
102+
}
103+
```
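If a loop like this sits on a hot path, one possible adaptation is to let the sign pick the coefficients instead of picking a code path, so that every iteration evaluates the same expression. The sketch below is only an illustration under that assumption. The names `addvec_sign_switch` and `addvec_sign_coeffs` are hypothetical, and whether the rewritten loop is autovectorized depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: a switch on the sign of A[i], as in the example above. */
void addvec_sign_switch(float *restrict C, const float *A, const float *B,
                        size_t N) {
    for (size_t i = 0; i < N; i++) {
        int s = (A[i] > 0.0f) - (A[i] < 0.0f); /* 1, -1 or 0 */
        switch (s) {
        case 1:  C[i] = 0.5f * A[i] + 1.5f * B[i]; break;
        case -1: C[i] = 1.5f * A[i] + 0.5f * B[i]; break;
        default: C[i] = A[i] + B[i];               break;
        }
    }
}

/* Hypothetical rewrite: select the two coefficients, then use one formula. */
void addvec_sign_coeffs(float *restrict C, const float *A, const float *B,
                        size_t N) {
    for (size_t i = 0; i < N; i++) {
        float a = A[i];
        float ca = (a > 0.0f) ? 0.5f : ((a < 0.0f) ? 1.5f : 1.0f);
        float cb = (a > 0.0f) ? 1.5f : ((a < 0.0f) ? 0.5f : 1.0f);
        C[i] = ca * a + cb * B[i];
    }
}

/* Self-check: both forms agree for negative, zero and positive inputs. */
void check_sign_coeffs(void) {
    enum { LEN = 5 };
    const float A[LEN] = {-2.0f, -1.0f, 0.0f, 1.0f, 2.0f};
    const float B[LEN] = {4.0f, 4.0f, 4.0f, 4.0f, 4.0f};
    float C1[LEN], C2[LEN];
    addvec_sign_switch(C1, A, B, LEN);
    addvec_sign_coeffs(C2, A, B, LEN);
    for (size_t i = 0; i < LEN; i++) assert(C1[i] == C2[i]);
}
```

The test values are chosen so that all products and sums are exact in single precision, which keeps the comparison independent of how the compiler schedules the multiply and add.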
104+
105+
The cases you have seen so far are generic and also apply to architectures other than Arm. In the next section, you will see Arm-specific use cases for autovectorization.
