content/learning-paths/cross-platform/loop-reflowing/_review.md (1 addition & 1 deletion)
@@ -32,7 +32,7 @@ review:
  - To evaluate a sum of products of 4 x 8-bit signed/unsigned integers in each 32-bit element in the input vectors.
  correct_answer: 3
  explanation: >
- For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums.
+ For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums. For SVE, the `SDOT`/`UDOT` instructions also work on 16-bit signed/unsigned integers.
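
To make the explanation in this hunk concrete, here is a minimal scalar sketch of what a 128-bit `SDOT` computes. The function names `sdot_lane_model` and `sdot_vec_model` are hypothetical and are not part of the learning path; the sketch also assumes the accumulating form of the instruction, where each sum of products is added to the existing 32-bit destination element.

```c
#include <stdint.h>

/* Scalar model (illustrative only) of one 32-bit lane of SDOT:
   acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
int32_t sdot_lane_model(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int k = 0; k < 4; k++) {
        /* Widen to 32 bits before multiplying, as the instruction does. */
        acc += (int32_t)a[k] * (int32_t)b[k];
    }
    return acc;
}

/* Model of a 128-bit SDOT: four independent 32-bit lanes,
   each consuming four consecutive bytes of both inputs. */
void sdot_vec_model(int32_t acc[4], const int8_t a[16], const int8_t b[16])
{
    for (int lane = 0; lane < 4; lane++) {
        acc[lane] = sdot_lane_model(acc[lane], &a[4 * lane], &b[4 * lane]);
    }
}
```

`UDOT` follows the same pattern with `uint8_t` inputs and a `uint32_t` accumulator.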
content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md (1 addition & 1 deletion)
@@ -78,7 +78,7 @@ addvec:
  ret
  ```

- As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function!
+ As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` in such a trivial loop, as it will be autovectorized anyway at certain optimization levels (`-O2` for clang, `-O3` for gcc). However, using `restrict` simplifies the code and generates SIMD code similar to the hand-written version in `addvec_neon.c`.

  This is just a trivial example though and not all loops can be autovectorized that easily by the compiler.
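
As a rough illustration of the point about `restrict` in this hunk, the sketch below shows a plain element-wise addition loop. The name `addvec_restrict` and its signature are assumptions made for this example and may not match the function in the learning path. The `restrict` qualifiers promise the compiler that the three arrays do not overlap, so it does not need to guard the vectorized loop with a runtime overlap check.

```c
#include <stddef.h>

/* Hypothetical variant of the addvec loop (name and signature are assumptions).
   restrict tells the compiler that C, A and B never alias each other. */
void addvec_restrict(float *restrict C, const float *restrict A,
                     const float *restrict B, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        C[i] = A[i] + B[i];   /* independent per-element work, easy to vectorize */
    }
}
```

Compiling this with the optimization levels mentioned above and inspecting the assembly is one way to compare the result against the hand-written version.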
content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md (4 additions & 2 deletions)
@@ -64,7 +64,7 @@ For example if, `f()`, `g()` are functions that take `float` arguments, this loo

  There is a special case for the math library's trigonometric and transcendental functions (such as `sin`, `cos`, `exp`). Work is underway to enable these functions to be autovectorized, as the compiler will be able to use their vectorized counterparts in the `mathvec` library (`libmvec`).

- So for example, something like the following *will be autovectorized* in the future for Arm.
+ So, for example, something like the following is *already autovectorized* in current gcc trunk for Arm (note that you have to add `-Ofast` to the compilation flags to enable this autovectorization):
@@ -81,6 +81,7 @@ We will expand on autovectorization of conditionals in the next section.
  * In general, no branches in the loop, no if/else/switch

  This is not universally true: there are cases where branches can actually be vectorized, and we will expand on this in the next section.
+ In the case of SVE/SVE2 on Arm, predicates make this easier and remove or minimize these limitations, at least in some cases. Work is currently in progress on the compiler side to enable the use of predicates in such loops. We will probably return with a new Learning Path to explain SVE/SVE2 autovectorization and predicates in more depth.

  * Only inner-most loops will be vectorized.

@@ -94,7 +95,8 @@ To clarify, consider the following nested loop:
  }
  ```

- In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+ In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+ In fact, there are some cases where outer loops are also autovectorized, but these are outside the scope of this Learning Path.
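
The inner-loop rule described in this hunk can be sketched with a small nested loop. This is an illustrative example, not the loop from the learning path; the function name `row_sums` and the row-major layout are assumptions. The inner loop is countable and branch-free, so it is the part a compiler would typically consider for vectorization, while the outer loop simply iterates over rows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: out[i] receives the sum of row i of a
   rows x cols matrix stored in row-major order. */
void row_sums(int32_t *restrict out, const int32_t *restrict m,
              size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {        /* outer loop: usually left scalar */
        int32_t sum = 0;
        for (size_t j = 0; j < cols; j++) {    /* inner loop: countable, no branches */
            sum += m[i * cols + j];
        }
        out[i] = sum;
    }
}
```

An integer reduction is used here because floating-point reductions generally need relaxed floating-point flags before a compiler will reorder the additions.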
content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md (5 additions & 1 deletion)
@@ -183,7 +183,11 @@ int main() {
  }
  ```

- You need to recompile the code with `gcc -O3 -Wall -g -fno-inline -march=armv8-a+dotprod` in order to hint to the compiler that it has the new instructions at its disposal.
+ You need to add `-march=armv8-a+dotprod` to the compilation flags in order to hint to the compiler that it has the new instructions at its disposal, that is:
  The assembly output will be considerably larger, because `SDOT` can only be used in the main loop, where the size is a multiple of 16. For the remaining elements, the compiler unrolls the loop to use ASIMD instructions if the size is greater than 8, and byte-handling instructions if it is smaller.
  You could eliminate those extra tail instructions by changing the 4 in `N -= N % 4` to 8 or even 16:
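
The diff view does not show the code that follows this sentence. As a rough sketch of the idea only, and assuming a simple 8-bit dot product loop, rounding `N` down to a multiple of 16 could look like the following. The function name `dotprod_mult16` and its signature are hypothetical. Note that this version deliberately ignores any trailing elements beyond the last multiple of 16, which is only acceptable if the caller guarantees a suitable size or handles the remainder separately.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: dot product of two int8_t arrays where the trip
   count is forced to a multiple of 16, so no tail loop is required. */
int32_t dotprod_mult16(const int8_t *restrict A, const int8_t *restrict B, size_t N)
{
    int32_t result = 0;
    N -= N % 16;   /* drop the remainder: up to 15 trailing elements are skipped */
    for (size_t i = 0; i < N; i++) {
        result += (int32_t)A[i] * (int32_t)B[i];
    }
    return result;
}
```

With 20 input elements, for example, only the first 16 contribute to the result.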