Commit 895c3ec

Made corrections based on Tamar's feedback
1 parent 616cf53 commit 895c3ec

4 files changed

+11
-5
lines changed

content/learning-paths/cross-platform/loop-reflowing/_review.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ review:
 - To evaluate a sum of products of 4 x 8-bit signed/unsigned integers in each 32-bit element in the input vectors.
 correct_answer: 3
 explanation: >
-For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums.
+For each 32-bit element in the input vectors A[i], B[i], `SDOT`/`UDOT` evaluate the sum of the products between the 4 x 8-bit signed/unsigned integers that comprise the A[i], B[i] elements. The corresponding 32-bit element in the output vector holds the resulting sums. On SVE, the `SDOT`/`UDOT` instructions also work on 16-bit signed/unsigned integers.


 # ================================================================================
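The explanation above can be modelled in plain C. This is only a scalar sketch of the semantics for the 128-bit, 8-bit-element form, not the instruction itself; the names `sdot_lane` and `sdot_model` are hypothetical and do not come from the learning path. It assumes the accumulating behaviour of `SDOT`, where each sum is added to the existing 32-bit element of the destination.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of SDOT (hypothetical helper):
   four signed 8-bit products are summed and added to the accumulator. */
int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int k = 0; k < 4; k++) {
        acc += (int32_t)a[k] * (int32_t)b[k];
    }
    return acc;
}

/* Scalar model of a 128-bit SDOT (hypothetical helper):
   16 signed bytes per input, 4 x 32-bit accumulator lanes. */
void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int lane = 0; lane < 4; lane++) {
        acc[lane] = sdot_lane(acc[lane], &a[4 * lane], &b[4 * lane]);
    }
}
```

Each output lane depends only on the four bytes at the same position in both inputs, which is why the instruction maps well onto 8-bit dot-product loops.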

content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ addvec:
 ret
 ```

-As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function!
+As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` in such a trivial loop, as it will be autovectorized anyway once the optimization level is high enough (`-O2` for clang, `-O3` for gcc). However, using `restrict` simplifies the code and generates SIMD code similar to the hand-written version in `addvec_neon.c`.

 This is just a trivial example though and not all loops can be autovectorized that easily by the compiler.
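For reference, a loop of the kind discussed in this file could look like the sketch below. The actual `addvec` source is not part of this diff, so the signature and the placement of `restrict` here are assumptions made for illustration.

```c
#include <stddef.h>

/* Element-wise addition C[i] = A[i] + B[i].
   `restrict` promises the compiler that the three arrays do not overlap,
   so it does not need to emit runtime overlap checks before vectorizing.
   The signature is an assumption, not the learning path's exact code. */
void addvec(float *restrict C, const float *restrict A,
            const float *restrict B, size_t N) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

Compiling with and without the `restrict` qualifiers and comparing the assembly is a quick way to see what the qualifier changes for a given compiler and optimization level.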

content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md

Lines changed: 4 additions & 2 deletions
@@ -64,7 +64,7 @@ For example if, `f()`, `g()` are functions that take `float` arguments, this loo

 There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is progress underway to enable these functions to be autovectorized, as the compiler will be able to use their vectorized counterparts in `mathvec` library (`libmvec`).

-So for example, something like the following *will be autovectorized* in the future for Arm.
+So for example, something like the following is *already autovectorized* by the current gcc trunk for Arm (note that you have to add `-Ofast` to the compilation flags to enable this):

 ```C
 void addfunc(float *restrict C, float *A, float *B, size_t N) {
@@ -81,6 +81,7 @@ We will expand on autovectorization of conditionals in the next section.
 * In general, no branches in the loop, no if/else/switch

 This is not universally true. There are cases where branches can actually be vectorized, and we will expand on this in the next section.
+In the case of SVE/SVE2 on Arm, predicates make this easier and remove or minimize these limitations, at least in some cases. Work is in progress on the compiler side to enable the use of predicates in such loops. A future Learning Path may explain SVE/SVE2 autovectorization and predicates in more depth.

 * Only inner-most loops will be vectorized.

@@ -94,7 +95,8 @@ To clarify, consider the following nested loop:
 }
 ```

-In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).
+In fact, there are some cases where outer loops are also autovectorized, but these are outside the scope of this Learning Path.

 * No data inter-dependency between iterations
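As an illustration of the note that some branches can still be vectorized, the sketch below uses a simple conditional that a compiler may turn into a compare-and-select (or a max) instead of a real branch. The function name `relu` is hypothetical, and whether a given compiler actually vectorizes it depends on the version and flags, so checking the generated assembly is the only reliable test.

```c
#include <stddef.h>

/* Clamp negative values to zero (hypothetical example, not from the
   learning path). The conditional has no side effects and both outcomes
   are cheap to compute, which is the kind of pattern compilers can
   if-convert into branch-free vector code. */
void relu(float *restrict out, const float *restrict in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}
```

Conditionals that call functions, write to different arrays per branch, or exit the loop early are much harder for the compiler to handle this way.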

content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md

Lines changed: 5 additions & 1 deletion
@@ -183,7 +183,11 @@ int main() {
 }
 ```

-You need to recompile the code with `gcc -O3 -Wall -g -fno-inline -march=armv8-a+dotprod` in order to hint to the compiler that it has the new instructions at its disposal.
+You need to add `-march=armv8-a+dotprod` to the compilation flags to tell the compiler that it has the new instructions at its disposal, that is:
+
+```bash
+gcc -O3 -Wall -g -fno-inline -march=armv8-a+dotprod
+```

 The assembly output will be considerably larger, because `SDOT` can only be used in the main loop, where the size is a multiple of 16. The compiler will then unroll the loop to use ASIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
 You could eliminate those extra tail instructions by converting `N -= N % 4` to 8 or even 16:
