content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md
6 additions & 6 deletions
@@ -72,7 +72,7 @@ dotprod:
 ret
 ```

-You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
+You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.

 Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:

@@ -135,7 +135,7 @@ dotprod:
 b .L3
 ```

-The code is larger, but you can see that some autovectorization has taken place.
+The code is larger but you can see that some autovectorization has taken place.

 The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.

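Reviewer note: the context line above describes a main loop that handles 4 elements per `mla`, with separate code for any leftover elements. As a rough scalar model of that structure (not the compiler's actual output), the sketch below uses four independent accumulators to stand in for the four 32-bit lanes. The function name `dotprod_by4` and its signature are hypothetical and do not come from the learning path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the vectorized loop: lane[0..3] play the role
   of the four 32-bit lanes that a single `mla` updates per iteration. */
int32_t dotprod_by4(const int32_t *A, const int32_t *B, size_t N)
{
    int32_t lane[4] = {0, 0, 0, 0};
    size_t i = 0;

    /* Main loop: consume whole groups of 4 elements. */
    for (; i + 4 <= N; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            lane[k] += A[i + k] * B[i + k];
        }
    }

    /* Horizontal add: fold the four lanes into one scalar. */
    int32_t result = lane[0] + lane[1] + lane[2] + lane[3];

    /* Tail: the remaining N % 4 elements, one at a time. */
    for (; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}
```

The tail loop corresponds to the "last part of the code" that the following hunk asks about removing.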
@@ -145,7 +145,7 @@ With the new code, you can expect a performance gain of about 4x.

 You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.

-The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
+The answer is *yes* but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.

 Modify the `dotprod()` function to add the multiples of 4 hint as shown below:
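Reviewer note: the modified function itself falls outside this hunk. A minimal sketch of what such a hint might look like is given here, built on the `N -= N % 4` statement quoted later in the diff. The signature of `dotprod()` is an assumption, since the diff context does not show it.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed signature; only the function name appears in the diff context. */
int32_t dotprod(int32_t *A, int32_t *B, size_t N)
{
    int32_t result = 0;

    /* Hint: round N down to a multiple of 4 so the compiler can see that
       no scalar tail is needed after the vectorized loop. */
    N -= N % 4;

    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}
```

Note that this is more than a hint: if the caller passes a size that is not a multiple of 4, the last `N % 4` elements are silently skipped, so it is only appropriate when the sizes really are multiples of 4.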
@@ -195,7 +195,7 @@ Is there anything else the compiler can do?

 Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.

-For example, the `dotprod()` function operates on `int32_t` elements, what if you could limit the range to 8-bit?
+For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?

 There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
You need to compile with the architecture flag to use the dot product instructions.

-The assembly output will be quite larger as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
+The assembly output will be quite large as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8 and byte-handling instructions if the size is smaller.

 You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
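Reviewer note: the code that follows this line in the file is not part of the diff. A hedged sketch of an 8-bit variant with the size rounded down to a multiple of 16 might look like the following. The name `dotprod_i8` and its signature are hypothetical. The compiler flag mentioned in the comment is one that GCC accepts for enabling the dot product extension, but the diff does not show which flag the learning path actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit variant. To let the compiler consider SDOT, build with
   an architecture flag that enables the dot product extension, for example
   -O3 -march=armv8.2-a+dotprod with GCC (assumed, not shown in the diff). */
int32_t dotprod_i8(const int8_t *A, const int8_t *B, size_t N)
{
    int32_t result = 0;

    /* Round N down to a multiple of 16, matching the 16 bytes that fit in
       one 128-bit vector register, so no tail code is required. */
    N -= N % 16;

    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];  /* int8_t operands promote to int */
    }
    return result;
}
```

As with the multiple-of-4 version, any elements beyond the last full group of 16 are ignored, so the caller has to guarantee the size.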