content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md (6 additions, 6 deletions)
@@ -8,7 +8,7 @@ layout: learningpathall

The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.

-While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example:
+While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.

Below is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs and used in calculating differences between video frames.
@@ -37,9 +37,9 @@ int main() {
}
```

-A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is only for demonstration purposes*.
+A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.

-Save the code above to a file named `sadtest.c` and compile it:
+Save the above code to a file named `sadtest.c` and compile it:

```bash
gcc -O3 -fno-inline sadtest.c -o sadtest
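
The `sadtest.c` source itself is not included in this diff, only the prose around it. As a rough illustration of the kind of kernel being discussed, the sketch below shows a scalar SAD loop with the "multiple of 16" hint expressed as a `__builtin_unreachable()` guard, a common GCC/Clang idiom. The function name `sad8` matches the label in the assembly hunk below, but the parameter names, types, and the exact form of the hint are assumptions, not copied from the Learning Path.

```c
#include <stddef.h>
#include <stdint.h>

// Minimal sketch of a Sum of Absolute Differences (SAD) kernel.
// The real sadtest.c may differ in names, types, and details.
int32_t sad8(const int8_t *a, const int8_t *b, size_t size) {
    // Hint to the compiler that size is a multiple of 16, so it can drop
    // the scalar tail code it would otherwise generate for short remainders.
    if (size % 16 != 0)
        __builtin_unreachable();

    int32_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        sum += d < 0 ? -d : d;   // accumulate |a[i] - b[i]| into a 32-bit total
    }
    return sum;
}
```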
@@ -71,11 +71,11 @@ sad8:
ret
```

-You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
+You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).

The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation.

-This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
+This would mean that 4x items at a time would be accumulated but, with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.

For completeness the SVE2 version will be provided, which does not depend on size being a multiple of 16.
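
To make the widening pattern described in that hunk concrete, here is a hand-written NEON intrinsics sketch that follows the same absolute-difference-long / accumulate-long / pairwise-add-and-accumulate idea the compiler uses. It only roughly mirrors the generated code (it uses the SABDL/SABAL2/SADALP variants of the pattern), assumes `size` is a multiple of 16, and the function and variable names are made up for this example rather than taken from the Learning Path.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical hand-written NEON version of the SAD kernel (AArch64).
int32_t sad8_neon(const int8_t *a, const int8_t *b, size_t size) {
    int32x4_t acc = vdupq_n_s32(0);          // four 32-bit partial sums
    for (size_t i = 0; i < size; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // |a - b| widened to 16 bits for the low 8 lanes (SABDL) ...
        int16x8_t ad = vabdl_s8(vget_low_s8(va), vget_low_s8(vb));
        // ... and the high 8 lanes accumulated on top (SABAL2).
        ad = vabal_high_s8(ad, va, vb);
        // Pairwise widen-and-accumulate the 16-bit sums into 32-bit lanes (SADALP).
        acc = vpadalq_s16(acc, ad);
    }
    // Horizontal add of the four 32-bit partial sums.
    return vaddvq_s32(acc);
}
```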
@@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s

Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.

-It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
+It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.

As with most tools, the better you know how to use it, the better the results will be.