Commit e7f90c0

Update autovectorization-on-arm-2.md
minor editorial amends
1 parent 458604c commit e7f90c0

1 file changed, +6 -6 lines changed


content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md

Lines changed: 6 additions & 6 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
 
 The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.
 
-While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example:
+While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.
 
 Below is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs and used in calculating differences between video frames.
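
The SAD listing itself falls outside this diff hunk. As a minimal sketch of what such a loop can look like, assuming signed 8-bit inputs and a 32-bit accumulator (the name `sad8_ref`, the element type, and the signature are illustrative assumptions, not the file's actual code):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference implementation: sum of |a[i] - b[i]| over `size` elements.
   The int8_t element type and this signature are assumptions for illustration. */
int32_t sad8_ref(const int8_t *a, const int8_t *b, size_t size) {
    int32_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        /* Both operands are promoted to int before the subtraction,
           so the difference cannot overflow the 8-bit range. */
        sum += abs(a[i] - b[i]);
    }
    return sum;
}
```

Each term is at most 255 for signed 8-bit inputs, so the accumulator has to be wider than the elements it sums.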

@@ -37,9 +37,9 @@ int main() {
 }
 ```
 
-A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is only for demonstration purposes*.
+A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.
 
-Save the code above to a file named `sadtest.c` and compile it:
+Save the above code to a file named `sadtest.c` and compile it:
 
 ```bash
 gcc -O3 -fno-inline sadtest.c -o sadtest
@@ -71,11 +71,11 @@ sad8:
 ret
 ```
 
-You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
+You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
 
 The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation.
 
-This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
+This would mean that 4x items at a time would be accumulated but, with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
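
To make the lane arithmetic above concrete, here is a portable C model of what the three named instructions compute on one 16-byte block. It is a sketch based on the linked instruction descriptions, not the compiler's output; the function name `sad_block16` and the order of the steps are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model: consume 16 signed bytes from each input and
   accumulate the absolute differences into 4 x 32-bit lanes. */
void sad_block16(const int8_t a[16], const int8_t b[16], int32_t acc[4]) {
    int16_t wide[8];

    /* Like SABDL2: absolute differences of the upper 8 bytes, widened to 16 bits. */
    for (int i = 0; i < 8; i++) {
        wide[i] = (int16_t)abs(a[8 + i] - b[8 + i]);
    }
    /* Like SABAL: absolute differences of the lower 8 bytes, added to the 16-bit lanes. */
    for (int i = 0; i < 8; i++) {
        wide[i] = (int16_t)(wide[i] + abs(a[i] - b[i]));
    }
    /* Like SADALP: add adjacent 16-bit pairs and accumulate into the 32-bit lanes. */
    for (int i = 0; i < 4; i++) {
        acc[i] += wide[2 * i] + wide[2 * i + 1];
    }
}
```

Each 16-bit lane holds at most 2 x 255 = 510, so nothing overflows before the final widening step. Adding the four `acc` lanes after the last block gives the SAD, and all 16 input elements are consumed per step without first converting each one to 32-bit.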
 
 For completeness the SVE2 version will be provided, which does not depend on size being a multiple of 16.
 
@@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s
 
 Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.
 
-It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
+It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
 
 As with most tools, the better you know how to use it, the better the results will be.
 