content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md (6 additions, 6 deletions)
@@ -8,7 +8,7 @@ layout: learningpathall

The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.

-While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example:
+While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.

Below is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs and used in calculating differences between video frames.
@@ -37,9 +37,9 @@ int main() {
}
```

-A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is only for demonstration purposes*.
+A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.

-Save the code above to a file named `sadtest.c` and compile it:
+Save the above code to a file named `sadtest.c` and compile it:

```bash
gcc -O3 -fno-inline sadtest.c -o sadtest
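
The `sadtest.c` source itself is not included in this diff, only the prose around it. As a rough illustration of the kind of kernel being discussed, the sketch below shows a scalar SAD loop with the "multiple of 16" hint expressed as a `__builtin_unreachable()` guard, a common GCC/Clang idiom. The function name `sad8` matches the label in the assembly hunk below, but the parameter names, types, and the exact form of the hint are assumptions, not copied from the Learning Path.

```c
#include <stddef.h>
#include <stdint.h>

// Minimal sketch of a Sum of Absolute Differences (SAD) kernel.
// The real sadtest.c may differ in names, types, and details.
int32_t sad8(const int8_t *a, const int8_t *b, size_t size) {
    // Hint to the compiler that size is a multiple of 16, so it can drop
    // the scalar tail code it would otherwise generate for short remainders.
    if (size % 16 != 0)
        __builtin_unreachable();

    int32_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        sum += d < 0 ? -d : d;   // accumulate |a[i] - b[i]| into a 32-bit total
    }
    return sum;
}
```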
@@ -71,11 +71,11 @@ sad8:
ret
```

-You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).
+You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).

The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation.

-This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
+This would mean that 4x items at a time would be accumulated but, with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.

For completeness the SVE2 version will be provided, which does not depend on size being a multiple of 16.
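
To make the widening pattern described in that hunk concrete, here is a hand-written NEON intrinsics sketch that follows the same absolute-difference-long / accumulate-long / pairwise-add-and-accumulate idea the compiler uses. It only roughly mirrors the generated code (it uses the SABDL/SABAL2/SADALP variants of the pattern), assumes `size` is a multiple of 16, and the function and variable names are made up for this example rather than taken from the Learning Path.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical hand-written NEON version of the SAD kernel (AArch64).
int32_t sad8_neon(const int8_t *a, const int8_t *b, size_t size) {
    int32x4_t acc = vdupq_n_s32(0);          // four 32-bit partial sums
    for (size_t i = 0; i < size; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // |a - b| widened to 16 bits for the low 8 lanes (SABDL) ...
        int16x8_t ad = vabdl_s8(vget_low_s8(va), vget_low_s8(vb));
        // ... and the high 8 lanes accumulated on top (SABAL2).
        ad = vabal_high_s8(ad, va, vb);
        // Pairwise widen-and-accumulate the 16-bit sums into 32-bit lanes (SADALP).
        acc = vpadalq_s16(acc, ad);
    }
    // Horizontal add of the four 32-bit partial sums.
    return vaddvq_s32(acc);
}
```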
@@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s

Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.

-It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.
+It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.

As with most tools, the better you know how to use it, the better the results will be.