Commit 458604c

authored
Update autovectorization-on-arm-1.md
minor editorial amends
1 parent b83a948 commit 458604c

1 file changed: +6 -6 lines changed

content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md

Lines changed: 6 additions & 6 deletions
@@ -72,7 +72,7 @@ dotprod:
 ret
 ```
 
-You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
+You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. In general, this is a good thing but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
 
 Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:
 
@@ -135,7 +135,7 @@ dotprod:
 b .L3
 ```
 
-The code is larger, but you can see that some autovectorization has taken place.
+The code is larger but you can see that some autovectorization has taken place.
 
 The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.
 
@@ -145,7 +145,7 @@ With the new code, you can expect a performance gain of about 4x.
 
 You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.
 
-The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
+The answer is *yes* but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
 
 Modify the `dotprod()` function to add the multiples of 4 hint as shown below:
 
@@ -160,7 +160,7 @@ int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
 }
 ```
 
-Compile again ith `-O3`:
+Compile again with `-O3`:
 
 ```bash
 gcc -O3 -fno-inline dotprod.c -o dotprod
@@ -195,7 +195,7 @@ Is there anything else the compiler can do?
 
 Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.
 
-For example, the `dotprod()` function operates on `int32_t` elements, what if you could limit the range to 8-bit?
+For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?
 
 There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
 
@@ -237,7 +237,7 @@ gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod
 
 You need to compile with the architecture flag to use the dot product instructions.
 
-The assembly output will be quite larger as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
+The assembly output will be quite large as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8 and byte-handling instructions if the size is smaller.
 
 You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
 
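The hunks above show only the prose around the `N -= N % 4` hint, not the function body itself. For readers of this diff, the following is a minimal sketch of what such a hint might look like, assuming a plain accumulate loop. The names `dotprod_ref` and `dotprod_mult4`, the `const` qualifiers, and the loop bodies are illustrative assumptions and are not taken from the changed file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: no hint, so the compiler must also
   emit tail code for sizes that are not a multiple of the vector width. */
int32_t dotprod_ref(const int32_t *A, const int32_t *B, size_t N) {
    int32_t res = 0;
    for (size_t i = 0; i < N; i++)
        res += A[i] * B[i];
    return res;
}

/* Hypothetical variant with the multiples-of-4 hint quoted in the diff.
   Rounding N down tells the compiler the trip count is a multiple of 4.
   Side effect: up to 3 trailing elements are silently ignored. */
int32_t dotprod_mult4(const int32_t *A, const int32_t *B, size_t N) {
    N -= N % 4;
    int32_t res = 0;
    for (size_t i = 0; i < N; i++)
        res += A[i] * B[i];
    return res;
}
```

Because the hint rounds `N` down, the two functions agree only when `N` is already a multiple of 4, so a sketch like this is only appropriate when the caller guarantees that property.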