|
1 | 1 | --- |
2 | | -title: Vector extension code examples |
| 2 | +title: "Vector extension code examples" |
3 | 3 | weight: 4 |
4 | 4 |
|
5 | | -### FIXED, DO NOT MODIFY |
6 | | -layout: learningpathall |
| 5 | +# FIXED, DO NOT MODIFY |
| 6 | +layout: "learningpathall" |
7 | 7 | --- |
8 | 8 |
|
9 | | -## SAXPY Example code |
| 9 | +## SAXPY example code |
10 | 10 |
|
11 | | -As a way to provide some hands-on experience, you can study and run example code to better understand the vector extensions. The example used here is SAXPY. |
| 11 | +This page walks you through a SAXPY (Single-Precision A·X Plus Y) kernel implemented in plain C and with vector extensions on both Arm (NEON, SVE) and x86 (AVX2, AVX-512). You will see how to build and run each version and how the vector width affects throughput. |
12 | 12 |
|
13 | | -SAXPY stands for "Single-Precision A·X Plus Y" and is a fundamental operation in linear algebra. It computes the result of the equation `y[i] = a * x[i] + y[i]` for all elements in the arrays `x` and `y`. |
| 13 | +SAXPY computes `y[i] = a * x[i] + y[i]` across arrays `x` and `y`. It is widely used in numerical computing and is an accessible way to compare SIMD behavior across ISAs. |
14 | 14 |
|
15 | | -SAXPY is widely used in numerical computing, particularly in vectorized and parallelized environments, due to its simplicity and efficiency. |
| 15 | +{{% notice Tip %}} |
| 16 | +If a library already provides a tuned SAXPY (for example, BLAS), prefer that over hand-written kernels. These examples are for learning and porting. |
| 17 | +{{% /notice %}} |
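| | +
| | +As a concrete illustration of the tip above, here is a minimal sketch that calls the standard CBLAS interface instead of a hand-written loop. It assumes a CBLAS implementation such as OpenBLAS is installed (link with `-lopenblas`, or with whatever BLAS your platform provides):
| | +
| | +```c
| | +#include <cblas.h>  // CBLAS interface, shipped by OpenBLAS and other BLAS libraries
| | +#include <stdio.h>
| | +
| | +int main(void) {
| | +    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
| | +    float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};
| | +
| | +    // y = a*x + y with a = 2.0f, unit stride through both arrays
| | +    cblas_saxpy(4, 2.0f, x, 1, y, 1);
| | +
| | +    for (int i = 0; i < 4; i++)
| | +        printf("%.1f\n", y[i]);  // prints 12.0 24.0 36.0 48.0
| | +    return 0;
| | +}
| | +```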
16 | 18 |
|
17 | | -### Reference version |
18 | 19 |
|
19 | | -Below is a plain C implementation of SAXPY without any vector extensions. |
| 20 | +## Reference C version (no SIMD intrinsics)
20 | 21 |
|
21 | | -This serves as a reference for the optimized examples provided later. |
| 22 | +Below is a plain C implementation of SAXPY without any vector extensions. It serves as the reference baseline for the optimized examples that follow:
22 | 23 |
|
23 | 24 | ```c |
24 | 25 | #include <stddef.h> |
@@ -65,13 +66,11 @@ gcc -O3 -o saxpy_plain saxpy_plain.c |
65 | 66 |
|
66 | 67 | You can use Clang for any of the examples by replacing `gcc` with `clang` on the command line. |
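| | +
| | +For example, the reference version builds and runs the same way with Clang:
| | +
| | +```bash
| | +clang -O3 -o saxpy_plain saxpy_plain.c
| | +./saxpy_plain
| | +```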
67 | 68 |
|
68 | | -### Arm NEON version (128-bit SIMD, 4 floats per operation) |
| 69 | +## Arm NEON version (128-bit SIMD, 4 floats per operation) |
69 | 70 |
|
70 | | -NEON operates on fixed 128-bit registers, able to process 4 single-precision float values simultaneously in every vector instruction. |
| 71 | +NEON uses fixed 128-bit registers, processing four `float` values per instruction. It is available on most Armv8-A devices and is excellent for accelerating loops and signal processing tasks in mobile and embedded workloads. |
71 | 72 |
|
72 | | -This extension is available on most Arm-based devices and is excellent for accelerating loops and signal processing tasks in mobile and embedded workloads. |
73 | | - |
74 | | -The example below processes 16 floats per iteration using four separate NEON operations to improve instruction-level parallelism and reduce loop overhead. |
| 73 | +The example below processes 16 floats per iteration using four separate NEON operations to improve instruction-level parallelism and reduce loop overhead: |
75 | 74 |
|
76 | 75 | ```c |
77 | 76 | #include <arm_neon.h> |
@@ -139,7 +138,13 @@ gcc -O3 -march=armv8-a+simd -o saxpy_neon saxpy_neon.c |
139 | 138 | ./saxpy_neon |
140 | 139 | ``` |
141 | 140 |
|
142 | | -### AVX2 (256-bit SIMD, 8 floats per operation) |
| 141 | +{{% notice Note %}}
| 142 | +On AArch64, NEON is mandatory; the flag is shown for clarity. |
| 143 | +{{% /notice %}} |
| 144 | + |
| 147 | +## x86 AVX2 version (256-bit SIMD, 8 floats per operation) |
143 | 148 |
|
144 | 149 | AVX2 doubles the SIMD width compared to NEON, processing 8 single-precision floats at a time in 256-bit registers. |
145 | 150 |
|
@@ -214,6 +219,7 @@ SVE encourages writing vector-length agnostic code: the compiler automatically h |
214 | 219 | ```c |
215 | 220 | #include <arm_sve.h> |
216 | 221 | #include <stddef.h> |
| 222 | +#include <stdint.h> |
217 | 223 | #include <stdio.h> |
218 | 224 | #include <stdlib.h> |
219 | 225 |
|
@@ -270,13 +276,13 @@ gcc -O3 -march=armv8-a+sve -o saxpy_sve saxpy_sve.c |
270 | 276 | ./saxpy_sve |
271 | 277 | ``` |
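| | +
| | +Because SVE code is vector-length agnostic, the same binary runs on hardware with different vector widths. As a small sketch, you can confirm the width at run time with the `svcntw()` intrinsic (build it with the same `+sve` flags shown above):
| | +
| | +```c
| | +#include <arm_sve.h>
| | +#include <stdio.h>
| | +
| | +int main(void) {
| | +    // svcntw() returns the number of 32-bit lanes in an SVE vector,
| | +    // so lanes * 32 is the hardware vector length in bits.
| | +    unsigned long long lanes = svcntw();
| | +    printf("SVE vector length: %llu bits (%llu floats per operation)\n",
| | +           lanes * 32, lanes);
| | +    return 0;
| | +}
| | +```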
272 | 278 |
|
273 | | -### AVX-512 (512-bit SIMD, 16 floats per operation) |
| 279 | +## x86 AVX-512 version (512-bit SIMD, 16 floats per operation) |
274 | 280 |
|
275 | 281 | AVX-512 provides the widest SIMD registers of mainstream x86 architectures, processing 16 single-precision floats per 512-bit operation. |
276 | 282 |
|
277 | 283 | AVX-512 availability varies across x86 processors. It's found on Intel Xeon server processors and some high-end desktop processors, as well as select AMD EPYC models. |
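| | +
| | +Because availability varies, a binary that executes AVX-512 instructions unconditionally will fault on unsupported CPUs. Below is a minimal run-time detection sketch using the `__builtin_cpu_supports` builtin available in GCC and Clang on x86:
| | +
| | +```c
| | +#include <stdio.h>
| | +
| | +int main(void) {
| | +    // Queries the CPUID feature bits at run time (GCC/Clang, x86 only)
| | +    if (__builtin_cpu_supports("avx512f"))
| | +        printf("AVX-512F supported: safe to use the AVX-512 kernel\n");
| | +    else
| | +        printf("No AVX-512F: fall back to AVX2 or plain C\n");
| | +    return 0;
| | +}
| | +```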
278 | 284 |
|
279 | | -For very large arrays and high-performance workloads, AVX-512 delivers extremely high throughput, with additional masking features for efficient tail processing. |
| 285 | +For large arrays and high-performance workloads, AVX-512 delivers extremely high throughput, with additional masking features for efficient tail processing. |
280 | 286 |
|
281 | 287 | ```c |
282 | 288 | #include <immintrin.h> |
@@ -341,12 +347,12 @@ gcc -O3 -mavx512f -o saxpy_avx512 saxpy_avx512.c |
341 | 347 | ./saxpy_avx512 |
342 | 348 | ``` |
343 | 349 |
|
344 | | -### Summary |
| 350 | +## Summary |
345 | 351 |
|
346 | 352 | Wider data lanes mean each operation processes more elements, offering higher throughput on supported hardware. However, actual performance depends on factors like memory bandwidth, the number of execution units, and workload characteristics. |
347 | 353 |
|
348 | 354 | Processors also improve performance by implementing multiple SIMD execution units rather than just making vectors wider. For example, Arm Neoverse V2 has 4 SIMD units while Neoverse N2 has 2 SIMD units. Modern CPUs often combine both approaches (wider vectors and multiple execution units) to maximize parallel processing capability. |
349 | 355 |
|
350 | 356 | Each vector extension requires different intrinsics, compilation flags, and programming approaches. While x86 and Arm vector extensions serve similar purposes and achieve comparable performance gains, you will need to understand the options and details to create portable code. |
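| | +
| | +One common pattern is to choose a kernel at compile time from the compiler's predefined feature macros and keep a plain C fallback. The sketch below is illustrative rather than a complete porting strategy; the function name `saxpy_portable` is an assumption:
| | +
| | +```c
| | +#include <stddef.h>
| | +
| | +#if defined(__AVX2__)
| | +  #include <immintrin.h>
| | +#elif defined(__ARM_NEON)
| | +  #include <arm_neon.h>
| | +#endif
| | +
| | +void saxpy_portable(size_t n, float a, const float *x, float *y) {
| | +    size_t i = 0;
| | +#if defined(__AVX2__)
| | +    __m256 va = _mm256_set1_ps(a);               // broadcast a across 8 lanes
| | +    for (; i + 8 <= n; i += 8) {
| | +        __m256 vx = _mm256_loadu_ps(x + i);
| | +        __m256 vy = _mm256_loadu_ps(y + i);
| | +        _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
| | +    }
| | +#elif defined(__ARM_NEON)
| | +    float32x4_t va = vdupq_n_f32(a);             // broadcast a across 4 lanes
| | +    for (; i + 4 <= n; i += 4) {
| | +        float32x4_t vx = vld1q_f32(x + i);
| | +        float32x4_t vy = vld1q_f32(y + i);
| | +        vst1q_f32(y + i, vmlaq_f32(vy, va, vx)); // y += a * x
| | +    }
| | +#endif
| | +    for (; i < n; i++)                            // scalar tail, and full fallback
| | +        y[i] = a * x[i] + y[i];
| | +}
| | +```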
351 | 357 |
|
352 | | -You should also look for existing libraries that already work across vector extensions before you get too deep into code porting. This is often a good way to leverage the available SIMD capabilities on your target hardware. |
| 358 | +You can also look for existing libraries that already work across vector extensions before you get too deep into code porting. This is often a good way to leverage the available SIMD capabilities on your target hardware. |