Commit 8c55e5d ("Updates")
1 parent: bf00850

content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md
1 file changed: 26 additions, 20 deletions
@@ -1,24 +1,25 @@
 ---
-title: Vector extension code examples
+title: "Vector extension code examples"
 weight: 4
 
-### FIXED, DO NOT MODIFY
-layout: learningpathall
+# FIXED, DO NOT MODIFY
+layout: "learningpathall"
 ---
 
-## SAXPY Example code
+## SAXPY example code
 
-As a way to provide some hands-on experience, you can study and run example code to better understand the vector extensions. The example used here is SAXPY.
+This page walks you through a SAXPY (Single-Precision A·X Plus Y) kernel implemented in plain C and with vector extensions on both Arm (NEON, SVE) and x86 (AVX2, AVX-512). You will see how to build and run each version and how the vector width affects throughput.
 
-SAXPY stands for "Single-Precision A·X Plus Y" and is a fundamental operation in linear algebra. It computes the result of the equation `y[i] = a * x[i] + y[i]` for all elements in the arrays `x` and `y`.
+SAXPY computes `y[i] = a * x[i] + y[i]` across arrays `x` and `y`. It is widely used in numerical computing and is an accessible way to compare SIMD behavior across ISAs.
 
-SAXPY is widely used in numerical computing, particularly in vectorized and parallelized environments, due to its simplicity and efficiency.
+{{% notice Tip %}}
+If a library already provides a tuned SAXPY (for example, BLAS), prefer that over hand-written kernels. These examples are for learning and porting.
+{{% /notice %}}
 
-### Reference version
 
-Below is a plain C implementation of SAXPY without any vector extensions.
+### Reference C version (no SIMD intrinsics)
 
-This serves as a reference for the optimized examples provided later.
+Below is a plain C implementation of SAXPY without any vector extensions, which serves as the reference baseline for the optimized examples provided later:
 
 ```c
 #include <stddef.h>
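
The hunk above ends at the opening of the reference listing; the kernel body itself is elided from the diff. For readers following along, a minimal scalar version consistent with the description would look like this (an illustrative sketch, not necessarily the exact code in the file):

```c
#include <stddef.h>
#include <stdio.h>

/* Scalar reference: y[i] = a * x[i] + y[i] for every element. */
static void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    saxpy(4, 2.0f, x, y);
    printf("%.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]); /* prints 7.0 10.0 13.0 16.0 */
    return 0;
}
```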
@@ -65,13 +66,11 @@ gcc -O3 -o saxpy_plain saxpy_plain.c
 
 You can use Clang for any of the examples by replacing `gcc` with `clang` on the command line.
 
-### Arm NEON version (128-bit SIMD, 4 floats per operation)
+## Arm NEON version (128-bit SIMD, 4 floats per operation)
 
-NEON operates on fixed 128-bit registers, able to process 4 single-precision float values simultaneously in every vector instruction.
+NEON uses fixed 128-bit registers, processing four `float` values per instruction. It is available on most Armv8-A devices and is excellent for accelerating loops and signal processing tasks in mobile and embedded workloads.
 
-This extension is available on most Arm-based devices and is excellent for accelerating loops and signal processing tasks in mobile and embedded workloads.
-
-The example below processes 16 floats per iteration using four separate NEON operations to improve instruction-level parallelism and reduce loop overhead.
+The example below processes 16 floats per iteration using four separate NEON operations to improve instruction-level parallelism and reduce loop overhead:
 
 ```c
 #include <arm_neon.h>
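
The NEON kernel body is likewise elided from the diff. A sketch that matches the description, sixteen floats per iteration through four independent 128-bit fused multiply-adds plus a scalar tail loop, could look like the following (the function name and exact structure are assumptions, not the file's code):

```c
#include <arm_neon.h>
#include <stddef.h>

/* NEON SAXPY sketch: each vfmaq_f32 handles 4 floats; the 4x unroll
   gives the core independent FMAs to execute in parallel. */
static void saxpy_neon(size_t n, float a, const float *x, float *y) {
    float32x4_t va = vdupq_n_f32(a);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        float32x4_t y0 = vfmaq_f32(vld1q_f32(y + i),      va, vld1q_f32(x + i));
        float32x4_t y1 = vfmaq_f32(vld1q_f32(y + i + 4),  va, vld1q_f32(x + i + 4));
        float32x4_t y2 = vfmaq_f32(vld1q_f32(y + i + 8),  va, vld1q_f32(x + i + 8));
        float32x4_t y3 = vfmaq_f32(vld1q_f32(y + i + 12), va, vld1q_f32(x + i + 12));
        vst1q_f32(y + i,      y0);
        vst1q_f32(y + i + 4,  y1);
        vst1q_f32(y + i + 8,  y2);
        vst1q_f32(y + i + 12, y3);
    }
    for (; i < n; i++) {
        y[i] = a * x[i] + y[i]; /* scalar tail for n not divisible by 16 */
    }
}
```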
@@ -139,7 +138,13 @@ gcc -O3 -march=armv8-a+simd -o saxpy_neon saxpy_neon.c
 ./saxpy_neon
 ```
 
-### AVX2 (256-bit SIMD, 8 floats per operation)
+{{% notice Note %}}
+On AArch64, NEON is mandatory; the flag is shown for clarity.
+{{% /notice %}}
+
+## x86 AVX2 version (256-bit SIMD, 8 floats per operation)
 
 AVX2 doubles the SIMD width compared to NEON, processing 8 single-precision floats at a time in 256-bit registers.
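
The AVX2 listing itself sits outside this hunk. To illustrate the eight-lane pattern the text describes, a sketch might look like this (it assumes FMA support alongside AVX2, so compile with `-mavx2 -mfma`; names are placeholders, not the file's code):

```c
#include <immintrin.h>
#include <stddef.h>

/* AVX2 SAXPY sketch: one 256-bit FMA updates 8 floats per iteration. */
static void saxpy_avx2(size_t n, float a, const float *x, float *y) {
    __m256 va = _mm256_set1_ps(a);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy)); /* a*x + y */
    }
    for (; i < n; i++) {
        y[i] = a * x[i] + y[i]; /* scalar tail */
    }
}
```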

@@ -214,6 +219,7 @@ SVE encourages writing vector-length agnostic code: the compiler automatically h
 ```c
 #include <arm_sve.h>
 #include <stddef.h>
+#include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
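
The added `#include <stdint.h>` suggests the SVE example passes fixed-width integers to intrinsics such as `svwhilelt_b32_u64`. A vector-length-agnostic sketch in that style (illustrative only; the file's actual kernel is elided from the diff):

```c
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* SVE SAXPY sketch: the predicate from svwhilelt covers the tail, so
   the same loop works for any hardware vector length with no cleanup. */
static void saxpy_sve(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i += svcntw()) {       /* svcntw() = floats per vector */
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        svst1_f32(pg, y + i, svmla_n_f32_x(pg, vy, vx, a)); /* y + x*a */
    }
}
```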

@@ -270,13 +276,13 @@ gcc -O3 -march=armv8-a+sve -o saxpy_sve saxpy_sve.c
 ./saxpy_sve
 ```
 
-### AVX-512 (512-bit SIMD, 16 floats per operation)
+## x86 AVX-512 version (512-bit SIMD, 16 floats per operation)
 
 AVX-512 provides the widest SIMD registers of mainstream x86 architectures, processing 16 single-precision floats per 512-bit operation.
 
 AVX-512 availability varies across x86 processors. It's found on Intel Xeon server processors and some high-end desktop processors, as well as select AMD EPYC models.
 
-For very large arrays and high-performance workloads, AVX-512 delivers extremely high throughput, with additional masking features for efficient tail processing.
+For large arrays and high-performance workloads, AVX-512 delivers extremely high throughput, with additional masking features for efficient tail processing.
 
 ```c
 #include <immintrin.h>
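
Beyond its include, the AVX-512 kernel is also elided. A sketch of the masked-tail pattern the text mentions (illustrative; not the file's exact code):

```c
#include <immintrin.h>
#include <stddef.h>

/* AVX-512 SAXPY sketch: 16 floats per 512-bit FMA; the final partial
   vector is handled with a lane mask instead of a scalar loop. */
static void saxpy_avx512(size_t n, float a, const float *x, float *y) {
    __m512 va = _mm512_set1_ps(a);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    if (i < n) {
        /* Mask with only the low (n - i) lanes set; inactive lanes are untouched. */
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);
        __m512 vx = _mm512_maskz_loadu_ps(m, x + i);
        __m512 vy = _mm512_maskz_loadu_ps(m, y + i);
        _mm512_mask_storeu_ps(y + i, m, _mm512_fmadd_ps(va, vx, vy));
    }
}
```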
@@ -341,12 +347,12 @@ gcc -O3 -mavx512f -o saxpy_avx512 saxpy_avx512.c
 ./saxpy_avx512
 ```
 
-### Summary
+## Summary
 
 Wider data lanes mean each operation processes more elements, offering higher throughput on supported hardware. However, actual performance depends on factors like memory bandwidth, the number of execution units, and workload characteristics.
 
 Processors also improve performance by implementing multiple SIMD execution units rather than just making vectors wider. For example, Arm Neoverse V2 has 4 SIMD units while Neoverse N2 has 2 SIMD units. Modern CPUs often combine both approaches (wider vectors and multiple execution units) to maximize parallel processing capability.
 
 Each vector extension requires different intrinsics, compilation flags, and programming approaches. While x86 and Arm vector extensions serve similar purposes and achieve comparable performance gains, you will need to understand the options and details to create portable code.
 
-You should also look for existing libraries that already work across vector extensions before you get too deep into code porting. This is often a good way to leverage the available SIMD capabilities on your target hardware.
+You can also look for existing libraries that already work across vector extensions before you get too deep into code porting. This is often a good way to leverage the available SIMD capabilities on your target hardware.
