
Commit 60372f3

Pending changes exported from your codespace
1 parent f889739 commit 60372f3

3 files changed

+32
-22
lines changed


content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md

Lines changed: 25 additions & 15 deletions
@@ -1,5 +1,5 @@
---
- title: "Migrating SIMD code to the Arm architecture"
+ title: "Migrate SIMD code to the Arm architecture"
weight: 3

# FIXED, DO NOT MODIFY
@@ -8,18 +8,20 @@ layout: "learningpathall"

## Vectorization on x86 and Arm

- Migrating SIMD (Single Instruction, Multiple Data) code from x86 extensions to Arm extensions is an important task for software developers aiming to optimize performance on Arm platforms.
+ Migrating SIMD (Single Instruction, Multiple Data) code from x86 extensions to Arm extensions is a key task for software developers aiming to optimize performance on Arm platforms.

- Understanding the mapping from x86 instruction sets such as SSE, AVX, and AMX to Arm’s NEON, SVE, and SME extensions is essential for achieving portability and high performance. This Learning Path provides an overview to help you design a migration plan, leveraging Arm features such as scalable vector lengths and advanced matrix operations to adapt your code effectively.
+ Understanding the mapping from x86 instruction sets such as SSE, AVX, and AMX to Arm’s NEON, SVE, and SME extensions is essential for achieving portability and high performance. This Learning Path provides an overview to help you design a migration plan in which you can leverage Arm features such as scalable vector lengths and advanced matrix operations to adapt your code effectively.

Vectorization is a key optimization strategy where one instruction processes multiple data elements simultaneously. It drives performance in High-Performance Computing (HPC), AI and ML, signal processing, and data analytics.

Both x86 and Arm processors offer rich SIMD capabilities, but they differ in philosophy and design. The x86 architecture provides fixed-width vector units of 128, 256, and 512 bits. The Arm architecture offers fixed-width vectors for NEON and scalable vectors for SVE and SME, ranging from 128 to 2048 bits.

- If you are migrating SIMD software to Arm, understanding these differences helps you write portable, high-performance code.
+ If you are migrating SIMD software to Arm, understanding these differences will help you write portable, high-performance code.

## Arm vector and matrix extensions

+ This section describes the Arm vector and matrix extensions: when to use each, how they map from SSE/AVX/AMX, and what changes in your programming model (predication, gather/scatter, tiles, streaming mode).
+ 
### NEON

NEON is a 128-bit SIMD extension available across Armv8-A cores, including Neoverse and mobile. It is well suited to multimedia, DSP, and packet processing. Conceptually, NEON is closest to x86 SSE and AVX used in 128-bit mode, making it the primary target when migrating many SSE workloads. Compiler auto-vectorization to NEON is mature, reducing the need for manual intrinsics.
@@ -34,6 +36,8 @@ SME accelerates matrix multiplication and is similar in intent to AMX. Unlike AM

## x86 vector and matrix extensions

+ Here is a brief overview of the x86 families you’ll likely port from: SSE (128-bit), AVX/AVX-512 (256/512-bit with masking), and AMX (tile-based matrix compute). Use this to identify feature equivalents before mapping kernels to NEON, SVE/SVE2, or SME on Arm.
+ 
### Streaming SIMD Extensions (SSE)

The SSE instruction set provides 128-bit XMM registers and supports both integer and floating-point SIMD operations. Despite being an older technology, SSE remains a baseline for many libraries due to its widespread adoption.
@@ -50,21 +54,23 @@ AMX accelerates matrix operations with tile registers configured using a tile pa

## Comparison tables

- ### SSE vs. NEON
+ Use these side-by-side tables to pick the right Arm target and plan refactors. They compare register width, predication/masking, gather/scatter, key operations, typical workloads, and limitations for SSE ↔ NEON, AVX/AVX-512 ↔ SVE/SVE2, and AMX ↔ SME.
+ 
+ ### A comparison of SSE and NEON

| Feature | SSE | NEON |
|---|---|---|
| **Register width** | 128-bit (XMM) | 128-bit (Q) |
| **Vector length model** | Fixed 128 bits | Fixed 128 bits |
| **Predication or masking** | Minimal, no dedicated mask registers | No dedicated mask registers; use bitwise selects and conditionals |
- | **Gather or scatter** | No native gather or scatter; gather in AVX2 and scatter in AVX-512 | No native gather or scatter; emulate in software |
+ | **Gather/scatter** | No native gather/scatter; gather in AVX2 and scatter in AVX-512 | No native gather/scatter; emulate in software |
| **Instruction set scope** | Arithmetic, logical, shuffle, convert, basic SIMD | Arithmetic, logical, shuffle, saturating ops; cryptography via Armv8 Cryptography Extensions (AES and SHA) |
| **Floating-point support** | Single and double precision | Single and double precision |
| **Typical applications** | Legacy SIMD, general vector arithmetic | Multimedia, DSP, cryptography, embedded compute |
| **Extensibility** | Extended by AVX, AVX2, and AVX-512 | Fixed at 128-bit; scalable vectors provided by SVE as a separate extension |
- | **Programming model** | Intrinsics in C or C plus plus; assembly for hotspots | Intrinsics widely used; inline assembly less common |
+ | **Programming model** | Intrinsics in C/C++; assembly for hotspots | Intrinsics widely used; inline assembly less common |

- ### AVX vs. SVE (SVE2)
+ ### A comparison of AVX and SVE (SVE2)

| Feature | x86: AVX or AVX-512 | Arm: SVE or SVE2 |
|---|---|---|
@@ -80,7 +86,7 @@ AMX accelerates matrix operations with tile registers configured using a tile pa
SVE2 extends SVE with richer integer and DSP capabilities for general-purpose and media workloads.
{{% /notice %}}

- ### AMX vs. SME
+ ### A comparison of AMX and SME

| Feature | x86: AMX | Arm: SME |
|---|---|---|
@@ -92,7 +98,9 @@ SVE2 extends SVE with richer integer and DSP capabilities for general-purpose an
| **Best suited for** | AI and ML training and inference, GEMM and convolution kernels | AI and ML training and inference, scientific and HPC dense linear algebra |
| **Limitations** | Hardware and software availability limited to specific CPUs | Emerging hardware support; compiler and library support evolving |

- ## Key differences for developers
+ ## The key differences for developers
+ 
+ The most significant changes when porting include moving from fixed-width SIMD to vector-length-agnostic loop structures, replacing mask-register control with predicate-driven control, and adjusting memory access patterns and compiler flags. Review this section first to minimize rework and preserve portable performance.

### Vector length model

@@ -106,16 +114,18 @@ x86 intrinsics are extensive, and AVX-512 adds masks and lane controls that incr

AMX provides fixed-geometry tile compute optimized for dot products. SME extends Arm’s scalable model with outer-product math, scalable tiles, and streaming mode. Both AMX and SME are currently available on a limited set of platforms.

- ### Overall summary
+ ## Summary

Migrating from x86 SIMD to Arm entails adopting Arm’s scalable and predicated programming model with SVE and SME for forward-portable performance, while continuing to use NEON for fixed-width SIMD similar to SSE.

## Migration tools

Several libraries help translate or abstract SIMD intrinsics to speed up migration. Coverage varies, and some features have no direct analogue.

- - **sse2neon:** open-source header that maps many SSE2 intrinsics to NEON equivalents. Good for getting code building quickly. Review generated code for performance. <https://github.com/DLTcollab/sse2neon>
- - **SIMD Everywhere (SIMDe):** header-only portability layer that implements many x86 and Arm intrinsics across ISAs, with scalar fallbacks when SIMD is unavailable. <https://github.com/simd-everywhere/simde>
- - **Google Highway (hwy):** portable SIMD library and APIs that target multiple ISAs, including NEON, SVE where supported, and AVX, without per-ISA code paths. <https://github.com/google/highway>
+ Here are some of the tools available and their key features:
+ 
+ - sse2neon: an open-source header that maps many SSE2 intrinsics to NEON equivalents. Good for getting code building quickly. Review generated code for performance. See the [sse2neon GitHub repository](https://github.com/DLTcollab/sse2neon).
+ - SIMD Everywhere (SIMDe): a header-only portability layer that implements many x86 and Arm intrinsics across ISAs, with scalar fallbacks when SIMD is unavailable. See the [SIMDe GitHub repository](https://github.com/simd-everywhere/simde).
+ - Google Highway (hwy): a portable SIMD library and APIs that target multiple ISAs, including NEON, SVE where supported, and AVX, without per-ISA code paths. See the [Google Highway GitHub repository](https://github.com/google/highway).

- For more on cross-platform intrinsics, see [Porting architecture-specific intrinsics](/learning-paths/cross-platform/intrinsics/).
+ For more on cross-platform intrinsics, see the Learning Path [Porting architecture-specific intrinsics](/learning-paths/cross-platform/intrinsics/).

content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md

Lines changed: 6 additions & 6 deletions
@@ -1,5 +1,5 @@
---
- title: "Vector extension code examples"
+ title: "Explore vector extension code examples"
weight: 4

# FIXED, DO NOT MODIFY
@@ -13,11 +13,11 @@ This page walks you through a SAXPY (Single-Precision A·X Plus Y) kernel implem
SAXPY computes `y[i] = a * x[i] + y[i]` across arrays `x` and `y`. It is widely used in numerical computing and is an accessible way to compare SIMD behavior across ISAs.

{{% notice Tip %}}
- If a library already provides a tuned SAXPY (for example, BLAS), prefer that over hand-written kernels. These examples are for learning and porting.
+ If a library already provides a tuned SAXPY (for example, BLAS), use that over hand-written kernels. These examples are for learning and porting.
{{% /notice %}}


- ### Reference C version (no SIMD intrinsics)
+ ## Reference C version (no SIMD intrinsics)

Below is a plain C implementation of SAXPY without any vector extensions, which serves as a reference baseline for the optimized examples provided later:

@@ -57,7 +57,7 @@ int main() {
}
```

- Use a text editor to copy the code to a file `saxpy_plain.c` and build and run the code using:
+ Use a text editor to copy the code to a file called `saxpy_plain.c` and build and run the code using:

```bash
gcc -O3 -o saxpy_plain saxpy_plain.c
@@ -138,7 +138,7 @@ gcc -O3 -march=armv8-a+simd -o saxpy_neon saxpy_neon.c
./saxpy_neon
```

- {{% notice optional_title %}}
+ {{% notice Note %}}
On AArch64, NEON is mandatory; the flag is shown for clarity.
{{% /notice %}}

@@ -208,7 +208,7 @@ gcc -O3 -mavx2 -mfma -o saxpy_avx2 saxpy_avx2.c
./saxpy_avx2
```

- ### Arm SVE (hardware dependent: 4 to 16+ floats per operation)
+ ## Arm SVE (hardware dependent: 4 to 16+ floats per operation)

Arm SVE lets the hardware determine the register width, which can range from 128 up to 2048 bits. This means each operation can process from 4 to 64 single-precision floats at a time, depending on the implementation.

content/learning-paths/cross-platform/vectorization-comparison/_index.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ title: "Migrate x86-64 SIMD to Arm64"

minutes_to_complete: 30

- who_is_this_for: Advanced software developers migrating vectorized (SIMD) code from x86-64 to Arm64.
+ who_is_this_for: This is an advanced topic for developers migrating vectorized (SIMD) code from x86-64 to Arm64.

learning_objectives:
- Identify how Arm vector extensions including NEON, Scalable Vector Extension (SVE), and Scalable Matrix Extension (SME) map to vector extensions from other architectures
