feat: Migrate codebase from x86 AVX2 to ARM64 NEON by JoeStech · Pull Request #1 · JoeStech/docker-blog-arm-migration

JoeStech · 2026-01-20T21:56:53Z

Summary

This PR ports the matrix operations codebase from x86-only (AVX2) so that it also builds and runs on ARM64 (NEON), enabling deployment on ARM-based infrastructure such as AWS Graviton, Ampere Altra, Azure Cobalt, and Apple Silicon.

Changes Made

1. Docker Configuration

Base Image: Replaced centos:6 (x86-only, EOL) with ubuntu:22.04 (multi-arch supported)
Build System: Added TARGETARCH build argument for multi-architecture Docker builds
Compiler Flags:
- ARM64: -march=armv8-a+simd for NEON SIMD
- x86-64: -mavx2 for AVX2 (unchanged)

2. Code Changes

matrix_operations.cpp

| x86 AVX2 | ARM NEON | Description |
|---|---|---|
| `_mm256_setzero_pd()` | `vdupq_n_f64(0.0)` | Zero vector initialization |
| `_mm256_loadu_pd()` | `vld1q_f64()` | Unaligned vector load |
| `_mm256_mul_pd()` + `_mm256_add_pd()` | `vfmaq_f64()` | Fused multiply-add (one instruction instead of two) |
| `_mm256_extractf128_pd()` + horizontal add | `vpaddd_f64()` | Horizontal sum reduction |
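
To make the mapping concrete, here is a minimal sketch of a dot product that uses each intrinsic from the table on its own architecture. The function name `dot_product` and its signature are hypothetical and are not taken from `matrix_operations.cpp`. The x86 branch is additionally guarded by `__AVX2__`, which is an assumption made here so the sketch still compiles when `-mavx2` is not passed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__x86_64__) && defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper for illustration; not the PR's actual kernel.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::size_t i = 0;
    double sum = 0.0;

#if defined(__aarch64__)
    float64x2_t acc = vdupq_n_f64(0.0);                  // zero vector
    for (; i + 2 <= n; i += 2) {
        float64x2_t va = vld1q_f64(a.data() + i);        // unaligned load
        float64x2_t vb = vld1q_f64(b.data() + i);
        acc = vfmaq_f64(acc, va, vb);                    // acc += va * vb
    }
    sum = vpaddd_f64(acc);                               // add the two lanes
#elif defined(__x86_64__) && defined(__AVX2__)
    __m256d acc = _mm256_setzero_pd();                   // zero vector
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a.data() + i);      // unaligned load
        __m256d vb = _mm256_loadu_pd(b.data() + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(va, vb)); // separate mul + add
    }
    // Horizontal sum: fold the upper 128 bits onto the lower, then add lanes.
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    __m128d pair = _mm_add_pd(lo, hi);
    sum = _mm_cvtsd_f64(_mm_add_sd(pair, _mm_unpackhi_pd(pair, pair)));
#endif

    // Scalar tail; this is also the whole computation on other architectures.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

The trailing scalar loop handles lengths that are not a multiple of the vector width and doubles as the generic fallback, so the same function body serves all three compilation modes.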

main.cpp

Removed #error that blocked non-x86 compilation
Added ARM64 architecture detection and messaging
Added scalar fallback for other architectures

3. Architecture Support

The code now supports three compilation modes:

ARM64 (__aarch64__): NEON SIMD (128-bit, 2 doubles)
x86-64 (__x86_64__): AVX2 SIMD (256-bit, 4 doubles)
Generic: Scalar fallback
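
The three modes can be sketched as a compile-time selection on the predefined macros listed above. The `SimdMode` struct and `current_mode` function are hypothetical names used only for illustration. The ARM64 banner string is the one quoted in the validation steps; the other two banner strings are assumptions.

```cpp
#include <cstddef>

// Hypothetical descriptor of the SIMD mode chosen at compile time.
struct SimdMode {
    const char* banner;              // message a program might print at startup
    std::size_t bits;                // vector register width in bits
    std::size_t doubles_per_vector;  // number of 64-bit lanes
};

constexpr SimdMode current_mode() {
#if defined(__aarch64__)
    return {"Running on ARM64 architecture with NEON optimizations", 128, 2};
#elif defined(__x86_64__)
    // Assumed wording; the build is expected to pass -mavx2 on this path.
    return {"Running on x86-64 architecture with AVX2 optimizations", 256, 4};
#else
    // Assumed wording for the scalar fallback.
    return {"Running on a generic architecture with scalar code", 64, 1};
#endif
}

constexpr SimdMode kMode = current_mode();

// Sanity check: the lane count must match the register width for doubles.
static_assert(kMode.bits == kMode.doubles_per_vector * 64,
              "vector width and lane count disagree");
```

Because the choice is made by the preprocessor, exactly one branch is compiled into each image, which is why the Docker build needs the per-architecture compiler flags described earlier.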

Performance Predictions

ARM64 (NEON) vs x86 (AVX2)

SIMD Width: NEON processes 2 doubles vs AVX2's 4 doubles per instruction
Compensating Factors:
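- ARM's vfmaq_f64 fuses the multiply and the add into one instruction, while the AVX2 path issues a separate _mm256_mul_pd and _mm256_add_pd, so both paths need one arithmetic instruction per two doubles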
- Modern ARM cores (Graviton 3/4, M3) have excellent memory subsystems
- Higher clock efficiency on ARM

Expected Performance

For 200x200 matrix multiplication:

x86-64 (AVX2): Baseline performance
ARM64 (NEON): ~80-120% of x86 performance depending on CPU
Graviton 3/4: Likely comparable or faster due to FMA optimization
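
The figures above are predictions, so a small timing harness is one way to check them on real hardware. The sketch below is an assumption about how such a measurement could look and is not the benchmark shipped in this repository: `Matrix`, `multiply`, and `best_time_ms` are hypothetical names, and the kernel is a plain scalar loop standing in for the SIMD one.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<double>;  // row-major, n x n

// Scalar reference multiply (i-k-j order for contiguous inner access).
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Returns the fastest of `repeats` runs in milliseconds.
double best_time_ms(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    Matrix a(n * n, 1.0), b(n * n, 2.0);
    double best = std::numeric_limits<double>::infinity();
    volatile double sink = 0.0;  // keeps the result observable to the optimizer
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        Matrix c = multiply(a, b, n);
        auto t1 = std::chrono::steady_clock::now();
        sink = c[0];
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

Calling `best_time_ms(200, 10)` inside both the `benchmark:arm64` and `benchmark:amd64` images would give two numbers to compare against the 80-120% estimate. Taking the minimum over several runs reduces noise from scheduling and cold caches.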

Cost Savings (AWS)

Migrating from x86 to ARM (Graviton) typically provides:

~20% cost reduction for equivalent performance
Better price/performance for memory-bound workloads
Lower power consumption per operation

Tools Used

| Tool | Purpose |
|---|---|
| `migrate-ease-scan (cpp)` | Detected 10 x86 intrinsic issues requiring migration |
| `skopeo` | Verified ubuntu:22.04 ARM64 support, centos:6 incompatibility |
| `knowledge_base_search` | Found NEON intrinsic equivalents (vfmaq_f64, vpaddd_f64, etc.) |

Validation Steps

Build Test (ARM64):

docker buildx build --platform linux/arm64 -t benchmark:arm64 .

Build Test (x86-64):

docker buildx build --platform linux/amd64 -t benchmark:amd64 .

Multi-arch Build:

docker buildx build --platform linux/amd64,linux/arm64 -t benchmark:multi .

Runtime Verification:

docker run --rm benchmark:arm64
# Should output: "Running on ARM64 architecture with NEON optimizations"

Migration Scan Results

Issues Found: 10 total
- IncompatibleHeaderFileIssue: 1 (immintrin.h)
- PreprocessorErrorIssue: 1 (#error directive)
- IntrinsicIssue: 8 (AVX2 intrinsics)

All issues resolved ✅

Breaking Changes

None - the code maintains full backward compatibility with x86-64 systems.

Future Enhancements

Consider SVE/SVE2 for Graviton 3+ (variable-length vectors)
Add runtime CPU feature detection
Benchmark comparison between architectures

- Replace centos:6 base image with ubuntu:22.04 (ARM64 compatible)
- Convert AVX2 intrinsics to ARM NEON equivalents in matrix_operations.cpp
- Add architecture detection for portable builds (ARM64, x86-64, generic)
- Use vfmaq_f64 (fused multiply-add) for better ARM64 performance
- Add multi-arch Docker build support with TARGETARCH

Intrinsic mappings:
- _mm256_setzero_pd -> vdupq_n_f64(0.0)
- _mm256_loadu_pd -> vld1q_f64
- _mm256_mul_pd + _mm256_add_pd -> vfmaq_f64 (FMA)
- Horizontal sum via AVX extract -> vpaddd_f64

Tools used: migrate-ease-scan, skopeo, knowledge_base_search

JoeStech wants to merge 1 commit into main from feature/arm64-migration
