feat: Migrate codebase from x86 AVX2 to ARM64 NEON #1

Open
JoeStech wants to merge 1 commit into main from feature/arm64-migration

Summary

This PR ports the matrix operations codebase from x86-only (AVX2) to also support ARM64 (NEON), enabling deployment on ARM-based infrastructure such as AWS Graviton, Ampere Altra, Azure Cobalt, and Apple Silicon. The x86-64 AVX2 path is retained.

Changes Made

1. Docker Configuration

  • Base Image: Replaced centos:6 (x86-only, EOL) with ubuntu:22.04 (multi-arch supported)
  • Build System: Added TARGETARCH build argument for multi-architecture Docker builds
  • Compiler Flags:
    • ARM64: -march=armv8-a+simd for NEON SIMD
    • x86-64: -mavx2 for AVX2 (unchanged)

2. Code Changes

matrix_operations.cpp

| x86 AVX2 | ARM NEON | Description |
|---|---|---|
| _mm256_setzero_pd() | vdupq_n_f64(0.0) | Zero vector initialization |
| _mm256_loadu_pd() | vld1q_f64() | Unaligned vector load |
| _mm256_mul_pd() + _mm256_add_pd() | vfmaq_f64() | Fused multiply-add (better perf) |
| _mm256_extractf128_pd() + horizontal add | vpaddd_f64() | Horizontal sum reduction |
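
As an illustration of how the mappings in the table above fit together, here is a minimal sketch of a dot-product kernel. The function name `dot_product` and its signature are hypothetical and not taken from matrix_operations.cpp; the actual code in this PR may be structured differently.

```cpp
#include <cstddef>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__x86_64__) && defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from the PR): dot product of two double arrays.
double dot_product(const double* a, const double* b, std::size_t n) {
    std::size_t i = 0;
    double sum = 0.0;
#if defined(__aarch64__)
    float64x2_t acc = vdupq_n_f64(0.0);          // zero vector
    for (; i + 2 <= n; i += 2) {
        float64x2_t va = vld1q_f64(a + i);       // load 2 doubles
        float64x2_t vb = vld1q_f64(b + i);
        acc = vfmaq_f64(acc, va, vb);            // acc += va * vb (fused)
    }
    sum = vpaddd_f64(acc);                       // add the two lanes
#elif defined(__x86_64__) && defined(__AVX2__)
    __m256d acc = _mm256_setzero_pd();           // zero vector
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);     // load 4 doubles
        __m256d vb = _mm256_loadu_pd(b + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(va, vb));  // separate mul + add
    }
    __m128d lo = _mm256_castpd256_pd128(acc);    // lower 2 lanes
    __m128d hi = _mm256_extractf128_pd(acc, 1);  // upper 2 lanes
    __m128d pair = _mm_add_pd(lo, hi);
    sum = _mm_cvtsd_f64(_mm_hadd_pd(pair, pair));  // horizontal add
#endif
    // Scalar loop: handles the tail, or everything on other architectures.
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

On ARM64 the NEON branch is selected by `__aarch64__`; on x86-64 the AVX2 branch is only compiled when the compiler is given -mavx2, and otherwise the scalar loop does all the work.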

main.cpp

  • Removed #error that blocked non-x86 compilation
  • Added ARM64 architecture detection and messaging
  • Added scalar fallback for other architectures

3. Architecture Support

The code now supports three compilation modes:

  1. ARM64 (__aarch64__): NEON SIMD (128-bit, 2 doubles)
  2. x86-64 (__x86_64__): AVX2 SIMD (256-bit, 4 doubles)
  3. Generic: Scalar fallback
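
The three compilation modes above can be sketched with preprocessor guards as follows. This is an assumption about the general shape, not the PR's actual main.cpp: the helper names are hypothetical, and only the ARM64 message text comes from the validation steps in this description; the other two messages are placeholders.

```cpp
#include <string>

// Hypothetical helper: reports which code path was compiled in.
std::string architecture_message() {
#if defined(__aarch64__)
    return "Running on ARM64 architecture with NEON optimizations";
#elif defined(__x86_64__)
    // Placeholder wording, assumed for illustration.
    return "Running on x86-64 architecture with AVX2 optimizations";
#else
    // Placeholder wording, assumed for illustration.
    return "Running on generic architecture with scalar fallback";
#endif
}

// Hypothetical helper: doubles processed per SIMD register in each mode.
constexpr int simd_lanes() {
#if defined(__aarch64__)
    return 2;   // 128-bit NEON register
#elif defined(__x86_64__)
    return 4;   // 256-bit AVX2 register
#else
    return 1;   // scalar fallback
#endif
}
```

Because the selection happens at compile time, each image produced by a multi-arch build contains exactly one of the three paths.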

Performance Predictions

ARM64 (NEON) vs x86 (AVX2)

  • SIMD Width: NEON processes 2 doubles per instruction vs AVX2's 4
  • Compensating Factors:
    • ARM's vfmaq_f64 fuses the multiply and add that the AVX2 path issues as two instructions (_mm256_mul_pd + _mm256_add_pd), reducing instruction count
    • Modern ARM cores (Graviton 3/4, M3) have excellent memory subsystems
    • Higher clock efficiency on ARM
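
To make the instruction-count argument concrete, here is a small worked calculation. It is a rough sketch that counts only vector arithmetic instructions in an inner loop over n doubles (n a multiple of 4) and ignores loads, latency, and throughput, so it is not a performance model. The function names are hypothetical.

```cpp
#include <cstddef>

// NEON path: one vfmaq_f64 per 2 doubles.
constexpr std::size_t neon_fma_ops(std::size_t n) {
    return n / 2;
}

// AVX2 path as mapped in this PR: one _mm256_mul_pd plus one
// _mm256_add_pd per 4 doubles.
constexpr std::size_t avx2_mul_add_ops(std::size_t n) {
    return 2 * (n / 4);
}

// For one 200-element row-by-column product, both paths issue the same
// number of vector arithmetic instructions.
static_assert(neon_fma_ops(200) == 100, "100 fused multiply-adds");
static_assert(avx2_mul_add_ops(200) == 100, "50 multiplies + 50 adds");
```

Under these assumptions the halved vector width is offset by fusing the multiply and add, although the NEON path still needs twice as many load instructions for the same data.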

Expected Performance

For 200x200 matrix multiplication:

  • x86-64 (AVX2): Baseline performance
  • ARM64 (NEON): estimated ~80-120% of x86 performance depending on CPU (not yet benchmarked)
  • Graviton 3/4: Likely comparable or faster due to FMA optimization

Cost Savings (AWS)

Migrating from x86 to ARM (Graviton) typically provides:

  • ~20% cost reduction for equivalent performance
  • Better price/performance for memory-bound workloads
  • Lower power consumption per operation

Tools Used

| Tool | Purpose |
|---|---|
| migrate-ease-scan (cpp) | Detected 10 x86 intrinsic issues requiring migration |
| skopeo | Verified ubuntu:22.04 ARM64 support, centos:6 incompatibility |
| knowledge_base_search | Found NEON intrinsic equivalents (vfmaq_f64, vpaddd_f64, etc.) |

Validation Steps

  1. Build Test (ARM64):

    docker buildx build --platform linux/arm64 -t benchmark:arm64 .
  2. Build Test (x86-64):

    docker buildx build --platform linux/amd64 -t benchmark:amd64 .
  3. Multi-arch Build:

    docker buildx build --platform linux/amd64,linux/arm64 -t benchmark:multi .
  4. Runtime Verification:

    docker run --rm benchmark:arm64
    # Should output: "Running on ARM64 architecture with NEON optimizations"

Migration Scan Results

Issues Found: 10 total
- IncompatibleHeaderFileIssue: 1 (immintrin.h)
- PreprocessorErrorIssue: 1 (#error directive)
- IntrinsicIssue: 8 (AVX2 intrinsics)

All issues resolved ✅

Breaking Changes

None - the code maintains full backward compatibility with x86-64 systems.

Future Enhancements

  • Consider SVE/SVE2 for Graviton 3+ (variable-length vectors)
  • Add runtime CPU feature detection
  • Benchmark comparison between architectures
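
For the runtime CPU feature detection item, one possible approach is sketched below. It assumes Linux on ARM64 (where `getauxval` exposes hardware capability bits) and GCC or Clang on x86-64 (where `__builtin_cpu_supports` is available). The function name `best_simd_at_runtime` and the returned strings are hypothetical, and this is not part of the PR.

```cpp
#include <string>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMD, HWCAP_SVE
#endif

// Hypothetical helper: names the widest SIMD extension found at runtime.
std::string best_simd_at_runtime() {
#if defined(__aarch64__) && defined(__linux__)
    const unsigned long caps = getauxval(AT_HWCAP);
#if defined(HWCAP_SVE)
    if (caps & HWCAP_SVE) {
        return "sve";
    }
#endif
    if (caps & HWCAP_ASIMD) {
        return "neon";
    }
    return "scalar";
#elif defined(__aarch64__)
    // No capability query used here; NEON is assumed present on
    // AArch64 application processors.
    return "neon";
#elif defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();  // initialise the compiler's CPU feature data
    return __builtin_cpu_supports("avx2") ? "avx2" : "scalar";
#else
    return "scalar";
#endif
}
```

A dispatcher could call this once at startup and select a kernel accordingly, which would only help if more than one kernel variant is compiled into the same binary.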

- Replace centos:6 base image with ubuntu:22.04 (ARM64 compatible)
- Convert AVX2 intrinsics to ARM NEON equivalents in matrix_operations.cpp
- Add architecture detection for portable builds (ARM64, x86-64, generic)
- Use vfmaq_f64 (fused multiply-add) for better ARM64 performance
- Add multi-arch Docker build support with TARGETARCH

Intrinsic mappings:
- _mm256_setzero_pd -> vdupq_n_f64(0.0)
- _mm256_loadu_pd -> vld1q_f64
- _mm256_mul_pd + _mm256_add_pd -> vfmaq_f64 (FMA)
- Horizontal sum via AVX extract -> vpaddd_f64

Tools used: migrate-ease-scan, skopeo, knowledge_base_search