Description
Background
LLVM IR has first-class vector types (`<4 x i32>`, `<8 x float>`, etc.) and vector instructions (`shufflevector`, `insertelement`, `extractelement`). The parser and IR types already support these, but the x86 codegen backend currently has no lowering for them — any vector instruction panics or emits a NOP.
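For reference, a minimal `.ll` fragment exercising the vector type and the three vector instructions named above (function and value names are illustrative):

```llvm
define <4 x i32> @demo(<4 x i32> %a, <4 x i32> %b) {
  ; lane-wise add on a first-class vector type
  %sum = add <4 x i32> %a, %b
  ; pull lane 0 out, bump it, put it back
  %lane0  = extractelement <4 x i32> %sum, i32 0
  %bumped = add i32 %lane0, 1
  %v2     = insertelement <4 x i32> %sum, i32 %bumped, i32 0
  ; reverse the lanes with a constant shuffle mask
  %rev = shufflevector <4 x i32> %v2, <4 x i32> undef,
                       <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i32> %rev
}
```

Every instruction in this fragment currently reaches the backend with no lowering path.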
Vectorization is one of the highest-leverage optimizations in a compiler. A single AVX2 instruction can process 8 floats simultaneously, delivering up to 8× throughput for numerical workloads.
Goals
Phase 1 — Scalar lowering of vector IR (baseline)
Scalarize vector operations into scalar loops so they at least produce correct (if slow) code. This is the fallback when the target doesn't support SIMD.
Phase 2 — SSE4.2 codegen
Lower common vector patterns to SSE4.2 instructions:
- `<4 x i32>` arithmetic → `PADDD`, `PSUBD`, `PMULLD`
- `<4 x float>` arithmetic → `ADDPS`, `MULPS`, `DIVPS`
- `<2 x double>` arithmetic → `ADDPD`, `MULPD`
- Loads/stores → `MOVDQU`, `MOVAPS`
Phase 3 — AVX2 codegen
Extend to 256-bit registers for 2× throughput on modern CPUs:
- `<8 x i32>`, `<8 x float>`, `<4 x double>` element types
- `VPADDD`, `VMULPS`, etc.
Phase 4 — Auto-vectorization pass
A middle-end pass that detects scalar loops over arrays and transforms them into vector IR, enabling the backend to emit SIMD instructions even for scalar source programs.
Acceptance criteria
- All vector `InstrKind` variants produce valid (non-panicking) x86 output at O0 (scalarized)
- SSE4.2 lowering for the 12 most common patterns, each with an encoding test
- At least one end-to-end test: vector `.ll` → ELF → link → run → correct result
- Feature-gated: vectorization only emitted when the target CPU supports it (via a `TargetFeatures` flag)