Feature: NVIDIA Vera / Olympus SVE2+FP8 backend #319

@ashvardanian

Description

Describe what you are looking for

NVIDIA Vera's Olympus cores are ARMv9.2-A with 6x 128-bit SVE2 engines per core. Key ISA extensions: FP8DOT2, FP8DOT4, FAMINMAX, LUT, SVE2_BITPERM. No SME: this is a wide SVE2 machine, not a matrix-engine architecture.

The defining new capability is native FP8 dot-product instructions. FDOT with FP8DOT2 takes two vectors of FP8 (E4M3 or E5M2), does a pairwise multiply-add, and accumulates into FP16 lanes. FP8DOT4 does 4-way accumulation into FP32. This is the first CPU with hardware FP8 arithmetic: no LUT and no software conversion needed. The svefp8 backend adds the FP8 tier and FAMINMAX on top of the existing SVE2 dispatch paths.
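For reference, the E4M3 encoding itself is simple enough to decode in a few lines. A minimal scalar sketch of the OFP8 E4M3 format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, NaN at S.1111.111) that the hardware operates on natively:

```c
#include <math.h>

/* Scalar reference for decoding an OFP8 E4M3 byte to float:
   1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
   E4M3 has no infinities; S.1111.111 is the only NaN encoding. */
static float e4m3_to_float(unsigned char bits) {
    int sign = (bits >> 7) & 1;
    int exp = (bits >> 3) & 0xF;
    int man = bits & 0x7;
    float value;
    if (exp == 0xF && man == 0x7) return NAN;         /* NaN */
    if (exp == 0) value = ldexpf((float)man, -9);     /* subnormal: (m/8) * 2^-6 */
    else value = ldexpf(8.0f + (float)man, exp - 10); /* (1 + m/8) * 2^(exp-7) */
    return sign ? -value : value;
}
```

This also makes the dynamic range concrete: the largest finite E4M3 value is 0x7E = 448.0, and the smallest subnormal is 2^-9.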

dot/ and dots/

The highest-impact new kernel is nk_dot_e4m3_svefp8 using svdot_f16_mf8_fpm (the ACLE FP8DOT2 intrinsic; the E4M3/E5M2 format is selected via its fpm_t argument). Two SVE vectors of e4m3 produce FP16 results in hardware, with no LUT and no upcast. With 6 engines at 128-bit, each core processes 96 e4m3 elements per cycle. The e5m2 variant uses the same instruction with a different format selector. For f32 output, FP8DOT4 accumulates directly into f32 via nk_dot_e4m3_f32_svefp8.
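To pin down what FP8DOT2 computes, here is a scalar model of its lane pattern: each 16-bit accumulator lane receives the pairwise product sum of two adjacent input elements. Inputs are shown as floats standing in for already-decoded FP8 values, and the intermediate f16 rounding of the real instruction is ignored; the lane count mimics one 128-bit f16 vector.

```c
#include <stddef.h>

/* Scalar model of the FP8DOT2 lane pattern: accumulator lane i receives
   a[2i]*b[2i] + a[2i+1]*b[2i+1]. Floats stand in for decoded FP8 values;
   the real instruction's intermediate f16 rounding is ignored here. */
static float dot_fp8dot2_model(const float *a, const float *b, size_t n) {
    float acc[8] = {0}; /* 8 "lanes", as in one 128-bit f16 vector */
    for (size_t i = 0; i + 1 < n; i += 2) {
        size_t lane = (i / 2) % 8;
        acc[lane] += a[i] * b[i] + a[i + 1] * b[i + 1];
    }
    float sum = 0;
    for (int l = 0; l < 8; ++l) sum += acc[l]; /* final horizontal reduce */
    if (n & 1) sum += a[n - 1] * b[n - 1];     /* scalar tail */
    return sum;
}
```

The final horizontal reduce happens once per vector, outside the hot loop, which is what makes the accumulate-in-lanes pattern cheap.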

Batched dots/ state machines (e4m3x4, e5m2x4) use FP8DOT2/DOT4 in the inner loop with replicated accumulators per output lane. This is where Olympus has a unique advantage over every other backend: hardware FP8 dot products at the batched tile level.

spatial/ and spatials/

The new Olympus-specific kernels are nk_euclidean_e4m3_svefp8 and nk_angular_e4m3_svefp8, computing distances directly on FP8 embeddings. For Euclidean, compute a·a - 2a·b + b·b, where the a·b cross-term uses FP8DOT2 and the squared norms are precomputed. Cosine needs three FP8 dot products (a·b, a·a, b·b) in parallel; 6 SVE engines handle this without register pressure.

FAMINMAX adds FAMIN and FAMAX (svamin, svamax), each computing the minimum or maximum of absolute values in a single instruction. Useful for reduce/ kernels and for clamping log inputs in probability/ (KL divergence, Jensen-Shannon), where NaN avoidance currently needs separate fabs + min/max ops.

Batched spatials/ uses standard SVE2 tiled approach with FP8DOT2/DOT4 in the inner loop. The 6-wide frontend issues multiple independent dot-product streams per cycle, so batched variants get ILP even without a matrix engine.

set/ and sets/

nk_hamming_u1_svefp8 uses SVE2 sveor + svcnt with predication. SVE2_BITPERM adds svbext/svbdep for bit extract/deposit that could accelerate packed binary ops, but the primary u1 path remains XOR+popcount.
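For completeness, a portable reference for the u1 path; the SVE2 kernel performs the same XOR + popcount per vector with sveor + svcnt, while this sketch uses Kernighan's bit-clearing loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference for the u1 Hamming path: XOR the packed bit
   vectors, then popcount the differing bits. */
static size_t hamming_u1_ref(const uint8_t *a, const uint8_t *b,
                             size_t n_bytes) {
    size_t dist = 0;
    for (size_t i = 0; i < n_bytes; ++i) {
        uint8_t x = a[i] ^ b[i];          /* differing bits */
        while (x) { x &= x - 1; ++dist; } /* Kernighan popcount */
    }
    return dist;
}
```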

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
