Description
Describe what you are looking for
NVIDIA Vera's Olympus cores are ARMv9.2-A with 6x 128-bit SVE2 engines per core. Key ISA extensions: FP8DOT2, FP8DOT4, FAMINMAX, LUT, SVE2_BITPERM. No SME: this is a wide SVE2 machine, not a matrix-engine architecture.
The defining new capability is native FP8 dot-product instructions. FDOT with FP8DOT2 takes two vectors of FP8 (E4M3 or E5M2), does a pairwise multiply-add, and accumulates into FP16. FP8DOT4 does 4-way accumulation into FP32. This is the first CPU with hardware FP8 arithmetic: no LUT, no software conversion. The svefp8 backend adds the FP8 tier and FAMINMAX on top of the existing SVE2 dispatch paths.
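To make the semantics concrete, here is a portable scalar model of what one FP8DOT2 lane computes. The `e4m3_decode` and `fp8dot2_lane` helpers are illustrative, not library or ACLE names; the decode follows the OFP8 E4M3 layout (1 sign, 4 exponent bits with bias 7, 3 mantissa bits, NaN but no infinities), which is the assumed input format.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical scalar reference for the FP8DOT2 semantics described
   above. OFP8 E4M3: 1 sign, 4 exponent bits (bias 7), 3 mantissa bits;
   0x7F / 0xFF encode NaN, and there are no infinities. */
static float e4m3_decode(uint8_t v) {
    int sign = (v >> 7) & 1;
    int exp = (v >> 3) & 0xF;
    int man = v & 0x7;
    float mag;
    if (exp == 0xF && man == 0x7) return NAN;          /* NaN, no Inf in E4M3 */
    if (exp == 0) mag = ldexpf((float)man, -9);        /* subnormal: man/8 * 2^-6 */
    else mag = ldexpf(8.0f + (float)man, exp - 10);    /* (1 + man/8) * 2^(exp-7) */
    return sign ? -mag : mag;
}

/* One FP8DOT2 output lane: two adjacent FP8 products accumulate into a
   wider accumulator (f32 here standing in for the hardware's f16). */
static float fp8dot2_lane(float acc, const uint8_t a[2], const uint8_t b[2]) {
    return acc + e4m3_decode(a[0]) * e4m3_decode(b[0])
               + e4m3_decode(a[1]) * e4m3_decode(b[1]);
}
```

FP8DOT4 follows the same pattern with four adjacent products per lane and an f32 accumulator.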
dot/ and dots/
The highest-impact new kernel is nk_dot_e4m3_svefp8 using svdot_f16_f8 (FP8DOT2). Two SVE vectors of e4m3 produce FP16 results in hardware: no LUT, no upcast. With 6 engines at 128 bits each, a core processes 96 e4m3 elements per cycle. The e5m2 variant uses the same instruction. For f32 output, FP8DOT4 accumulates directly into f32 via nk_dot_e4m3_f32_svefp8.
Batched dots/ state machines (e4m3x4, e5m2x4) use FP8DOT2/DOT4 in the inner loop with replicated accumulators per output lane. This is where Olympus has a unique advantage over every other backend: hardware FP8 dot products at the batched tile level.
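The accumulator-replication structure can be sketched in scalar form: one query against four targets, each with its own accumulator, so the four multiply-accumulate streams carry no dependency on each other and can issue on separate SVE engines. `dot_x4` is a hypothetical name illustrating the shape, not the library's API.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a batched x4 dot product with one replicated accumulator
   per output lane. In the real kernel the inner multiply-accumulate
   would be an FP8DOT2/DOT4; floats keep this reference portable. */
static void dot_x4(const float *q, const float *t[4], size_t n, float out[4]) {
    float acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < n; ++i)        /* 4 independent MAC chains */
        for (int k = 0; k < 4; ++k)
            acc[k] += q[i] * t[k][i];
    for (int k = 0; k < 4; ++k) out[k] = acc[k];
}
```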
spatial/ and spatials/
The new Olympus-specific kernels are nk_euclidean_e4m3_svefp8 and nk_angular_e4m3_svefp8: distances computed directly on FP8 embeddings. For Euclidean, compute |a|^2 - 2a·b + |b|^2, where the a·b cross-term uses FP8DOT2 and the squared norms are precomputed. Cosine needs three FP8 dot products (a·b, a·a, b·b) in parallel; 6 SVE engines handle this without register pressure.
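A scalar reference for the norm decomposition, assuming squared norms are stored alongside the FP8 vectors; only the cross-term dot product runs at query time. Function names are illustrative, not the library's API.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Reference for the |a|^2 - 2ab + |b|^2 identity: with both squared
   norms precomputed, squared Euclidean distance costs one dot product. */
static float dot_f32(const float *a, const float *b, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}
static float euclidean_sq(const float *a, float a_norm_sq,
                          const float *b, float b_norm_sq, size_t n) {
    return a_norm_sq - 2.0f * dot_f32(a, b, n) + b_norm_sq;
}
```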
FAMINMAX (svamin/svamax) computes an absolute minimum or maximum in a single instruction instead of a separate abs + min/max pair. Useful for reduce/ kernels and for clamping log inputs in probability/ (KL divergence, Jensen-Shannon), where NaN avoidance currently needs separate min and max ops.
Batched spatials/ use the standard SVE2 tiled approach with FP8DOT2/DOT4 in the inner loop. With six SVE2 engines, the core keeps multiple independent dot-product streams in flight per cycle, so the batched variants get ILP even without a matrix engine.
set/ and sets/
nk_hamming_u1_svefp8 uses SVE2 sveor + svcnt with predication. SVE2_BITPERM adds svbext/svbdep for bit extract/deposit that could accelerate packed binary ops, but the primary u1 path remains XOR+popcount.
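A scalar reference for the u1 path, mirroring the sveor + svcnt kernel byte by byte; the function name is illustrative and `__builtin_popcount` (GCC/Clang) stands in for the vector popcount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference for packed-binary Hamming distance: XOR the bit vectors,
   popcount the differing bits. The SVE2 kernel does the same with
   sveor + svcnt over whole vectors under predication. */
static unsigned hamming_u1(const uint8_t *a, const uint8_t *b, size_t n_bytes) {
    unsigned d = 0;
    for (size_t i = 0; i < n_bytes; ++i)
        d += (unsigned)__builtin_popcount(a[i] ^ b[i]);
    return d;
}
```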
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct