Feature: NVIDIA Vera / Olympus SVE2+FP8 backend #319

@ashvardanian

Description

Describe what you are looking for

NVIDIA Vera's Olympus cores are ARMv9.2-A with 6x 128-bit SVE2 engines per core. Key ISA extensions: FP8DOT2, FP8DOT4, FAMINMAX, LUT, SVE2_BITPERM. No SME: this is a wide SVE2 machine, not a matrix-engine architecture.

The defining new capability is native FP8 dot-product instructions. FDOT with FP8DOT2 takes two vectors of FP8 (E4M3 or E5M2), does a pairwise multiply-add, and accumulates into FP16 lanes. FP8DOT4 does 4-way accumulation into FP32. This is the first CPU with hardware FP8 arithmetic: no LUT and no software conversion needed. The svefp8 backend adds the FP8 tier and FAMINMAX on top of the existing SVE2 dispatch paths.
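For reference, the E4M3 encoding itself is simple enough to decode in a few lines. A minimal scalar sketch of the OFP8 E4M3 format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, NaN at S.1111.111) that the hardware operates on natively:

```c
#include <math.h>

/* Scalar reference for decoding an OFP8 E4M3 byte to float:
   1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
   E4M3 has no infinities; S.1111.111 is the only NaN encoding. */
static float e4m3_to_float(unsigned char bits) {
    int sign = (bits >> 7) & 1;
    int exp = (bits >> 3) & 0xF;
    int man = bits & 0x7;
    float value;
    if (exp == 0xF && man == 0x7) return NAN;         /* NaN */
    if (exp == 0) value = ldexpf((float)man, -9);     /* subnormal: (m/8) * 2^-6 */
    else value = ldexpf(8.0f + (float)man, exp - 10); /* (1 + m/8) * 2^(exp-7) */
    return sign ? -value : value;
}
```

This also makes the dynamic range concrete: the largest finite E4M3 value is 0x7E = 448.0, and the smallest subnormal is 2^-9.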

dot/ and dots/

The highest-impact new kernel is nk_dot_e4m3_svefp8 using svdot_f16_mf8_fpm (the ACLE FP8DOT2 intrinsic; the E4M3/E5M2 format is selected via its fpm_t argument). Two SVE vectors of e4m3 produce FP16 results in hardware, with no LUT and no upcast. With 6 engines at 128-bit, each core processes 96 e4m3 elements per cycle. The e5m2 variant uses the same instruction with a different format selector. For f32 output, FP8DOT4 accumulates directly into f32 via nk_dot_e4m3_f32_svefp8.
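To pin down what FP8DOT2 computes, here is a scalar model of its lane pattern: each 16-bit accumulator lane receives the pairwise product sum of two adjacent input elements. Inputs are shown as floats standing in for already-decoded FP8 values, and the intermediate f16 rounding of the real instruction is ignored; the lane count mimics one 128-bit f16 vector.

```c
#include <stddef.h>

/* Scalar model of the FP8DOT2 lane pattern: accumulator lane i receives
   a[2i]*b[2i] + a[2i+1]*b[2i+1]. Floats stand in for decoded FP8 values;
   the real instruction's intermediate f16 rounding is ignored here. */
static float dot_fp8dot2_model(const float *a, const float *b, size_t n) {
    float acc[8] = {0}; /* 8 "lanes", as in one 128-bit f16 vector */
    for (size_t i = 0; i + 1 < n; i += 2) {
        size_t lane = (i / 2) % 8;
        acc[lane] += a[i] * b[i] + a[i + 1] * b[i + 1];
    }
    float sum = 0;
    for (int l = 0; l < 8; ++l) sum += acc[l]; /* final horizontal reduce */
    if (n & 1) sum += a[n - 1] * b[n - 1];     /* scalar tail */
    return sum;
}
```

The final horizontal reduce happens once per vector, outside the hot loop, which is what makes the accumulate-in-lanes pattern cheap.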

Batched dots/ state machines (e4m3x4, e5m2x4) use FP8DOT2/DOT4 in the inner loop with replicated accumulators per output lane. This is where Olympus has a unique advantage over every other backend: hardware FP8 dot products at the batched tile level.

spatial/ and spatials/

The new Olympus-specific kernels are nk_euclidean_e4m3_svefp8 and nk_angular_e4m3_svefp8, computing distances directly on FP8 embeddings. For Euclidean, compute a·a - 2a·b + b·b, where the a·b cross-term uses FP8DOT2 and the squared norms are precomputed. Cosine needs three FP8 dot products (a·b, a·a, b·b) in parallel; 6 SVE engines handle this without register pressure.

FAMINMAX adds FAMIN and FAMAX (svamin, svamax), each computing the minimum or maximum of absolute values in a single instruction. Useful for reduce/ kernels and for clamping log inputs in probability/ (KL divergence, Jensen-Shannon), where NaN avoidance currently needs separate fabs + min/max ops.

Batched spatials/ uses standard SVE2 tiled approach with FP8DOT2/DOT4 in the inner loop. The 6-wide frontend issues multiple independent dot-product streams per cycle, so batched variants get ILP even without a matrix engine.

set/ and sets/

nk_hamming_u1_svefp8 uses SVE2 sveor + svcnt with predication. SVE2_BITPERM adds svbext/svbdep for bit extract/deposit that could accelerate packed binary ops, but the primary u1 path remains XOR+popcount.
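For completeness, a portable reference for the u1 path; the SVE2 kernel performs the same XOR + popcount per vector with sveor + svcnt, while this sketch uses Kernighan's bit-clearing loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference for the u1 Hamming path: XOR the packed bit
   vectors, then popcount the differing bits. */
static size_t hamming_u1_ref(const uint8_t *a, const uint8_t *b,
                             size_t n_bytes) {
    size_t dist = 0;
    for (size_t i = 0; i < n_bytes; ++i) {
        uint8_t x = a[i] ^ b[i];          /* differing bits */
        while (x) { x &= x - 1; ++dist; } /* Kernighan popcount */
    }
    return dist;
}
```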

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
