Add native FP8/BF8 WMMA support for RDNA4 GPUs, which provide hardware
FP8 matrix operations via the llvm.amdgcn.wmma.f32.16x16x32.fp8
intrinsic.
FP8 support by RDNA generation (a gating sketch follows this list):
- RDNA4: Native FP8/BF8 WMMA (16x16x32 shape)
- RDNA3: Requires emulation via FP16/BF16 conversion - future work
- RDNA1/2: Not supported (requires fallback paths)
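As a rough illustration of the per-generation gating described above, a
capability check could look like the sketch below. The enum and function
names are hypothetical, not identifiers from this patch:

    // Hypothetical sketch of per-generation FP8 WMMA gating.
    enum class RDNAGen { RDNA1, RDNA2, RDNA3, RDNA4 };

    // Native FP8/BF8 WMMA exists only on RDNA4.
    bool hasNativeFP8WMMA(RDNAGen gen) {
      return gen == RDNAGen::RDNA4;
    }

    // RDNA3 would take an FP16/BF16 emulation path (future work);
    // RDNA1/2 must use a fallback path instead.
    bool canEmulateFP8WMMA(RDNAGen gen) {
      return gen == RDNAGen::RDNA3;
    }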
The FP8 intrinsic follows the established RDNA WMMA pattern:
llvm.amdgcn.wmma.<accum>.<M>x<N>x<K>.<input_type>
For FP8 with FP32 accumulation on RDNA4:
llvm.amdgcn.wmma.f32.16x16x32.fp8
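To make the naming pattern concrete, here is a small sketch that
assembles an intrinsic name from its components. The helper is
hypothetical, not code from this patch:

    #include <string>

    // Builds an RDNA WMMA intrinsic name following the pattern
    // llvm.amdgcn.wmma.<accum>.<M>x<N>x<K>.<input_type>.
    std::string wmmaIntrinsicName(const std::string &accum,
                                  int m, int n, int k,
                                  const std::string &input) {
      return "llvm.amdgcn.wmma." + accum + "." +
             std::to_string(m) + "x" + std::to_string(n) + "x" +
             std::to_string(k) + "." + input;
    }

    // wmmaIntrinsicName("f32", 16, 16, 32, "fp8")
    //   -> "llvm.amdgcn.wmma.f32.16x16x32.fp8"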
Supported FP8 formats (per AMD RDNA4 ISA Section 7.5; a decode sketch
follows this list):
- FP8 E4M3 (float8_e4m3fn): 4 exponent bits, 3 mantissa bits, ExpBias=7-8
- FP8 E5M2 (float8_e5m2): 5 exponent bits, 2 mantissa bits, ExpBias=15
- BF8 E5M2: 5 exponent bits, 2 mantissa bits, ExpBias=16
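For illustration only, the sketch below decodes an 8-bit value under a
generic sign/exponent/mantissa layout. It is a simplified model: NaN/Inf
encodings and format-specific quirks (such as E4M3FN having no infinity)
are deliberately ignored:

    #include <cmath>
    #include <cstdint>

    // Simplified decode of an 8-bit float with `expBits` exponent bits,
    // (7 - expBits) mantissa bits, and exponent bias `bias`.
    double decodeFP8(uint8_t v, int expBits, int bias) {
      int manBits = 7 - expBits;
      int sign = (v >> 7) & 1;
      int exp  = (v >> manBits) & ((1 << expBits) - 1);
      int man  = v & ((1 << manBits) - 1);
      double frac = static_cast<double>(man) / (1 << manBits);
      double mag = (exp == 0)
          ? std::ldexp(frac, 1 - bias)           // subnormal
          : std::ldexp(1.0 + frac, exp - bias);  // normal
      return sign ? -mag : mag;
    }

    // decodeFP8(0x38, /*expBits=*/4, /*bias=*/7)  -> 1.0 (E4M3)
    // decodeFP8(0x3C, /*expBits=*/5, /*bias=*/15) -> 1.0 (E5M2)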
This enables FP8 quantized model inference on RDNA4 GPUs with native
hardware acceleration, while maintaining compile-time guards to prevent
unsupported operations on earlier RDNA generations.
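A minimal sketch of what such a compile-time guard could look like in
HIP/Clang device code, assuming Clang's per-processor target macros
(e.g. __gfx1200__ / __gfx1201__ for RDNA4 parts); the macro choice and
the function name are assumptions, not taken from this patch:

    // Assumes Clang's AMDGPU per-processor macros; names hypothetical.
    #if defined(__gfx1200__) || defined(__gfx1201__)
      #define HAS_NATIVE_FP8_WMMA 1
    #else
      #define HAS_NATIVE_FP8_WMMA 0
    #endif

    void fp8_matmul_tile(/* ... */) {
    #if HAS_NATIVE_FP8_WMMA
      // RDNA4: lower to llvm.amdgcn.wmma.f32.16x16x32.fp8.
    #else
      // Earlier RDNA generations: take the FP16/BF16 emulation or
      // fallback path instead of emitting the FP8 intrinsic.
    #endif
    }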