
Enable FP8 (E4M3/E5M2) kernel execution for Radeon AI PRO R9700 (gfx1201, RDNA4) #359

@meven3000

Description


Is your feature request related to a problem? Please describe.

Currently, FP8-capable RDNA4 GPUs (e.g., Radeon AI PRO R9700) cannot run FP8 kernels through the ROCm build of Transformer Engine (TE).
When using Transformer Engine 2.1.0 for ROCm, any FP8 autocast attempt fails with:

AssertionError: Device arch gfx94x or gfx95x required for FP8 execution

This appears to stem from a hardcoded architecture whitelist inside the Transformer Engine ROCm backend, which excludes newer RDNA4 devices even though they advertise FP8 instruction support in hardware.

Describe the solution you'd like

Allow FP8 execution (E4M3/E5M2) on gfx1201 (RDNA4) GPUs by:

Detecting FP8 arithmetic capability dynamically (via HIP/ISA flags) instead of a fixed arch check, or

Extending the internal whitelist to include gfx1200–gfx1202 (RDNA4-class devices).
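The difference between the current behaviour and option 2 can be illustrated with a minimal, hypothetical sketch. It is not the actual Transformer Engine ROCm code: the function names and the prefix-matching logic are assumptions modelled only on the error message quoted above. On a ROCm build of PyTorch, the architecture string could be read from torch.cuda.get_device_properties(0).gcnArchName; here it is passed in as a plain string so the logic can be exercised without a GPU.

```python
# Hypothetical sketch: NOT the actual Transformer Engine ROCm implementation.

# Prefixes implied by the current error message ("gfx94x or gfx95x").
CURRENT_FP8_PREFIXES = ("gfx94", "gfx95")

# Option 2: additionally accept the RDNA4 targets listed in this request.
PROPOSED_FP8_ARCHS = {"gfx1200", "gfx1201", "gfx1202"}


def base_arch(gcn_arch_name: str) -> str:
    """Strip target-feature suffixes, e.g. 'gfx942:sramecc+:xnack-' -> 'gfx942'."""
    return gcn_arch_name.split(":", 1)[0].strip()


def fp8_supported_current(gcn_arch_name: str) -> bool:
    """Mimics a hardcoded gfx94x/gfx95x prefix check."""
    return base_arch(gcn_arch_name).startswith(CURRENT_FP8_PREFIXES)


def fp8_supported_proposed(gcn_arch_name: str) -> bool:
    """Keeps the existing check and also accepts the listed RDNA4 targets."""
    arch = base_arch(gcn_arch_name)
    return fp8_supported_current(arch) or arch in PROPOSED_FP8_ARCHS


if __name__ == "__main__":
    # Compare both predicates for a few example architecture strings.
    for name in ("gfx942:sramecc+:xnack-", "gfx950", "gfx1201", "gfx1100"):
        print(name, fp8_supported_current(name), fp8_supported_proposed(name))
```

Under these assumptions, gfx1201 is rejected by the current-style check and accepted by the extended one, while CDNA3 targets are unaffected. Option 1 would instead replace the set lookup with a query of the device's reported FP8 capability, so future FP8-capable targets would not need another whitelist change.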

This would unlock hardware FP8 acceleration for professional Radeon AI and workstation cards that implement FP8 units, providing significant VRAM and throughput gains.
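As a rough illustration of the VRAM side of this claim: any tensor actually held in FP8 (E4M3 or E5M2) takes 1 byte per element, versus 2 bytes for BF16/FP16. The sketch below is back-of-envelope arithmetic only; the 8-billion-parameter size is an arbitrary example, and it ignores activations, optimizer state, per-tensor scaling factors, and whether a given TE recipe keeps higher-precision copies of the weights.

```python
# Back-of-envelope storage estimate; illustrative only, not a measurement.
BYTES_PER_ELEMENT = {
    "fp32": 4,
    "bf16": 2,
    "fp16": 2,
    "fp8_e4m3": 1,
    "fp8_e5m2": 1,
}


def tensor_gb(num_elements: int, dtype: str) -> float:
    """Storage in decimal gigabytes (1 GB = 1e9 bytes) for num_elements of dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 1e9


if __name__ == "__main__":
    params = 8_000_000_000  # hypothetical 8B-parameter model, weights only
    for dtype in ("bf16", "fp8_e4m3"):
        print(f"{dtype}: {tensor_gb(params, dtype):.1f} GB")
```

For such a model on the 32 GB R9700, that is 16.0 GB of BF16 weights versus 8.0 GB in FP8, before any other allocations.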

Describe alternatives you've considered

N/A. Not possible other than by moving to NVIDIA.
Note that, although the ROCm 7.x documentation does not state it explicitly, FP8 already works without issue in PyTorch on this hardware for the KV cache. TE would, however, add further memory reduction and potential speed improvements.

Additional context

GPU: AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32 GB)

ROCm: 7.0.2

Driver: 6.16.6 (Linux 6.8 kernel)

Transformer Engine (ROCm): 2.1.0

PyTorch: 2.8.0 + ROCm 7.0.2 container

FP8 features are already present in ROCm binaries (transformer_engine_rocm-2.1.0 and hipBLASLt), but the runtime rejects RDNA4 architectures.
Enabling this path would expand the usability of ROCm’s Transformer Engine and unify FP8 coverage across both CDNA3 and RDNA4 products.
