@@ -775,33 +775,34 @@ fn _has_gpu_fp32_tensor_cores() -> Bool:
 
 @always_inline("nodebug")
 fn _has_gpu_bf16_fma() -> Bool:
-    """Returns True if the GPU supports BF16 outputs with FMA operations.
+    """Returns True if the GPU supports BF16 FMA operations.
 
-    This checks whether the GPU can perform BF16 × BF16 → BF16 operations
-    using scalar/vector FMA instructions (not tensor cores).
+    This checks whether the GPU can perform BF16 × BF16 operations using
+    scalar/vector FMA instructions (not tensor cores). On some platforms,
+    this may use FP32 emulation internally.
 
     Returns True for:
-    - NVIDIA GPUs (all architectures support BF16 FMA)
-    - AMD CDNA GPUs with MFMA (MI300X, MI355X)
+    - NVIDIA GPUs (all architectures support native BF16 FMA)
+    - AMD CDNA GPUs with MFMA (MI300X, MI355X - native BF16 support)
+    - AMD RDNA GPUs (RDNA3+ - emulated via FP32 accumulation)
     - Apple GPUs (M-series support BF16 operations)
 
-    Returns False for:
-    - AMD RDNA GPUs - these require FP32 accumulation for BF16 FMA.
-      BF16 outputs are only supported via WMMA (tensor cores), which
-      LLVM cannot lower yet. For FMA operations, RDNA requires
-      BF16 inputs with FP32 outputs.
+    Implementation notes:
+    - RDNA3 hardware supports BF16 via v_wmma_* instructions, but LLVM
+      cannot lower these intrinsics yet. For FMA operations, the compiler
+      automatically promotes BF16 to FP32, performs the FP32 computation,
+      then converts back to BF16. This emulation provides correct results
+      with some performance overhead.
+    - CDNA uses native v_mfma_* instructions for BF16.
 
     Note:
         This is specifically for FMA (non-tensor-core) operations.
         For tensor core BF16 support, use _has_gpu_tensor_cores().
 
     Returns:
-        True if the GPU supports BF16 output with FMA operations.
+        True if the GPU supports BF16 FMA operations (native or emulated).
     """
-    # NVIDIA: All GPUs support BF16 FMA
-    # AMD: Only CDNA (MFMA) supports BF16 outputs; RDNA requires FP32 accumulation
-    # Apple: M-series GPUs support BF16 operations
-    return is_nvidia_gpu() or _has_amd_tensor_cores() or is_apple_gpu()
+    return is_nvidia_gpu() or has_amd_gpu_accelerator() or is_apple_gpu()
 
 
 @always_inline("nodebug")
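The promote-to-FP32, compute, convert-back path that the new docstring describes for RDNA can be illustrated outside the GPU entirely. Below is a minimal Python sketch (illustrative helper names, not part of this codebase; round-to-nearest-even truncation assumed, NaN/Inf handling omitted) of BF16 FMA emulated through FP32:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # BF16 is the top 16 bits of an FP32 value; round-to-nearest-even
    # on the 16 discarded mantissa bits. (NaN/Inf handling omitted.)
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    # Widening BF16 -> FP32 is exact: restore the low 16 zero bits.
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

def bf16_fma_emulated(a: float, b: float, c: float) -> float:
    # Quantize the operands to BF16, promote to FP32, do the arithmetic
    # in FP32, then truncate the result back to BF16 -- the same
    # promote/compute/convert sequence the docstring describes.
    a32 = bf16_bits_to_f32(f32_to_bf16_bits(a))
    b32 = bf16_bits_to_f32(f32_to_bf16_bits(b))
    c32 = bf16_bits_to_f32(f32_to_bf16_bits(c))
    return bf16_bits_to_f32(f32_to_bf16_bits(a32 * b32 + c32))
```

Values exactly representable in BF16 round-trip unchanged (e.g. `bf16_fma_emulated(1.0, 2.0, 3.0)` yields `5.0`); inputs needing more than 8 significant bits are rounded at each conversion, which is the precision cost of this style of emulation.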