Tuning specs used for convs:
I've started looking at codegen quality across RDNA3 and RDNA4. Both targets are supported to a similar extent: at the time of writing, we can target (dense) WMMA instructions and use virtually identical configuration logic on both.
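As a concrete illustration of what "virtually identical" means in practice, here is a minimal sketch that compiles the same linalg matmul for both architectures through the IREE compiler Python bindings, varying only the target flag. The `gfx1100`/`gfx1201` architecture names and the `--iree-hip-target` flag are my assumptions for the RDNA3/RDNA4 cards here, not something confirmed by the measurements below.

```python
# A minimal sketch, assuming the iree-compiler Python bindings are installed
# and that gfx1100 / gfx1201 are the architectures of the two cards under test.
import iree.compiler as ireec

MATMUL_MLIR = """
func.func @matmul(%lhs: tensor<2048x2048xf16>, %rhs: tensor<2048x2048xf16>)
    -> tensor<2048x2048xf32> {
  %cst = arith.constant 0.0 : f32
  %empty = tensor.empty() : tensor<2048x2048xf32>
  %acc = linalg.fill ins(%cst : f32) outs(%empty : tensor<2048x2048xf32>)
      -> tensor<2048x2048xf32>
  %mm = linalg.matmul ins(%lhs, %rhs : tensor<2048x2048xf16>, tensor<2048x2048xf16>)
      outs(%acc : tensor<2048x2048xf32>) -> tensor<2048x2048xf32>
  return %mm : tensor<2048x2048xf32>
}
"""

# Same module, compiled once per architecture; only the target flag differs.
for gfx in ("gfx1100", "gfx1201"):
    vmfb = ireec.compile_str(
        MATMUL_MLIR,
        target_backends=["rocm"],
        extra_args=[f"--iree-hip-target={gfx}"],
    )
    with open(f"matmul_{gfx}.vmfb", "wb") as f:
        f.write(vmfb)
```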
For the purpose of this comparison, I'm using two workstation cards:
Looking at GEMM/Conv performance w/ f16/bf16 element types:
**RDNA3 vs RDNA4 Matmul Performance Comparison**
Using the following BOO driver commands:
**No tuning**
- `matmul_like_2048x2048x2048_f16xf16xf32`
- `matmul_like_4096x4096x4096_f16xf16xf32`
- `matmul_like_8192x2048x1024_f16xf16xf32`
- `matmul_like_8192x2048x8192_f16xf16xf32`
- `matvec_like_8192x1024_f16xf16xf32`
- `matvec_like_1280x8192_f16xf16xf32`
- `matmul_like_8192x4x1024_f16xf16xf32`
- `matmul_like_1280x4x8192_f16xf16xf32`
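Since the charts compare throughput across these shapes, it helps to make the naming convention explicit: by my reading, `matmul_like_MxNxK_AxBxC` encodes the GEMM dimensions plus the two input and accumulator element types. A hypothetical helper for converting a measured runtime into TFLOP/s (treating `matvec_like` names as MxK with N = 1, which is my assumption):

```python
import re

def matmul_tflops(name: str, runtime_us: float) -> float:
    """Estimate achieved TFLOP/s from a benchmark name and a runtime in microseconds."""
    m = re.match(r"(?:matmul|matvec)_like_(\d+)x(\d+)(?:x(\d+))?_", name)
    if m is None:
        raise ValueError(f"unrecognized benchmark name: {name}")
    dims = [int(d) for d in m.groups() if d is not None]
    # matvec names carry only two dims; treat them as M x K with N = 1 (assumed).
    M, N, K = dims if len(dims) == 3 else (dims[0], 1, dims[1])
    flops = 2 * M * N * K  # one multiply + one add per MAC
    return flops / (runtime_us * 1e-6) / 1e12

# e.g. the 2048^3 GEMM finishing in 450 us would be ~38 TFLOP/s:
print(matmul_tflops("matmul_like_2048x2048x2048_f16xf16xf32", 450.0))
```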
**RDNA3 vs RDNA4 Convolution Performance Comparison**

Using the following BOO driver commands:

**No tuning**
- `matmul_like_3x224x2016_bf16xbf16xf32`
- `matmul_like_4x224x2016_bf16xbf16xf32`
- `matmul_like_4096x576x576_bf16xbf16xf32`
- `conv_16x48x32x2048x3x3x768_bf16xbf16xf32` (forward)
- `conv_16x48x32x768x3x3x2048_bf16xbf16xf32` (backward)
- `conv_16x48x32x576x3x3x576_bf16xbf16xf32` (backward)
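The `conv_*` entries can be read the same way. The sketch below assumes the dimension string is N x H x W x C x KH x KW x F with stride 1 and "same" padding (output spatial size equal to the input's); those are guesses on my part, since the BOO naming isn't spelled out here.

```python
def conv_tflops(name: str, runtime_us: float) -> float:
    """Estimate TFLOP/s for a name like 'conv_16x48x32x2048x3x3x768_...'.

    Assumes dims are N x H x W x C x KH x KW x F, stride 1, 'same' padding --
    all guesses about BOO's naming, not confirmed.
    """
    dims = [int(d) for d in name.split("_")[1].split("x")]
    n, h, w, c, kh, kw, f = dims
    flops = 2 * n * h * w * f * kh * kw * c  # one mul + one add per MAC
    return flops / (runtime_us * 1e-6) / 1e12
```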
**Tuned**

**RDNA3 vs RDNA4 Convolution Performance Comparison (Tuned)**
- `matmul_like_3x224x2016_bf16xbf16xf32`
- `matmul_like_4x224x2016_bf16xbf16xf32`
- `matmul_like_4096x576x576_bf16xbf16xf32`
- `conv_16x48x32x2048x3x3x768_bf16xbf16xf32` (forward)
- `conv_16x48x32x768x3x3x2048_bf16xbf16xf32` (backward)
- `conv_16x48x32x576x3x3x576_bf16xbf16xf32` (backward)

TBC