
Conversation

@benrichard-amd (Contributor)

Motivation

Update VALU FMA benchmark so that FP16 numbers are closer to peak

Technical Details

  • The FP16 result was very low, around 0.25x the FP32 rate on MI300X/MI350X. It should be roughly 2x FP32 on MI100 and roughly 1x FP32 on MI300/MI350.

  • Updated the VALU FMA test to use vector types. This hints to the compiler that packed math should be used when available and allows for more instruction-level parallelism (see the sketch after this list).

  • Also assigned a different iteration count to each type, since the types run at different rates, to keep the total running time under control.

  • I checked the disassembly; packed math is used for FP16 and FP32. Clang has an option to disable packed FP32 math if we want to do that.
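
For reference, here is a rough sketch of the shape of the new loop. This is illustrative only: vec4<T>, nFMA, and count appear in the diff context quoted in the review comments below, but the second accumulator chain, the a/b constants, and the exact embedding as a HIP source string are assumptions, and the real kernel's FLOP accounting is not reproduced here.

```python
# Illustrative sketch only, not the actual kernel from this PR. It mirrors the
# way the benchmark embeds HIP source in a Python string; vec4<T>, nFMA, and
# count come from the quoted diff context, everything else is assumed.
FMA_LOOP_SKETCH = """
    // Independent vec4 accumulator chains expose instruction-level
    // parallelism, and the vector type encourages the compiler to emit
    // packed FMAs (e.g. v_pk_fma_f16) where the hardware supports them.
    vec4<T> x0 = {(T)1, (T)2, (T)3, (T)4};
    vec4<T> x1 = {(T)5, (T)6, (T)7, (T)8};   // assumed second chain
    vec4<T> a  = {(T)0.5, (T)0.5, (T)0.5, (T)0.5};
    vec4<T> b  = {(T)0.25, (T)0.25, (T)0.25, (T)0.25};

    for (int i = 0; i < count; i++) {
        for (int j = 0; j < nFMA / 4; j++) {
            // Each vec4 FMA performs 4 scalar FMAs (8 FLOPs).
            x0 = a * x0 + b;
            x1 = a * x1 + b;   // independent of the x0 chain
        }
    }
"""
```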

Old (MI350X):

Peak VALU FLOPs (FP16), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:549755813888, duration:15.5 ms, mean:35490.2 GFLOPS, stdev=43.0 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:274877906944, duration:2.0 ms, mean:135885.7 GFLOPS, stdev=3189.5 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:137438953472, duration:2.0 ms, mean:69220.3 GFLOPS, stdev=1413.7 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT8), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:1099511627776, duration:14.8 ms, mean:74510.3 GOPS, stdev=36.4 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:274877906944, duration:4.2 ms, mean:66154.8 GOPS, stdev=674.7 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:137438953472, duration:7.8 ms, mean:17709.6 GOPS, stdev=48.6 GFLOPS

New (MI350X):

Peak VALU FLOPs (FP16), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:4398046511104, duration:30.8 ms, mean:142627.6 GFLOPS, stdev=86.2 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:4398046511104, duration:31.1 ms, mean:141336.3 GFLOPS, stdev=585.4 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:2199023255552, duration:31.0 ms, mean:70997.5 GFLOPS, stdev=87.0 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT8), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:2199023255552, duration:36.5 ms, mean:60258.1 GOPS, stdev=97.4 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:2199023255552, duration:36.1 ms, mean:60906.7 GOPS, stdev=574.0 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:1099511627776, duration:62.1 ms, mean:17699.0 GOPS, stdev=20.1 GFLOPS

Test Plan

  • Verify FP16 is close to FP32 on MI300X/MI350X.
  • Verify FP16 is ~2x FP32 on MI100.
  • Verify other scores are not negatively affected

Test Result

Tested on MI100, MI325X and MI350X.

Submission Checklist

Inline review comment on the kernel's FMA loop. Diff context:

```
Use vector type and multiple variables to improve ILP.
Still get similar performance
vec4<T> x0 = {(T)1,(T)2,(T)3,(T)4};

for(int i = 0; i < count; i++) {
    for(int j = 0; j < nFMA / 4; j++) {
```

Contributor: Probably should guard with a static_assert(nFMA % 4 == 0) check.

```
    }
"""
```


Inline review comment on the host-side flops_bench function. Diff context:

```python
def flops_bench(device: int, type: str, unit: str, rate: int) -> PerfMetrics:
    nFMA = 1024
```

Contributor: Comment for what this actually means?
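
A hedged sketch of how this might be addressed (the VALU_NFMA name is taken from the follow-up comment further down; the explanatory wording is an assumption inferred from the nFMA / 4 loop in the kernel):

```python
# Assumed documentation sketch, to be confirmed against the kernel: this value
# controls how much FMA work each thread issues per outer-loop pass and must be
# divisible by 4, because the kernel consumes it in vec4 chunks (nFMA / 4 inner
# iterations). It is also the second template argument of flops_benchmark.
VALU_NFMA = 1024
```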

Inline review comment on the kernel selector table. Diff context:

```python
flops_kernel_selector = {
    "FP16": ["flops_benchmark<__half, 1024>", sizeof(c_short)],
```

Contributor: Shouldn't these use the nFMA var instead of hardcoding 1024? Could make nFMA global.
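
One possible shape for that, assuming a module-level VALU_NFMA constant as in the sketch above (only the FP16 entry is shown, mirroring the quoted context; other entries would follow the same pattern):

```python
from ctypes import c_short, sizeof

VALU_NFMA = 1024  # shared with the kernel's nFMA template argument

# Build the template instantiation string from the constant instead of
# hardcoding 1024 in each entry.
flops_kernel_selector = {
    "FP16": [f"flops_benchmark<__half, {VALU_NFMA}>", sizeof(c_short)],
    # ... other types follow the same pattern ...
}
```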

Inline review comment on the benchmark configuration globals. Diff context:

```python
num_experiments = DEFAULT_NUM_EXPERIMENTS
workgroup_size = DEFAULT_WORKGROUP_SIZE
dataset_size = DEFAULT_DATASET_SIZE
```

Contributor: Remove this; the global var is not needed.

@vedithal-amd (Contributor)

Will also need a CHANGELOG update to say improved VALU FP16 roofline peak.

@vedithal-amd (Contributor)

Public reference for VALU FP16 FLOPS for the MI355X: https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html

@jamessiddeley-amd (Contributor)

Address review comments for the VALU FP16 benchmark improvements:

Added a VALU_NFMA global constant with a couple of comments, updated flops_kernel_selector to use the global, added a static_assert for vec4 alignment, and updated the CHANGELOG.
