
Conversation

@benrichard-amd (Contributor)

Motivation

Update VALU FMA benchmark so that FP16 numbers are closer to peak

Technical Details

  • The FP16 result was very low, around 0.25x the FP32 rate on MI300X/MI350X. It should be roughly 2x FP32 on MI100 and roughly 1x FP32 on MI300/MI350.

  • Updated the VALU FMA test to use vector types. This hints to the compiler that packed math should be used when available and allows for more instruction-level parallelism (see the sketch after this list).

  • Also assigned a different iteration count to each type, since the types run at different rates, to keep the total running time under control.

  • I checked the disassembly; packed math is used for FP16 and FP32. Clang has an option to disable packed FP32 math if we want to do that.
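
For reference, here is a rough sketch of the shape of the new loop. This is illustrative only: vec4<T>, nFMA, and count appear in the diff context quoted in the review comments below, but the second accumulator chain, the a/b constants, and the exact embedding as a HIP source string are assumptions, and the real kernel's FLOP accounting is not reproduced here.

```python
# Illustrative sketch only, not the actual kernel from this PR. It mirrors the
# way the benchmark embeds HIP source in a Python string; vec4<T>, nFMA, and
# count come from the quoted diff context, everything else is assumed.
FMA_LOOP_SKETCH = """
    // Independent vec4 accumulator chains expose instruction-level
    // parallelism, and the vector type encourages the compiler to emit
    // packed FMAs (e.g. v_pk_fma_f16) where the hardware supports them.
    vec4<T> x0 = {(T)1, (T)2, (T)3, (T)4};
    vec4<T> x1 = {(T)5, (T)6, (T)7, (T)8};   // assumed second chain
    vec4<T> a  = {(T)0.5, (T)0.5, (T)0.5, (T)0.5};
    vec4<T> b  = {(T)0.25, (T)0.25, (T)0.25, (T)0.25};

    for (int i = 0; i < count; i++) {
        for (int j = 0; j < nFMA / 4; j++) {
            // Each vec4 FMA performs 4 scalar FMAs (8 FLOPs).
            x0 = a * x0 + b;
            x1 = a * x1 + b;   // independent of the x0 chain
        }
    }
"""
```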

Old (MI350X):

Peak VALU FLOPs (FP16), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:549755813888, duration:15.5 ms, mean:35490.2 GFLOPS, stdev=43.0 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:274877906944, duration:2.0 ms, mean:135885.7 GFLOPS, stdev=3189.5 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:137438953472, duration:2.0 ms, mean:69220.3 GFLOPS, stdev=1413.7 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT8), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:1099511627776, duration:14.8 ms, mean:74510.3 GOPS, stdev=36.4 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:274877906944, duration:4.2 ms, mean:66154.8 GOPS, stdev=674.7 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:137438953472, duration:7.8 ms, mean:17709.6 GOPS, stdev=48.6 GFLOPS

New (MI350X):

Peak VALU FLOPs (FP16), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:4398046511104, duration:30.8 ms, mean:142627.6 GFLOPS, stdev=86.2 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:4398046511104, duration:31.1 ms, mean:141336.3 GFLOPS, stdev=585.4 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, FLOP:2199023255552, duration:31.0 ms, mean:70997.5 GFLOPS, stdev=87.0 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT8), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:2199023255552, duration:36.5 ms, mean:60258.1 GOPS, stdev=97.4 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT32), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:2199023255552, duration:36.1 ms, mean:60906.7 GOPS, stdev=574.0 GFLOPS
100% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
Peak VALU IOPs (INT64), GPU ID: 0, workgroupSize:256, workgroups:32768, experiments:100, IOP:1099511627776, duration:62.1 ms, mean:17699.0 GOPS, stdev=20.1 GFLOPS

Test Plan

  • Verify FP16 is close to FP32 on MI300X/MI350X.
  • Verify FP16 is ~2x FP32 on MI100.
  • Verify other scores are not negatively affected

Test Result

Tested on MI100, MI325X and MI350X.

Submission Checklist

Inline review comment on the kernel's FMA loop. Diff context:

```
Use vector type and multiple variables to improve ILP.
Still get similar performance
vec4<T> x0 = {(T)1,(T)2,(T)3,(T)4};

for(int i = 0; i < count; i++) {
    for(int j = 0; j < nFMA / 4; j++) {
```

Contributor: Probably should guard with a static_assert(nFMA % 4 == 0) check.

```
    }
"""
```


Inline review comment on the host-side flops_bench function. Diff context:

```python
def flops_bench(device: int, type: str, unit: str, rate: int) -> PerfMetrics:
    nFMA = 1024
```

Contributor: Comment for what this actually means?
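
A hedged sketch of how this might be addressed (the VALU_NFMA name is taken from the follow-up comment further down; the explanatory wording is an assumption inferred from the nFMA / 4 loop in the kernel):

```python
# Assumed documentation sketch, to be confirmed against the kernel: this value
# controls how much FMA work each thread issues per outer-loop pass and must be
# divisible by 4, because the kernel consumes it in vec4 chunks (nFMA / 4 inner
# iterations). It is also the second template argument of flops_benchmark.
VALU_NFMA = 1024
```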

Inline review comment on the kernel selector table. Diff context:

```python
flops_kernel_selector = {
    "FP16": ["flops_benchmark<__half, 1024>", sizeof(c_short)],
```

Contributor: Shouldn't these use the nFMA var instead of hardcoding 1024? Could make nFMA global.
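
One possible shape for that, assuming a module-level VALU_NFMA constant as in the sketch above (only the FP16 entry is shown, mirroring the quoted context; other entries would follow the same pattern):

```python
from ctypes import c_short, sizeof

VALU_NFMA = 1024  # shared with the kernel's nFMA template argument

# Build the template instantiation string from the constant instead of
# hardcoding 1024 in each entry.
flops_kernel_selector = {
    "FP16": [f"flops_benchmark<__half, {VALU_NFMA}>", sizeof(c_short)],
    # ... other types follow the same pattern ...
}
```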

Inline review comment on the benchmark configuration globals. Diff context:

```python
num_experiments = DEFAULT_NUM_EXPERIMENTS
workgroup_size = DEFAULT_WORKGROUP_SIZE
dataset_size = DEFAULT_DATASET_SIZE
```

Contributor: Remove this; the global var is not needed.

@vedithal-amd (Contributor)

Will also need a CHANGELOG update to say improved VALU FP16 roofline peak.

@vedithal-amd (Contributor)

Public reference for VALU FP16 FLOPS for the MI355X: https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html

@jamessiddeley-amd (Contributor)

Address review comments for the VALU FP16 benchmark improvements:

Added a VALU_NFMA global constant with a couple of comments, updated flops_kernel_selector to use the global, added a static_assert for vec4 alignment, and updated the CHANGELOG.
