Improve VALU FP16 test in roofline benchmark #2985
base: develop
Conversation
Use vector type and multiple variables to improve ILP.
Still get similar performance
Adjust iterations
vec4<T> x0 = {(T)1,(T)2,(T)3,(T)4};

for(int i = 0; i < count; i++) {
    for(int j = 0; j < nFMA / 4; j++) {
Probably should guard this with a static_assert(nFMA % 4 == 0) check.
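A minimal sketch of where that guard could go, assuming nFMA is the kernel's template parameter (as the flops_benchmark<__half, 1024> instantiation further down suggests); the signature shown here is illustrative, not the benchmark's actual one:

template <typename T, int nFMA>
__global__ void flops_benchmark(/* arguments elided */) {
    // Reject nFMA values that the 4-wide inner loop cannot cover exactly.
    static_assert(nFMA % 4 == 0, "nFMA must be a multiple of 4 for the vec4 inner loop");
    // ...
}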
def flops_bench(device: int, type: str, unit: str, rate: int) -> PerfMetrics:
    nFMA = 1024
Add a comment for what this value actually means?
flops_kernel_selector = {
    "FP16": ["flops_benchmark<__half, 1024>", sizeof(c_short)],
Shouldn't these use the nFMA variable instead of hardcoding the value? Could we make nFMA global?
num_experiments = DEFAULT_NUM_EXPERIMENTS
workgroup_size = DEFAULT_WORKGROUP_SIZE
dataset_size = DEFAULT_DATASET_SIZE
Remove this global var; it is not needed.
Will also need a CHANGELOG update to say improved VALU FP16 roofline peak.
Public reference for VALU FP16 FLOPS for MI355X: https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html
Address review comments for VALU FP16 benchmark improvements: added VALU_NFMA global constant with a couple of comments, updated …
Motivation
Update the VALU FMA benchmark so that FP16 numbers are closer to peak.
Technical Details
The FP16 result was very low, around 0.25x of FP32 on MI300X/MI350X. On MI100 it should be ~2x FP32, and on MI300/MI350 it should be ~1x FP32.
Update the VALU FMA test to use vector types. This hints to the compiler that it should use packed math when available, and allows for more instruction-level parallelism.
Also assigned a different number of iterations to each type to keep the running time under control, since different types run at different rates.
I checked the disassembly; packed math is used for FP16 and FP32. Clang has an option to disable packed FP32 math, if we want to do that.
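To illustrate the approach (this is a sketch, not the code in this PR: the vec4 definition, the fma4 helper, the kernel arguments, and the two-chain loop structure are assumptions made for the example), a HIP-style kernel using a 4-wide vector type and independent accumulator chains could look like:

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>  // needed if the kernel is instantiated with __half

// Stand-in for the benchmark's vec4<T>; the real definition may differ.
template <typename T>
struct vec4 { T x, y, z, w; };

// Element-wise a*b + c; for 16-bit types the compiler can lower this to
// packed (two-lanes-per-instruction) FMAs on GPUs that support packed math.
template <typename T>
__device__ inline vec4<T> fma4(vec4<T> a, vec4<T> b, vec4<T> c) {
    return { a.x * b.x + c.x, a.y * b.y + c.y,
             a.z * b.z + c.z, a.w * b.w + c.w };
}

template <typename T, int nFMA>
__global__ void flops_benchmark(T* out, T seed, int count) {
    // Each inner iteration below issues 2 chains x 4 lanes = 8 FMA lanes,
    // i.e. nFMA lanes per outer iteration; the PR's actual accounting may differ.
    static_assert(nFMA % 8 == 0, "inner loop issues 8 FMA lanes per iteration");

    // Runtime-seeded values so the compiler cannot constant-fold the loops.
    vec4<T> a  = {seed, seed, seed, seed};
    vec4<T> c  = {seed, seed, seed, seed};
    vec4<T> x0 = {seed, seed, seed, seed};  // accumulator chain 0
    vec4<T> x1 = {seed, seed, seed, seed};  // accumulator chain 1

    for (int i = 0; i < count; i++) {
        for (int j = 0; j < nFMA / 8; j++) {
            // Independent chains let the scheduler overlap FMAs (more ILP).
            x0 = fma4(a, x0, c);
            x1 = fma4(a, x1, c);
        }
    }

    // Consume the results so the loops are not optimized away.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = x0.x + x1.w;
}

More chains can be added the same way; the diff above uses an nFMA / 4 loop with multiple accumulator variables, which amounts to the same idea.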
Old (MI350X):
New (MI350X):
Test Plan
Test Result
Tested on MI100, MI325X and MI350X.
Submission Checklist