Current scaling: two-stage HIP amax kernel #369
619fc5c to 9e6586f
This reverts commit 7d4054e.
const bool UseBlockAmax =
    (block_amax != nullptr) &&
    (block_capacity >= num_blocks) &&
    !nvte_use_atomic_amax();
block_amax is expected to be nullptr if nvte_use_atomic_amax() is true, so the !nvte_use_atomic_amax() check is redundant.
Changed the logic in eba552e.
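The commit itself is not shown here, so the following is only a sketch of what a simplified check could look like if the caller really does pass a null workspace whenever the atomic path is requested. The helper name `use_block_amax` and the `atomic_requested` parameter (standing in for the result of `nvte_use_atomic_amax()`) are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not the code from the PR: decides whether the
// two-stage (block amax) path can be used.
// `atomic_requested` stands in for the result of nvte_use_atomic_amax().
inline bool use_block_amax(const float* block_amax,
                           std::size_t block_capacity,
                           std::size_t num_blocks,
                           bool atomic_requested) {
  // Assumed caller contract: no workspace is passed when the atomic path is
  // requested, so the flag does not need to appear in the condition below.
  assert(!(atomic_requested && block_amax != nullptr));
  // The workspace must exist and hold one partial result per block.
  return block_amax != nullptr && block_capacity >= num_blocks;
}
```

Checking the contract with an assertion instead of folding the flag into the condition keeps the runtime check short while still catching a caller that breaks the assumption in debug builds.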
See #384 for the GH actions CI.
wangye805
left a comment
LGTM
#ifdef __HIP_PLATFORM_AMD__

size_t nvte_amax_workspace_size(size_t N) {
The name workspace_size is ambiguous: in TE a workspace size is usually a byte count, but this function returns the number of float32 elements.
It should either return bytes and convert to a float count only when launching kernels, or be renamed to indicate that it returns a number of float elements.
Thanks, I changed it to nvte_amax_workspace_num_blocks in b07edf6.
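To illustrate the distinction raised in this thread, here is a minimal sketch that keeps the element count and the byte count in separately named helpers. The names `amax_workspace_num_blocks` and `amax_workspace_bytes`, the `elems_per_block` parameter, and the ceiling-division sizing are all assumptions for illustration; the actual `nvte_amax_workspace_num_blocks` may compute or cap the block count differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical: number of per-block partial results needed for n elements,
// assuming each block reduces at most elems_per_block elements.
inline std::size_t amax_workspace_num_blocks(std::size_t n,
                                             std::size_t elems_per_block) {
  assert(elems_per_block > 0);
  return (n + elems_per_block - 1) / elems_per_block;  // ceiling division
}

// Hypothetical: the same workspace expressed in bytes, for callers that
// allocate raw memory. One float32 partial result is stored per block.
inline std::size_t amax_workspace_bytes(std::size_t n,
                                        std::size_t elems_per_block) {
  return amax_workspace_num_blocks(n, elems_per_block) * sizeof(float);
}
```

With the unit spelled out in each name, an allocation site can call the byte-returning helper and a kernel launch can call the block-count helper without either having to guess what "size" means.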
Description
Implements a two-stage HIP kernel for the amax operation as an alternative to the original implementation, which uses atomic reductions. The two-stage kernel becomes the default implementation. Users can run export NVTE_USE_ATOMIC_AMAX=1 to use the atomic amax kernel instead.
The corresponding Triton kernel is at #385.
Fixes https://github.com/ROCm/frameworks-internal/issues/14303.
See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.
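For readers unfamiliar with the approach, the two-stage idea can be sketched as a sequential CPU reference in plain C++. This is not the HIP kernel from the PR: the function names and the `block_elems` parameter are hypothetical, and a GPU version would typically run stage 1 across many thread blocks in parallel and stage 2 as a small follow-up reduction over the per-block results.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1: each "block" reduces its own slice of the input to one partial
// maximum of |x|. On a GPU, each slice would map to one thread block writing
// one entry of the workspace, with no atomics needed.
std::vector<float> stage1_block_amax(const std::vector<float>& x,
                                     std::size_t block_elems) {
  assert(block_elems > 0);
  const std::size_t num_blocks = (x.size() + block_elems - 1) / block_elems;
  std::vector<float> partial(num_blocks, 0.0f);
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::size_t begin = b * block_elems;
    const std::size_t end = std::min(begin + block_elems, x.size());
    for (std::size_t i = begin; i < end; ++i) {
      partial[b] = std::max(partial[b], std::fabs(x[i]));
    }
  }
  return partial;
}

// Stage 2: reduce the per-block partial results to the final amax.
float stage2_final_amax(const std::vector<float>& partial) {
  float amax = 0.0f;
  for (float p : partial) {
    amax = std::max(amax, p);
  }
  return amax;
}

// Convenience wrapper that chains both stages.
float two_stage_amax(const std::vector<float>& x, std::size_t block_elems) {
  return stage2_final_amax(stage1_block_amax(x, block_elems));
}
```

The atomic variant would instead have every block update a single shared result, which avoids the workspace and the second pass at the cost of contention on one memory location.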
TODO:
nvte_compute_amax FIXMEs in the code
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: