Conversation

@matthiasdiener
Contributor

@matthiasdiener matthiasdiener commented Nov 12, 2025

Description

Implements a two-stage HIP kernel for the amax operation as an alternative to the original implementation, which uses atomic reductions. The two-stage kernel becomes the default implementation. To use the atomic amax kernel instead, set export NVTE_USE_ATOMIC_AMAX=1.
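
For readers unfamiliar with the pattern, here is a minimal host-side C++ sketch of a two-stage amax reduction. It is not the HIP kernel from this PR: the function names and the block_elems parameter are hypothetical, and the loop over blocks stands in for what thread blocks would do in parallel on the GPU.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1: each "block" reduces its own contiguous chunk to one partial amax.
// On a GPU these blocks would run in parallel and write into a workspace
// buffer, so no atomics on a shared result are needed.
// block_elems must be greater than zero.
std::vector<float> stage1_block_amax(const std::vector<float>& x,
                                     std::size_t block_elems) {
  const std::size_t num_blocks = (x.size() + block_elems - 1) / block_elems;
  std::vector<float> block_amax(num_blocks, 0.0f);
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::size_t begin = b * block_elems;
    const std::size_t end = std::min(begin + block_elems, x.size());
    for (std::size_t i = begin; i < end; ++i) {
      block_amax[b] = std::max(block_amax[b], std::fabs(x[i]));
    }
  }
  return block_amax;
}

// Stage 2: a single small pass reduces the per-block partials to the final amax.
float stage2_final_amax(const std::vector<float>& block_amax) {
  float amax = 0.0f;
  for (float v : block_amax) {
    amax = std::max(amax, v);
  }
  return amax;
}

// Convenience wrapper that chains both stages.
float two_stage_amax(const std::vector<float>& x, std::size_t block_elems) {
  return stage2_final_amax(stage1_block_amax(x, block_elems));
}
```

The atomic variant would instead have every block update one shared amax value; the two-stage form trades that contention for a small intermediate buffer and a second, much smaller reduction.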

The corresponding Triton kernel is at #385.

Fixes https://github.com/ROCm/frameworks-internal/issues/14303.

See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.

TODO:

  • Fix other call sites of nvte_compute_amax
  • Address FIXMEs in the code

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Nov 14, 2025
@matthiasdiener matthiasdiener marked this pull request as ready for review November 24, 2025 16:47
const bool UseBlockAmax =
    (block_amax != nullptr) &&
    (block_capacity >= num_blocks) &&
    !nvte_use_atomic_amax();
Collaborator

block_amax is expected to be nullptr when nvte_use_atomic_amax() is true, so the !nvte_use_atomic_amax() check is redundant.

Contributor Author

Changed the logic in eba552e.
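
One way to read the reviewer's point is sketched below in host-side C++. This is not necessarily what eba552e does: all function names are hypothetical stand-ins, and the rule that only the value "1" enables the atomic path is an assumption. If the caller passes a null workspace whenever the atomic path is requested, the launch-side check no longer needs to consult the environment variable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for nvte_use_atomic_amax(): true only when
// NVTE_USE_ATOMIC_AMAX is set to "1" (assumed parsing rule).
bool use_atomic_amax_sketch() {
  const char* value = std::getenv("NVTE_USE_ATOMIC_AMAX");
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Caller side: hand out the workspace only when the two-stage path is wanted.
float* select_block_amax(float* workspace) {
  return use_atomic_amax_sketch() ? nullptr : workspace;
}

// Launch side: given the invariant established by the caller, a non-null
// block_amax already implies that the atomic path was not requested.
bool use_block_amax(const float* block_amax, std::size_t block_capacity,
                    std::size_t num_blocks) {
  return block_amax != nullptr && block_capacity >= num_blocks;
}
```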

@matthiasdiener
Contributor Author

See #384 for the GH actions CI.

@matthiasdiener matthiasdiener changed the title from "Current scaling: two-stage amax kernel" to "Current scaling: two-stage HIP amax kernel" Dec 1, 2025
Collaborator

@wangye805 wangye805 left a comment

LGTM


#ifdef __HIP_PLATFORM_AMD__

size_t nvte_amax_workspace_size(size_t N) {
Collaborator

workspace_size is ambiguous: in TE it usually means a size in bytes, but here the number of float32 elements is returned.
The function should either return bytes, with the cast to float done only when launching kernels, or be renamed to indicate that it returns a number of float elements.

Contributor Author

Thanks, I changed it to nvte_amax_workspace_num_blocks in b07edf6.
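
To illustrate the distinction raised in this thread, here is a small C++ sketch that keeps the element count and the byte size in separately named helpers. The names, the sizing rule, and the elems_per_block and max_blocks parameters are hypothetical and are not taken from the PR; the real nvte_amax_workspace_num_blocks may compute its result differently.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sizing rule: one float32 partial result per block, capped at
// max_blocks, with at least one slot even for empty input.
// elems_per_block must be greater than zero.
std::size_t amax_num_blocks_sketch(std::size_t n, std::size_t elems_per_block,
                                   std::size_t max_blocks) {
  const std::size_t blocks = (n + elems_per_block - 1) / elems_per_block;
  return std::max<std::size_t>(1, std::min(blocks, max_blocks));
}

// Convert to bytes only at the point where memory is allocated, so the
// unit of each value is clear from the function name.
std::size_t amax_workspace_bytes_sketch(std::size_t num_blocks) {
  return num_blocks * sizeof(float);
}
```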

@matthiasdiener matthiasdiener merged commit d9b4003 into ROCm:dev Dec 3, 2025
5 checks passed
@matthiasdiener matthiasdiener deleted the speedup-amax-kernel branch December 3, 2025 05:19
3 participants