Current scaling: two-stage HIP amax kernel #369

matthiasdiener · 2025-11-12T20:16:12Z

Description

Implements a two-stage HIP kernel for the amax operation as an alternative to the original implementation, which uses atomic reductions. The two-stage kernel becomes the default implementation. To fall back to the atomic amax kernel, set export NVTE_USE_ATOMIC_AMAX=1.
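For readers unfamiliar with the pattern, here is a minimal host-side sketch of a two-stage amax reduction. It is plain C++ rather than the HIP kernel from this PR: the function names, the `block_size` parameter, and the contiguous chunking are assumptions made for illustration only. Stage 1 writes one partial maximum per block into a workspace, and stage 2 reduces those partials, so no atomic update of a shared value is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1 (hypothetical stand-in for the first kernel): each "block" reduces
// its own contiguous chunk of the input to a single partial amax.
std::vector<float> amax_stage1(const std::vector<float>& input, std::size_t block_size) {
  assert(block_size > 0);  // assumption: a block always covers at least one element
  const std::size_t num_blocks = (input.size() + block_size - 1) / block_size;
  std::vector<float> block_amax(num_blocks, 0.0f);
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::size_t begin = b * block_size;
    const std::size_t end = std::min(begin + block_size, input.size());
    float m = 0.0f;
    for (std::size_t i = begin; i < end; ++i) {
      m = std::max(m, std::fabs(input[i]));
    }
    block_amax[b] = m;  // one write per block, no atomics
  }
  return block_amax;
}

// Stage 2 (hypothetical stand-in for the second kernel): reduce the
// per-block partial results to the final amax.
float amax_stage2(const std::vector<float>& block_amax) {
  float m = 0.0f;
  for (float v : block_amax) {
    m = std::max(m, v);
  }
  return m;
}

// Convenience wrapper that runs both stages back to back.
float two_stage_amax(const std::vector<float>& input, std::size_t block_size) {
  return amax_stage2(amax_stage1(input, block_size));
}
```

On a GPU, stage 1 would run as one thread block per chunk and stage 2 as a small follow-up launch over the workspace. The sketch only shows the data flow, not the actual launch configuration used in current_scaling.cu.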

The corresponding Triton kernel is at #385.

Fixes https://github.com/ROCm/frameworks-internal/issues/14303.

See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.

TODO:

Fix other call sites of nvte_compute_amax
Address FIXMEs in the code

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2025-11-26T00:43:30Z

transformer_engine/common/recipe/current_scaling.cu

+  const bool UseBlockAmax =
+      (block_amax != nullptr) &&
+      (block_capacity >= num_blocks) &&
+      !nvte_use_atomic_amax();


block_amax is expected to be nullptr when nvte_use_atomic_amax() returns true, so the !nvte_use_atomic_amax() check is redundant.

Changed the logic in eba552e.
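A minimal sketch of what the simplified condition could look like once the environment check is dropped. This is an illustration of the reviewer's point, not necessarily the logic that landed in eba552e. The helper name `use_block_amax` is hypothetical, and it assumes the caller passes `block_amax == nullptr` whenever the atomic path was requested.

```cpp
#include <cstddef>

// Hypothetical helper: decide whether the two-stage (block amax) path can run.
// Assumption: the caller only allocates and passes a workspace when the atomic
// kernel was NOT requested, so a null pointer already encodes that choice.
inline bool use_block_amax(const float* block_amax,
                           std::size_t block_capacity,
                           std::size_t num_blocks) {
  return block_amax != nullptr && block_capacity >= num_blocks;
}
```

With this shape, the environment variable is consulted once, at workspace allocation time, rather than again inside the launch path.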

matthiasdiener · 2025-11-26T21:04:47Z

See #384 for the GH actions CI.

wangye805

LGTM

ipanfilo · 2025-12-02T20:47:44Z

transformer_engine/common/recipe/current_scaling.cu

+
+#ifdef __HIP_PLATFORM_AMD__
+
+size_t nvte_amax_workspace_size(size_t N) {


The name workspace_size is ambiguous: in TE it usually means a size in bytes, but here the number of float32 elements is returned.
The function should either return bytes, with the cast to float done only when launching kernels, or be renamed to indicate that it returns a number of float elements.

Thanks, I changed it to nvte_amax_workspace_num_blocks in b07edf6.
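To make the distinction raised in this thread concrete, here is a small sketch that keeps the element count and the byte size as two separately named quantities. The signatures and the parameters `threads_per_block` and `max_blocks` are assumptions for illustration. The real signature of nvte_amax_workspace_num_blocks is not shown in this conversation and may differ.

```cpp
#include <cstddef>

// Hypothetical: number of per-block partial results needed to reduce N elements.
// threads_per_block must be > 0; both parameters are illustrative, not TE's values.
inline std::size_t amax_workspace_num_blocks(std::size_t N,
                                             std::size_t threads_per_block,
                                             std::size_t max_blocks) {
  const std::size_t blocks = (N + threads_per_block - 1) / threads_per_block;
  return blocks < max_blocks ? blocks : max_blocks;
}

// Hypothetical: the byte size is derived only at the allocation site,
// one float32 slot per block.
inline std::size_t amax_workspace_bytes(std::size_t num_blocks) {
  return num_blocks * sizeof(float);
}
```

Naming the function after what it counts (blocks) leaves "size" free to keep its usual byte meaning elsewhere in the code base.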

matthiasdiener added 3 commits November 12, 2025 14:10

Current scaling: two-stage amax kernel

c15d93b

Merge branch 'dev' into speedup-amax-kernel

51fab36

bugfix graph capture

ae35e4c

matthiasdiener self-assigned this Nov 14, 2025

matthiasdiener added 10 commits November 17, 2025 10:36

Merge branch 'dev' into speedup-amax-kernel

77a68a7

outline workspace allocation

c0d8e73

Merge branch 'dev' into speedup-amax-kernel

6c3507d

Proper allocation of workspace

3c9de07

Merge branch 'dev' into speedup-amax-kernel

91249cc

add a test to compare the accuracy of both amax implementations

be0e0c8

add possibility to force using previous (atomic) kernel

bce34da

Merge branch 'dev' into speedup-amax-kernel

8c388cc

add copyrights

6388604

don't add extra template to kernel

9e6586f

matthiasdiener force-pushed the speedup-amax-kernel branch from 619fc5c to 9e6586f November 20, 2025 23:06

matthiasdiener added 3 commits November 21, 2025 15:03

make amax_kernel_threads usable in pytorch

18292bf

update remaining calls to nvte_compute_amax

a389455

Merge branch 'dev' into speedup-amax-kernel

d87ab8a

matthiasdiener marked this pull request as ready for review November 24, 2025 16:47

matthiasdiener requested review from ipanfilo, wangye805 and wenchenvincent as code owners November 24, 2025 16:47

additional copyrights

fd5dead

ipanfilo reviewed Nov 24, 2025

matthiasdiener added 3 commits November 24, 2025 14:52

avoid workspace allocations if NVTE_USE_ATOMIC_AMAX is set

16d3bf9

Merge branch 'dev' into speedup-amax-kernel

50b34aa

remove use_block_amax parameter, more cleanups

ef532b1

matthiasdiener mentioned this pull request Nov 25, 2025

CI: Fix failures on forked PRs and centralize Docker image config #380

Merged

wangye805 requested changes Nov 25, 2025

Factor workspace allocation into function

f933ef3

root and others added 2 commits November 25, 2025 23:39

Revert "expand test slightly"

63cff98

This reverts commit 7d4054e.

guard by HIP macro, address review comments

c7d44a7

ipanfilo reviewed Nov 26, 2025

bugfix workspace.data.dptr

f92b926

matthiasdiener mentioned this pull request Nov 26, 2025

[DO NOT MERGE] Speedup amax kernel CI test #384

Closed

matthiasdiener added 3 commits November 26, 2025 13:33

various cleanups

eba552e

Merge branch 'dev' into speedup-amax-kernel

0d6a177

simplify types in allocate_amax_workspace

8eda427

matthiasdiener requested review from ipanfilo and wangye805 November 26, 2025 21:04

matthiasdiener added 2 commits December 1, 2025 10:07

Merge branch 'dev' into speedup-amax-kernel

6990928

fix indentation

9ee618f

matthiasdiener mentioned this pull request Dec 1, 2025

Current scaling: two-stage Triton amax kernel #385

Merged

13 tasks

ipanfilo approved these changes Dec 1, 2025

matthiasdiener changed the title ~~Current scaling: two-stage amax kernel~~ Current scaling: two-stage HIP amax kernel Dec 1, 2025

Merge branch 'dev' into speedup-amax-kernel

77b1bc3

wangye805 requested changes Dec 2, 2025

matthiasdiener added 4 commits December 2, 2025 10:30

Use private implementation of DIVUP

1357d4b

define amax_kernel_threads on non-AMD

01b61b5

Revert "Use private implementation of DIVUP"

ed16f8f

This reverts commit 1357d4b.

Factor out workspace size calculation

95dcbdf

wangye805 approved these changes Dec 2, 2025

ipanfilo requested changes Dec 2, 2025

matthiasdiener added 2 commits December 2, 2025 14:56

change name

b07edf6

add copyright

233eb0a

matthiasdiener requested a review from ipanfilo December 2, 2025 22:03

ipanfilo approved these changes Dec 3, 2025

matthiasdiener merged commit d9b4003 into ROCm:dev Dec 3, 2025
5 checks passed

matthiasdiener deleted the speedup-amax-kernel branch December 3, 2025 05:19

