Conversation

@matthiasdiener (Contributor) commented Nov 26, 2025

Description

The corresponding 2-stage HIP kernel amax implementation is in #369.

Partially addresses https://github.com/ROCm/frameworks-internal/issues/14303.

See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added an optional 2-stage Triton amax implementation, enabled by default
    • Disable it by setting export NVTE_USE_ATOMIC_AMAX=1, which reverts to the previous atomic implementation
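For readers unfamiliar with the technique, the difference between the atomic and the two-stage approach can be sketched in plain NumPy. This is an illustrative sketch only: the actual implementation is a Triton GPU kernel, and none of the names below come from this PR.

```python
import numpy as np

def two_stage_amax(x: np.ndarray, block_size: int = 1024) -> float:
    """Sketch of a two-stage amax: per-block partial maxima, then one
    final reduction (illustrative stand-in for the Triton kernel)."""
    flat = np.abs(x.ravel())
    # Stage 1: each block computes the max of |x| over its own chunk
    # (in the Triton kernel, one program instance per block).
    partials = np.array([flat[i:i + block_size].max()
                         for i in range(0, flat.size, block_size)])
    # Stage 2: a second, small reduction over the per-block partials,
    # instead of every block issuing a contended atomic max on a
    # single global scalar (the previous NVTE_USE_ATOMIC_AMAX=1 path).
    return float(partials.max())
```

The second stage touches only one value per block, which is why it avoids the atomic contention of the single-stage version on large tensors.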

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

matthiasdiener and others added 30 commits November 12, 2025 14:10
This reverts commit 7d4054e.
@matthiasdiener changed the base branch from speedup-amax-kernel to dev on November 26, 2025 22:53
@matthiasdiener changed the title from "Speedup amax triton" to "[WIP] Speedup amax triton" on Nov 27, 2025
@matthiasdiener changed the title from "[WIP] Speedup amax triton" to "Current scaling: two-stage Triton amax kernel" on Dec 1, 2025
@matthiasdiener marked this pull request as ready for review on December 1, 2025 20:00
NVTE_USE_CAST_TRANSPOSE_TRITON=1 run_default_fa_lbl "triton" 1 test_float8_current_scaling_exact.py
NVTE_USE_ATOMIC_AMAX=1 run_default_fa 3 test_numerics.py
NVTE_USE_ATOMIC_AMAX=1 run_default_fa 3 test_fusible_ops.py
NVTE_USE_ATOMIC_AMAX=1 NVTE_USE_CAST_TRANSPOSE_TRITON=1 run_default_fa 3 test_numerics.py
Collaborator:

The triton path is not enabled by default, so I think you will need to test with both NVTE_USE_ATOMIC_AMAX=1 and NVTE_USE_ATOMIC_AMAX=0 when NVTE_USE_CAST_TRANSPOSE_TRITON is 1.

Also, I'm not sure about the runtime cost of adding two new pytests in level 3.

Contributor Author:

> The triton path is not enabled by default, so I think you will need to test with both NVTE_USE_ATOMIC_AMAX=1 and NVTE_USE_ATOMIC_AMAX=0 when NVTE_USE_CAST_TRANSPOSE_TRITON is 1.

I added both cases in d7259d1.

> Also, I'm not sure about the runtime cost of adding two new pytests in level 3.

test_numerics.py takes about 5 min and test_fusible_ops.py about 1 min (on gfx942), doubled since we run them with both NVTE_USE_ATOMIC_AMAX=0 and =1. Perhaps adding just the test in 188b7ca is enough?

Collaborator:

5 mins sounds okay for level 3. @ipanfilo , what do you think?

Contributor Author:

After discussing with @wenchenvincent, we concluded that it is worth keeping the extra tests around.


Quantizes the input tensor using a specified quantizer,
with an option to utilize Triton-based `cast_transpose` for performance.
"""
from ..tensor.float8_tensor import Float8CurrentScalingQuantizer
Contributor Author:

@wangye805 Do you remember why current scaling was disabled here (in #374)?

Collaborator:

I recall I moved this line to

from ..tensor.float8_tensor import Float8CurrentScalingQuantizer

instead of disabling the current scaling quantizer entirely, in order to resolve a circular import issue.
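The deferred-import pattern referenced here can be sketched with a self-contained simulation. The module and class names below are illustrative stand-ins for the real TransformerEngine layout, not its actual modules; the point is only that an import inside a function body executes at call time, after both modules have finished loading, so the cycle disappears.

```python
import sys
import types

# Simulate two modules that reference each other. In the real code base
# these would be separate files; here we register synthetic modules in
# sys.modules so the example runs standalone.
quant_mod = types.ModuleType("quant_mod")
tensor_mod = types.ModuleType("tensor_mod")
sys.modules["quant_mod"] = quant_mod
sys.modules["tensor_mod"] = tensor_mod

quant_src = """
def quantize(x):
    # Deferred import: resolved when quantize() is called, not when the
    # module loads, so tensor_mod may safely import quant_mod at top level.
    from tensor_mod import Float8CurrentScalingQuantizer
    return Float8CurrentScalingQuantizer.scale * x
"""

tensor_src = """
import quant_mod  # top-level import of the other module is now safe

class Float8CurrentScalingQuantizer:
    scale = 2.0  # placeholder value for the demo
"""

exec(quant_src, quant_mod.__dict__)
exec(tensor_src, tensor_mod.__dict__)

assert quant_mod.quantize(3.0) == 6.0
```

Had `from tensor_mod import ...` sat at module scope in `quant_src`, loading either module first would have hit a partially initialized module, which is the circular-import failure the moved line avoids.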

@matthiasdiener merged commit bdd6c63 into dev on Dec 20, 2025
5 of 6 checks passed
@matthiasdiener deleted the speedup-amax-triton branch on December 20, 2025 09:09