Update build system for AOTriton 0.11b and upgrade FWD call to V3 API
#360
base: dev
Conversation
```cmake
string(REPLACE ";" "," ARCH_LIST_COMMA_STR "${CMAKE_HIP_ARCHITECTURES}")
set(__AOTRITON_VER "0.11b")
set(__AOTRITON_SHA256
  "a2a974e0ad929a5e5827c0f896c59bda4872459cbaf8dd8e0a00407f404491cf" # rocm7.0
```
This can be removed.
TE should never download the runtime; it must build the runtime from source with a custom suffix, due to a potential conflict with the libaotriton shipped by PyTorch.
```cpp
bool is_training, float scaling_factor, float dropout_probability,
NVTE_QKV_Layout layout,
NVTE_Bias_Type bias_type, NVTE_Mask_Type mask_type,
int window_size_left, int window_size_right, NVTE_QKV_Layout layout,
```
Consider using std::optional
All integer values are valid inputs for AOTriton's SWA (hence I sometimes refer to it as "generic SWA").
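As a rough sketch of what that suggestion could look like (the function and parameter names here are illustrative, not the actual TE signature), the window bounds would be passed as optionals, with an empty value meaning "unbounded on this side" and any integer otherwise accepted as a generic-SWA bound:

```cpp
#include <optional>

// Hypothetical sketch only: std::nullopt means "no bound on this side";
// any integer value is otherwise a valid generic-SWA window bound.
void fused_attn_fwd_sketch(bool is_training, float scaling_factor,
                           float dropout_probability,
                           std::optional<int> window_size_left,
                           std::optional<int> window_size_right);
```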
```cpp
void fused_attn_aotriton_fwd_qkvpacked(
    size_t b, size_t h, size_t max_seqlen, size_t d,
    bool is_training, float attn_scale, float dropout,
    uint64_t window_left, uint64_t window_right,
```
uint64_t?
I think it should be int64_t, as the default (non-SWA) window sizes from NV upstream are -1.
I've set it to int32_t since that's ultimately what AOTriton uses.
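A minimal illustration of the trade-off being discussed (the function name here is made up for the example): the NV-upstream convention uses -1 as the default, so the public parameters must be signed, and they are narrowed to the 32-bit type AOTriton consumes at the call site:

```cpp
#include <cstdint>

// Sketch only: signed parameters so the upstream default of -1 round-trips,
// narrowed to int32_t because that is what AOTriton ultimately takes.
void fwd_window_args_sketch(int64_t window_left, int64_t window_right) {
  int32_t aot_left = static_cast<int32_t>(window_left);    // assumed to fit
  int32_t aot_right = static_cast<int32_t>(window_right);  // assumed to fit
  (void)aot_left;
  (void)aot_right;
}
```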
```cpp
void fused_attn_aotriton_fwd(
    size_t b, size_t h_q, size_t h_kv, size_t max_seqlen_q, size_t max_seqlen_kv, size_t d,
    bool is_training, float attn_scale, float dropout,
    uint64_t window_left, uint64_t window_right,
```
uint64?
```cpp
void fused_attn_aotriton_fwd_qkvpacked(
    size_t b, size_t h, size_t max_seqlen, size_t d,
    bool is_training, float attn_scale, float dropout,
    uint64_t window_left, uint64_t window_right,
```
As I mentioned above, use std::optional<int> for window sizes.
```cpp
  varlen_type = 1;
}

int window_left = 0;
```
Should we define window_left/right as aotriton::v3::flash::WindowValue, since we already introduce this type at line 244?
No, these values are for generic SWA and any integer is valid input.
```cpp
void fused_attn_aotriton_fwd_qkvpacked(
    size_t b, size_t h, size_t max_seqlen, size_t d,
    bool is_training, float attn_scale, float dropout,
    uint64_t window_left, uint64_t window_right,
```
I think it should be int64_t, as the default (non-SWA) window sizes from NV upstream are -1.
```cpp
);
// Next we guard against an initial workspace-allocation which occurs in the
// JAX TE extension. We check for both pointers being null while retaining
// shape data, indicating the use of dummy data in the allocation pass.
```
Is it specific to JAX? How does Torch extension behave?
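For readers outside the thread, here is a minimal sketch of the guard the quoted comment describes; the TensorView struct is a stand-in, not TE's actual tensor type, and whether PyTorch's extension hits the same path is the open question above:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for TE's tensor wrapper, for illustration only.
struct TensorView {
  void* dptr = nullptr;       // device pointer (null during the allocation pass)
  std::vector<size_t> shape;  // shape metadata is still populated
};

// The JAX extension first runs a workspace-allocation pass with dummy data:
// both pointers are null while the shapes are retained, so we can detect it.
bool is_workspace_allocation_pass(const TensorView& q, const TensorView& o) {
  return q.dptr == nullptr && o.dptr == nullptr && !q.shape.empty();
}
```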
```cpp
size_t workspace_size = 0;
bool pad_between_seqs = get_pad_between_seqs(
    input_cu_seqlens_q,
    input_cu_seqlens_kv,
```
I'm confused: how does seqlens_kv serve as the padded seqlens_q?
```cmake
)
set(__AOTRITON_IMAGE_SHA256_LIST
  "3a06a99971dddb7703a30378f1c5d6b41468d926ea51821156d1b6857b985bc4" # amd-gfx942
  "27fc21f6761d57987a700436de8cf29cbdd9eeee91318dfed596eeb147d219ad" # amd-gfx950
```
Why not add the other archs?
I wasn't sure what our support matrix for archs was.
```cpp
//TODO: release after TE integrates swa into AOTriton
bool is_no_mask_window_size= window_size_left == -1 && window_size_right == -1;
bool is_causal_mask_window_size = window_size_left ==-1 && window_size_right ==0;
```
Looks like we don't support the general SWA feature right now.
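To spell out the convention the quoted check relies on (this is just a restatement of the excerpt, not new TE code): (-1, -1) means no mask and (-1, 0) means a causal mask; any other pair would be a generic SWA window, which this path currently rejects until TE wires SWA through AOTriton:

```cpp
// Sketch of the quoted guard: only the two special window-size pairs are
// accepted until generic SWA is plumbed through to AOTriton.
bool is_supported_window_size(int window_size_left, int window_size_right) {
  bool is_no_mask = (window_size_left == -1 && window_size_right == -1);
  bool is_causal_mask = (window_size_left == -1 && window_size_right == 0);
  return is_no_mask || is_causal_mask;
}
```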
@xinyazhang could you take a second pass whenever you get the chance? Let me know if there's anything else you'd like addressed, or if you're satisfied with the changes. Thanks!
xinyazhang left a comment:
LGTM
However, @Micky774, I just rolled out 0.11.1b yesterday, which fixes the linker-script incompatibility issue and restores Navi31 support (unsure whether TE's users care about it). You probably want to use that instead; see ROCm/pytorch#2801.
@Micky774 Could you address Ilya's comments? Also, please merge the latest dev into this PR.
Description
Update build system for AOTriton 0.11b and upgrade FWD call to V3 API.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: