
Conversation

@VigneshwarJ (Contributor) commented Jun 16, 2025

The permlane_swap instructions depend on the exec mask; add the isConvergent flag to prevent the machine sinking pass from sinking them.

Fixes: SWDEV-537232
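
For reference, the new MIR test in this patch demonstrates the hazard the flag closes. Below is a minimal before/after sketch based on that test; the "after" form is what MachineSinking could produce without isConvergent, not actual compiler output:

  ; Before sinking: the swap executes in bb.0 under the full exec mask.
  bb.0:
    %9:vgpr_32 = GLOBAL_LOAD_DWORD killed %7, 0, 0, implicit $exec
    %10:vgpr_32, %11:vgpr_32 = V_PERMLANE32_SWAP_B32_e64 %9, %9, 0, 0, implicit $exec
    %19:sreg_64 = SI_IF %18, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
    S_BRANCH %bb.1

  ; After an incorrect sink: the swap now runs in bb.1 with exec narrowed by SI_IF,
  ; so lanes disabled by the branch no longer exchange their values.
  bb.1:
    %10:vgpr_32, %11:vgpr_32 = V_PERMLANE32_SWAP_B32_e64 %9, %9, 0, 0, implicit $exec
    %20:vgpr_32 = V_MAX_I32_e64 %10, %11, implicit $exec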
@llvmbot (Member) commented Jun 16, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Vigneshwar Jayakumar (VigneshwarJ)

Changes

The permlane_swap instructions depend on the exec mask; add the isConvergent flag to prevent the machine sinking pass from sinking them.

Fixes SWDEV-537232


Full diff: https://github.com/llvm/llvm-project/pull/144423.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/VOP1Instructions.td (+5-1)
  • (modified) llvm/lib/Target/AMDGPU/VOPInstructions.td (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir (+61)
diff --git a/llvm/lib/Target/AMDGPU/VOP1Instructions.td b/llvm/lib/Target/AMDGPU/VOP1Instructions.td
index 7fdd951ecbd3c..d6718c556db1e 100644
--- a/llvm/lib/Target/AMDGPU/VOP1Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP1Instructions.td
@@ -775,7 +775,8 @@ defm V_PRNG_B32 : VOP1Inst <"v_prng_b32", VOP_I32_I32, int_amdgcn_prng_b32>;
 
 let Constraints = "$vdst = $vdst_in, $src0_out = $src0",
      DisableEncoding="$vdst_in,$src0_out",
-     SchedRW = [Write32Bit, Write32Bit] in {
+     SchedRW = [Write32Bit, Write32Bit],
+     isConvergent = 1 in {
 let SubtargetPredicate = HasPermlane16Swap in {
 defm V_PERMLANE16_SWAP_B32 : VOP1Inst<"v_permlane16_swap_b32", VOP_PERMLANE_SWAP>;
 }
@@ -1550,8 +1551,11 @@ defm V_CVT_PK_F32_FP8    : VOP1_Real_NoDstSel_SDWA_gfx9<0x56>;
 defm V_CVT_PK_F32_BF8    : VOP1_Real_NoDstSel_SDWA_gfx9<0x57>;
 
 defm V_PRNG_B32            : VOP1_Real_gfx9 <0x58>;
+
+let isConvergent = 1 in {
 defm V_PERMLANE16_SWAP_B32 : VOP1_OpSel_Real_e32e64_gfx9<0x059>;
 defm V_PERMLANE32_SWAP_B32 : VOP1_OpSel_Real_e32e64_gfx9<0x05a>;
+}
 
 class MovDPP8Pattern<Predicate Pred, Instruction Inst, ValueType vt> : GCNPat <
   (vt (int_amdgcn_mov_dpp8 vt:$src, timm:$dpp8)),
diff --git a/llvm/lib/Target/AMDGPU/VOPInstructions.td b/llvm/lib/Target/AMDGPU/VOPInstructions.td
index 952ee2fe2c955..c0c874300ef0f 100644
--- a/llvm/lib/Target/AMDGPU/VOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/VOPInstructions.td
@@ -15,6 +15,7 @@ class LetDummies {
   bit isConvertibleToThreeAddress;
   bit isMoveImm;
   bit isReMaterializable;
+  bit isConvergent;
   bit isAsCheapAsAMove;
   bit FPDPRounding;
   Predicate SubtargetPredicate;
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir b/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir
index 0fc31ea9d6437..3cdba54026ccc 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir
@@ -733,3 +733,64 @@ body:             |
     liveins: $vgpr0, $vgpr1, $vgpr2_vgpr3, $vcc
     S_ENDPGM 0
 ...
+---
+name: test_no_sink_permlane_swap
+tracksRegLiveness: true
+machineFunctionInfo:
+  isEntryFunction: true
+body: |
+  ; GFX9-LABEL: name: test_no_sink_permlane_swap
+  ; GFX9: bb.0:
+  ; GFX9-NEXT:   successors: %bb.2(0x40000000), %bb.1(0x40000000)
+  ; GFX9-NEXT:   liveins: $vgpr0
+  ; GFX9-NEXT: {{  $}}
+  ; GFX9-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX9-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-NEXT:   [[S_MOV_B64_:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-NEXT:   [[COPY1:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_]]
+  ; GFX9-NEXT:   [[GLOBAL_LOAD_DWORD:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY1]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-NEXT:   [[V_PERMLANE32_SWAP_B32_e64_:%[0-9]+]]:vgpr_32, [[V_PERMLANE32_SWAP_B32_e64_1:%[0-9]+]]:vgpr_32 = V_PERMLANE32_SWAP_B32_e64 [[GLOBAL_LOAD_DWORD]], [[GLOBAL_LOAD_DWORD]], 0, 0, implicit $exec
+  ; GFX9-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32(s32) = COPY $vgpr0
+  ; GFX9-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 1
+  ; GFX9-NEXT:   [[V_CMP_LT_I32_e64_:%[0-9]+]]:sreg_64 = V_CMP_LT_I32_e64 [[COPY2]](s32), [[S_MOV_B32_]], implicit $exec
+  ; GFX9-NEXT:   [[SI_IF:%[0-9]+]]:sreg_64 = SI_IF [[V_CMP_LT_I32_e64_]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; GFX9-NEXT:   S_BRANCH %bb.1
+  ; GFX9-NEXT: {{  $}}
+  ; GFX9-NEXT: bb.1:
+  ; GFX9-NEXT:   successors: %bb.2(0x80000000)
+  ; GFX9-NEXT: {{  $}}
+  ; GFX9-NEXT:   [[V_MAX_I32_e64_:%[0-9]+]]:vgpr_32 = V_MAX_I32_e64 [[V_PERMLANE32_SWAP_B32_e64_]], [[V_PERMLANE32_SWAP_B32_e64_1]], implicit $exec
+  ; GFX9-NEXT: {{  $}}
+  ; GFX9-NEXT: bb.2:
+  ; GFX9-NEXT:   successors: %bb.3(0x80000000)
+  ; GFX9-NEXT: {{  $}}
+  ; GFX9-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[V_MOV_B32_e32_]], %bb.0, [[V_MAX_I32_e64_]], %bb.1
+  ; GFX9-NEXT:   SI_END_CF [[SI_IF]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; GFX9-NEXT: {{  $}}
+  ; GFX9-NEXT: bb.3:
+  ; GFX9-NEXT:   S_ENDPGM 0, implicit [[PHI]]
+  bb.0:
+    liveins: $vgpr0
+    %1:vgpr_32 = COPY $vgpr0
+    %3:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %5:sreg_64 = S_MOV_B64 0
+    %7:vreg_64 = COPY %5
+    %9:vgpr_32 = GLOBAL_LOAD_DWORD killed %7, 0, 0, implicit $exec :: (load (s32), addrspace 1)
+    %10:vgpr_32, %11:vgpr_32 = V_PERMLANE32_SWAP_B32_e64 %9:vgpr_32, %9:vgpr_32, 0, 0, implicit $exec
+    %15:vgpr_32(s32) = COPY $vgpr0
+    %16:sreg_32 = S_MOV_B32 1
+    %17:sreg_64 = V_CMP_LT_I32_e64 %15(s32), %16, implicit $exec
+    %18:sreg_64 = COPY %17
+    %19:sreg_64 = SI_IF %18, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+    S_BRANCH %bb.1
+
+  bb.1:
+    %20:vgpr_32 = V_MAX_I32_e64 %10:vgpr_32, %11:vgpr_32, implicit $exec
+
+  bb.2:
+    %22:vgpr_32 = PHI %3, %bb.0, %20, %bb.1
+    SI_END_CF %19, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+
+  bb.3:
+    S_ENDPGM 0, implicit %22
+...

@jayfoad (Contributor) left a comment

Quoting myself from #142962: This seems reasonable, given that all the permlane intrinsics are marked as convergent. (Maybe TableGen should check that instruction selection patterns preserve the "convergent" property, like it does for some other properties.)

Should also do this for (see the sketch below):

  • V_PERMLANE64_B32
  • V_PERMLANE16_VAR_B32
  • V_PERMLANEX16_VAR_B32
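
A follow-up could wrap those definitions the same way this patch wraps the swap instructions. This is purely an illustrative sketch of the wrapping pattern; the real defm lines and their VOP profiles live in the VOP TableGen files and are not reproduced here:

  // Mark the remaining cross-lane permlane ops convergent so MachineSinking
  // cannot make them control-dependent on additional exec-mask changes.
  let isConvergent = 1 in {
    // ... the existing defm V_PERMLANE64_B32, V_PERMLANE16_VAR_B32 and
    // V_PERMLANEX16_VAR_B32 definitions would go inside this block ...
  }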

@VigneshwarJ (Contributor, Author) commented:

@jayfoad, should I merge this and create a new PR for the other PERMLANE instructions you mentioned?

@jayfoad (Contributor) commented Jun 17, 2025

> Should I merge this and create a new PR for the other PERMLANE instructions you mentioned?

Sure

Review comment (Contributor) on the new MIR test:

    %10:vgpr_32, %11:vgpr_32 = V_PERMLANE32_SWAP_B32_e64 %9:vgpr_32, %9:vgpr_32, 0, 0, implicit $exec
    %15:vgpr_32(s32) = COPY $vgpr0
    %16:sreg_32 = S_MOV_B32 1
    %17:sreg_64 = V_CMP_LT_I32_e64 %15(s32), %16, implicit $exec

Suggested change:
-    %17:sreg_64 = V_CMP_LT_I32_e64 %15(s32), %16, implicit $exec
+    %17:sreg_64 = V_CMP_LT_I32_e64 %15, %16, implicit $exec

Drop the accidental LLTs.

Review comment (Contributor) on the new MIR test:

    ; GFX9-NEXT:   S_ENDPGM 0, implicit [[PHI]]
    bb.0:
      liveins: $vgpr0
      %1:vgpr_32 = COPY $vgpr0

Should run -run-pass=none to compact the register numbers.
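
Concretely, that is just a round-trip of the test through llc so the virtual register numbers become dense. A possible invocation, assuming the -mtriple/-mcpu flags are taken from the test's existing RUN line (not shown here), after which the CHECK lines would be regenerated with update_mir_test_checks.py:

  llc -mtriple=amdgcn -run-pass=none -o compacted.mir \
      llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir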

@VigneshwarJ requested a review from arsenm June 17, 2025 17:15
@arsenm merged commit 1b83f10 into llvm:main Jun 20, 2025
7 checks passed
ThomasRaoux pushed a commit to triton-lang/triton that referenced this pull request Jun 25, 2025
This picks up a bug fix for AMDGPU v_permlane_swap:

llvm/llvm-project#144423
Without this fix, v_permlane_swap is wrongly sunk.

Along the way we need to fix API changes:

  • Add a header file for the class IRBuilder
  • Add a missing default parameter in convertFuncOpToLLVMFuncOp
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 26, 2025
…4423) (llvm#2765)

The permlane_swap instructions depend on the exec mask; add the isConvergent flag
to prevent machine sinking of these instructions.

Fixes: SWDEV-537232
ptrojahn pushed a commit to ptrojahn/triton that referenced this pull request Jul 10, 2025
AlexAUT pushed a commit to ROCm/triton that referenced this pull request Jul 31, 2025
jataylo pushed a commit to ROCm/triton that referenced this pull request Aug 1, 2025
* [AMD] Avoid async load to pipeline for less than 32bit load (triton-lang#7250)

We can only use AsyncCopy if the final load width can be >= 4 bytes. 
`triton::canBeConvertedToAsyncLoad` checks that the vecSize of the
source is large enough. Additionally we need to ensure the register to
shared layout (blocked+shared) does have enough contiguous elements
since we cannot scatter into LDS.

Before this PR we will abort compilation instead of falling back to
pipelining through registers.

* [AMD] Pipeline small tensors w/ registers only on GFX950 (triton-lang#7171)

Fixes a perf regression on gfx942 but preserves functionality for
gfx950 (and above).

* Reland "[AMD] Optimize reduction with v_permlane intrinsics in GFX950" (triton-lang#7321)

triton-lang#7291 fixed the
LLVM issue that caused correctness problems. Now
we can reland this patch.

* [Pipeliner] Expose core pipeliner utilities to reuse in AMD backend (triton-lang#7222)

This PR exposes (via header) core pipelining utilities/helpers:

```c++
bool hasGpuBarriers(scf::ForOp forOp);
bool isSafeToPipeline(scf::ForOp forOp);
llvm::MapVector<Operation *, std::pair<int, Operation *>>
loadOpsToIndirectionLevel(scf::ForOp forOp, bool pipelineWithoutDot,
                          triton::ModuleAxisInfoAnalysis &axisInfoAnalysis,
                          int numStages, bool filterSmall = true);
void scheduleDistanceOneDependencies(scf::ForOp forOp,
                                     CoarseSchedule &schedule);
void scheduleRemainingToLastStage(scf::ForOp forOp, CoarseSchedule &schedule,
                                  CoarseSchedule::Cluster afterPrologue);
```

They are directly useable by AMD's pipeliner. 

Note, this is basically NFC for AMD because AMD's pipeliner simply had
copy-paste of the same functions from ~last year.


Small API changes:

1. On NV we do not pipeline small loads (vec width < 32b). On AMD we do.
The choice is made inside `isPipeliningBeneficial` inside
`loadOpsToIndirectionLevel`. To support AMD I have added a flag
`filterSmall`.
2. On AMD the load `use`s (computed as a matter of course in
`loadOpsToIndirectionLevel`) are used (no pun intended) whereas on NV
they are not. To support AMD I keep those `use`s in the
`llvm::MapVector<Operation *, std::pair<int, Operation *>>` return from
`loadOpsToIndirectionLevel`.

These two small changes are the only "non-NFC" changes.

* [AMD] Retire local prefetch schedule hint variant (triton-lang#7395)

This variant was from some prior experiments. We have a better way
to implement later.

* [AMD] Retire TritonAMDGPU_OpIdxAttr and TritonAMDGPU_InstCounter  (triton-lang#7476)

triton-lang#7395 retired the local
prefetch schedule variant. This made `TritonAMDGPU_OpIdxAttr` and
`TritonAMDGPU_InstCounter` unused which are removed by this PR.

* [AMD][NFC] Split createAndSchedule* in stream pipeliner(triton-lang#7514)

Splits `createAndScheduleAsyncCopy` and `createAndScheduleStreamCopy` to
make it reusable if we want to schedule the ops differently in a future
PR.

* [AMD] Refactor StreamPipeliner to use more common functions (triton-lang#7526)

Further refactoring of Streampipeliner.cpp to use more common pipeliner
functionality: `triton::createAllocation`,
`triton::createSingleBufferView`, `triton::replaceWithSharedLoad` and a
bit of general cleanup.

Overall NFC except:
- The order of LocalDealloc is reversed now
- The memdesc of the subview additionally includes the allocSize

Also we had no lit test checking that the LocalLoad consumes the
AsyncToken so I adjusted one to include the check.

* [AMD] NFC: Refactor stream pipeliner to better encapsulate functions (triton-lang#7540)

Mostly moves code around to reduce the dependencies between functions
and further splits up functions doing more than one thing
(`createAndSchedule*`,` preprocessAndBuildSchedule`). This will also
allow us to use more common pipeliner functionality in a future PR, e.g.
`createAsyncCopy`.

* [FA] Set vecSize=nonKDim for V shared layout to avoid bank conflicts

I'll submit a PR upstream later.

* [GEMM] Add combine dot_scaled and addF

* [AMD][NFC] Consolidate initialization in initSchedule for pipeliner (triton-lang#7556)

Moves all initializations of stages to `initSchedule`. Missed this one
in the last PRs.

* [AMD] NFC: Drop version minor for AMD MFMA layout (triton-lang#7285)

AMD's MFMA layout does not need version minor information like NVIDIA.
It always defaults to 0 in the current codebase. The PR drops version
minor and changes to a single `version` parameter for MFMA layout.

* [AMD] Add tilesPerWarp parameter to mfma layout (triton-lang#7283)

This PR introduces the tilesPerWarp parameter to the MFMA layout.
Previously, the MFMA layout assumed that each warp within a CTA tile
computed a single MFMA tile.
When the tensor was larger than a single CTA tile, these tiles were
repeated across the tensor.
In this setup, the output tiles computed by each wave were strided by
the number of warps
per CTA in both row and column dimensions.

For instance, with 16 MFMA tiles and warpsPerCTA = [2, 2], the
distribution of
warps across the MFMA tiles looked like:

w0 w1 w0 w1
w2 w3 w2 w3
w0 w1 w0 w1
w2 w3 w2 w3

The new tilesPerWarp parameter allows each warp to compute contiguous
MFMA tiles
in the row and/or column dimensions. Using the same example with
tilesPerWarp = [2, 2], the layout becomes:

w0 w0 w1 w1
w0 w0 w1 w1
w2 w2 w3 w3
w2 w2 w3 w3

While this is a general enhancement, the main motivation for introducing
this parameter
is to improve memory access efficiency for scale tensors in scaled dot
operations.
Specific patterns and use cases will be implemented in follow-up PRs.

---------

Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>

* [AMD] Add support for pingpong GEMM using async_copy

* [BACKEND] combineRedundantWaitOps should not combine across loops/branches (triton-lang#7593)

`combineRedundantWaitOps` did skip over branches/loops, so if we end up
with something like:

```mlir
ttg.async_wait
scf.for
  ....
  scf.yield
ttg.async_wait
```
we merge the async_waits in the prologue and epilogue because we do not
find a `ttg.commit_group` in between. This PR stops the forward search
if we encounter a branch/loop. I can also walk through all successor
blocks if we think this is worth the effort.

This problem was not triggered before because the `ttg.async_wait` was
scheduled in the same stage as its user(s) so we ended up with no
`ttg.async_wait` in the prologue or there was another prefetch after it
in the prologue.

Since triton-lang#7458 we might place the
`ttg.async_wait` in the previous stage compared to its user(s) so we
might end up with the problematic IR.

* [AMD][NFC] Group scheduling functions in StreamPipeliner (triton-lang#7607)

NFC: Groups all scheduling related function to a namespace to prepare
for additional scheduling variants.

* [AMD] Add pingpong transformation for chained dot schedule (triton-lang#7638)

Adds support to enable pingpong for loops scheduled with the new
`ChainedDotSchedule` introduced by
triton-lang#7601.

The schedule already places the ops in the correct order so we just have
to insert the sync ops to ensure proper pingpong'ing.

* [AMD] Fix pingpong ChainedDot for empty second memory cluster (triton-lang#7694)

triton-lang#7638 introduced a null
pointer access (during review adjustments) if the second memory cluster
is empty or if there are no memory clusters at all.

Added a lit test to catch it and revert to the old logic.

* [AMD] Remove bypass permute optimization for AsyncCopy (triton-lang#7704)

We can only bypass ds_bpermute to apply the swizzling if lanes loading
the same row read a contiguous chunk of memory from HBM, which we cannot
infer when lowering to LLVM.
The current selection does only check if the elements for each lane are
contiguous which is not strict enough.

* [AMD] Add ChainedDotSchedule to StreamPipeliner (triton-lang#7601)

Adds a new scheduling variant which kicks in for loops which have 2
chained dots and `num_stages==4`. It places the two dots in consecutive
stages so we can interleave operations using the result of the first dot
with both dots in the loop, a pseudo example IR:

```
   %1 = tt.dot ...
   %2 = arith.addf %1, %arg1
   %3 = arith.subf %2, %arg2
   %4 = tt.dot %X, %Y, %3
```
Which could result in the following pseudo schedule (ignoring mem ops)
to interleave with both dots:
```
   stage N,   Cluster0: [%1 = tt.dot, %3 = arith.subf]
   stage N+1, Cluster1: [%4 = tt.dot, %2 = arith.addf]
```

As a first step the schedule splits the op chain between dot1 and dot2
when it encounters an operation which has more than 2 users. This aims
to avoid adding too many loop carried dependencies but does not
guarantee a good work balance between the two clusters. In future PRs we
might make this more sophisticated.

* [AMD] Add scale preshuffling and opSel implementation (triton-lang#7603)

This PR implements test kernel for efficient scale packing for CDNA4
arch as well as opSel for scaled MFMA instructions.

Scaled MFMA instructions expect scale operands as 32-bit values,
even though each individual scale is only 8 bits.
To reduce register usage, we pack 4 scales into a single 32-bit value
and use the opSel field to select the appropriate byte during execution.
Packing is done along the K dimension first. if there aren’t enough
values in K, we continue along the non-K dimension.

---------

Co-authored-by: Ognjen Plavsic <[email protected]>

* [AMD] Enable Pingpong by default on gfx950 arch (triton-lang#7697)

List of enabling conditions
- FP/BF16 GEMM with M,N>64 tilesize when num_stages=3 and num_warps=8
- GEMM using `dot_scaled` with M=N=256 tile size when num_stages=2 and
num_warps=8
- FA with num_stages=4
Only when using async_copy.

* [Backend] Bump to llvm/llvm-project@570885128351 (triton-lang#7291)

This picks up a bug fix for AMDGPU v_permlane_swap:

llvm/llvm-project#144423
Without this fix, the v_permlane_swap is wrongly sunk.

Along the way we need to fix API changes:

Add header file for the class IRBuilder
Add missing default parameter in convertFuncOpToLLVMFuncOp

---------

Co-authored-by: Maksim Levental <[email protected]>
Co-authored-by: Yi Qian <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: Pengzhan Zhao <[email protected]>
Co-authored-by: plognjen <[email protected]>
Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Zeng Wu <[email protected]>
dshi7 pushed a commit to facebookexperimental/triton that referenced this pull request Aug 8, 2025