Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1278,6 +1278,23 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
if (Opcode == AMDGPU::S_WAITCNT) {
unsigned IEnc = II.getOperand(0).getImm();
AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);

// These pseudo waitcnt instructions are only needed to synchronize DS
// operations with direct LDS loads that use vmcnt. We can safely relax
// them when no outstanding direct LDS loads exist, even if other vmcnt
// events are pending.
if (II.getOpcode() == AMDGPU::S_WAITCNT_DIRECT_LDS_LOAD_soft &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the opcode need to identify specifically direct loads to LDS? We can do it with just S_WAITCNT, right? As far as I can see, the memory model is this: release fences must wait for any direct loads to LDS if those fences carry the LDS address space. Then the opcode should be "AMDGPU::S_WAITCNT_LDS". It's more interesting that actually we only need to do this on release fences and not on pure acquire fences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the waitcnt pass we are not really looking at fences and atomics directly. The idea is to rely on the "MemoryLegalizer" pass to insert soft waitcnt which can be optimized. If they are just regular waitcnt how can I differentiate from other "soft" vmcnt(0) that are added for reasons besides direct load to LDS?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a problem with the current separation of concerns between the memory legalizer and the waitcnt inserter. When the inserter sees a vmcnt(0), it is unsafe to relax it without complete knowledge of why it was inserted. There are two ways we could fix this:

  1. Make the legalizer safe by default, which means it always inserts a zero count wherever a wait of any kind is required. Then make sure that the inserter can track every relevant operation including buffer invalidates. It can then safely relax the zero count based on its scores.
  2. Make the legalizer aggressive by default, which means it always inserts a ~0 count wherever a wait of any kind is required. Then make sure the inserter does the opposite, which is to put a non-trivial count based on scores.

In either case, we only ever need one S_WAITCNT_soft opcode for everything, because the inserter has complete knowledge of all operations that depend on a wait.

I am very much in favour of option 1. But for now, we could maybe modify the legalizer so that it puts a vmcnt(~0) at any workgroup-scoped S_WAITCNT_soft. Then the inserter can change it to vmcnt(0) if it sees direct loads preceding such a soft waitcnt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a problem with the current separation of concerns between the memory legalizer and the waitcnt inserter.

I don't understand why you call this a "problem". Maybe it doesn't work in exactly the way you would like, but I think it does work perfectly well.

The legalizer is safe by default, which means it always inserts a zero count wherever a wait of any kind is required. The inserter can safely relax the zero count based on its scores, but this is implemented without the inserter having to know anything about fences/invalidates/etc specifically; instead, everything the inserter needs to know is in the waitcnt instruction itself.

For most waitcnts, no extra information is needed; it's just a regular waitcnt. For the new case in this patch, the extra information is "this is a wait on vmcnt but I'm only interested in VMEM load-to-lds instructions".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. It's good to know that the legalizer is meant to be safe by default. The problem is that the waitcnt inserter does not have complete knowledge when it relaxes the waitcnt. (This is besides the fact that the inserter is also supposed to restore legality between a load and its use, or a store and its operands, etc). In particular, the legalizer will insert vmcnt(0) after an invalidate, but there is an explicit comment in the inserter which says that it should not be aware of invalidates. That's what makes things difficult. The inserter can safely relax a vmcnt(0) if it also correctly tracks the invalidates. That's the part that is a problem right now.

TrySimplify) {
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
AMDGPU::Waitcnt LDSDirectWait;
ScoreBrackets.determineWait(LOAD_CNT, RegNo, LDSDirectWait);
// Relax waitcnt to only wait on inflight direct LDS loads.
if (LDSDirectWait.LoadCnt > OldWait.LoadCnt) {
OldWait.LoadCnt = LDSDirectWait.LoadCnt;
Modified = true;
}
}

if (TrySimplify)
ScoreBrackets.simplifyWaitcnt(OldWait);
Wait = Wait.combined(OldWait);
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/Target/AMDGPU/SIInstrInfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -1010,6 +1010,7 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
static unsigned getNonSoftWaitcntOpcode(unsigned Opcode) {
switch (Opcode) {
case AMDGPU::S_WAITCNT_soft:
case AMDGPU::S_WAITCNT_DIRECT_LDS_LOAD_soft:
return AMDGPU::S_WAITCNT;
case AMDGPU::S_WAITCNT_VSCNT_soft:
return AMDGPU::S_WAITCNT_VSCNT;
Expand Down
6 changes: 6 additions & 0 deletions llvm/lib/Target/AMDGPU/SOPInstructions.td
Original file line number Diff line number Diff line change
Expand Up @@ -1608,6 +1608,12 @@ let OtherPredicates = [HasImageInsts] in {
def S_WAIT_DSCNT_soft : SOPP_Pseudo <"s_soft_wait_dscnt", (ins s16imm:$simm16), "$simm16">;
def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
}
// Soft waitcnt for direct loads to LDS from global memory. These waits may be
// relaxed or removed entirely based on current in-flight memory operations
// and their relation to these direct LDS loads. For example, if global loads
// to LDS are mixed with global loads not writing to LDS, a wait may only be
// necessary for the LDS-writing loads to synchronize with other LDS operations.
def S_WAITCNT_DIRECT_LDS_LOAD_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instruction is named S_WAITCNT_DIRECT_LDS_LOAD_soft but we don't have a S_WAITCNT_DIRECT_LDS_LOAD. I don't think it's right to call this a "soft" waitcnt. It should have another name, IMO.


def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
[(int_amdgcn_s_sethalt timm:$simm16)]>;
Expand Down
170 changes: 170 additions & 0 deletions llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir
Original file line number Diff line number Diff line change
Expand Up @@ -117,3 +117,173 @@ body: |
S_ENDPGM 0

...

# Soft waitcnt should be honored here.
# GCN-LABEL: name: buffer_load_dword_lds_ds_read_soft_wait
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: S_BARRIER
---
name: buffer_load_dword_lds_ds_read_soft_wait
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3952
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# No need for waitcnt.
# GCN-LABEL: name: buffer_store_lds_dword_ds_read_soft_wait
# GCN: BUFFER_STORE_LDS_DWORD
# GCN-NEXT: S_BARRIER
---
name: buffer_store_lds_dword_ds_read_soft_wait
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_STORE_LDS_DWORD $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(3) poison` + 4), (store (s32) into `ptr addrspace(1) poison` + 4)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3952
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# Soft waitcnt should mean vmcnt(1) before the barrier and vmcnt(0) after.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# No waitcnt before the barrier because counter is too high
# GCN-LABEL: name: buffer_load_dword_lds_ds_read_soft_wait_redundant
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: buffer_load_dword_lds_ds_read_soft_wait_redundant
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# Combine waitcnt.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_repeat
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait_repeat
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_BARRIER
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# Merge waitcnt.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_merge
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait_merge
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3954
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_BARRIER
S_WAITCNT_DIRECT_LDS_LOAD_soft 3952
S_WAITCNT_DIRECT_LDS_LOAD_soft 3952
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...


# Handle the preexisting waitcnt.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_preexisting
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 0
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait_preexisting
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
S_WAITCNT 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_BARRIER
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
S_WAITCNT_DIRECT_LDS_LOAD_soft 3953
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32), addrspace 3)

Should fix all the MMOs to have address space directly and drop the filler poison values

S_ENDPGM 0

...
Loading