Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 17 additions & 5 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1086,13 +1086,17 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
}
}
}
if (Slot || LDSDMAStores.size() == NUM_LDS_VGPRS - 1)
if (Slot)
break;
// The slot may not be valid because it can be >= NUM_LDS_VGPRS which
// means the scoreboard cannot track it. We still want to preserve the
// MI in order to check alias information, though.
LDSDMAStores.push_back(&Inst);
Slot = LDSDMAStores.size();
break;
}
setRegScore(FIRST_LDS_VGPR + Slot, T, CurrScore);
if (Slot < NUM_LDS_VGPRS)
setRegScore(FIRST_LDS_VGPR + Slot, T, CurrScore);
if (Slot)
setRegScore(FIRST_LDS_VGPR, T, CurrScore);
Comment on lines +1098 to 1101
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: can Slot ever be 0 when we exit the loop above? It's not clear to me if these "if"s handle that case correctly, if it can ever happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Worst case we'll call setRegScore twice. This will be fixed in #162077

}
Expand Down Expand Up @@ -2010,15 +2014,23 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
if (Ptr && Memop->getAAInfo()) {
const auto &LDSDMAStores = ScoreBrackets.getLDSDMAStores();
for (unsigned I = 0, E = LDSDMAStores.size(); I != E; ++I) {
if (MI.mayAlias(AA, *LDSDMAStores[I], true))
if (MI.mayAlias(AA, *LDSDMAStores[I], true)) {
if ((I + 1) >= NUM_LDS_VGPRS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this up to avoid the AA check if it the limit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to do the AA check if we hit the limit. If we don't alias with any LDSDMAStore, even those above the limit, we don't need to wait

// We didn't have enough slot to track this LDS DMA store, it
// has been tracked using the common RegNo (FIRST_LDS_VGPR).
ScoreBrackets.determineWait(LOAD_CNT, RegNo, Wait);
break;
}

ScoreBrackets.determineWait(LOAD_CNT, RegNo + I + 1, Wait);
}
}
} else {
ScoreBrackets.determineWait(LOAD_CNT, RegNo, Wait);
}
if (Memop->isStore()) {

if (Memop->isStore())
ScoreBrackets.determineWait(EXP_CNT, RegNo, Wait);
}
}

// Loop over use and def operands.
Expand Down
2 changes: 2 additions & 0 deletions llvm/test/CodeGen/AMDGPU/lds-dma-waits.ll
Original file line number Diff line number Diff line change
Expand Up @@ -223,6 +223,7 @@ define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32
; GFX9-NEXT: s_waitcnt vmcnt(2)
; GFX9-NEXT: ds_read_b32 v7, v9 offset:1792
; GFX9-NEXT: ; wave barrier
; GFX9-NEXT: s_waitcnt vmcnt(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this wait needed? Or is the test deliberately using more slots than SIInsertWaitcnts can track, so we have to be conservative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's precisely what I was trying to fix. The wait is required here because we ran out of tracking slots.
#162077 increases the number of tracking slots so this will be improved

; GFX9-NEXT: ds_read_b32 v8, v9 offset:2048
; GFX9-NEXT: ; wave barrier
; GFX9-NEXT: ds_read_b32 v9, v9 offset:2304
Expand Down Expand Up @@ -288,6 +289,7 @@ define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32
; GFX10-NEXT: s_waitcnt vmcnt(2)
; GFX10-NEXT: ds_read_b32 v7, v9 offset:1792
; GFX10-NEXT: ; wave barrier
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: ds_read_b32 v8, v9 offset:2048
; GFX10-NEXT: ; wave barrier
; GFX10-NEXT: ds_read_b32 v9, v9 offset:2304
Expand Down
Loading