Skip to content
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 73 additions & 7 deletions llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -840,8 +840,16 @@ void SILoadStoreOptimizer::CombineInfo::setMI(MachineBasicBlock::iterator MI,
Offset = I->getOperand(OffsetIdx).getImm();
}

if (InstClass == TBUFFER_LOAD || InstClass == TBUFFER_STORE)
if (InstClass == TBUFFER_LOAD || InstClass == TBUFFER_STORE) {
Format = LSO.TII->getNamedOperand(*I, AMDGPU::OpName::format)->getImm();
const AMDGPU::GcnBufferFormatInfo *Info =
AMDGPU::getGcnBufferFormatInfo(Format, *LSO.STM);

// Use 2-byte element size if the tbuffer format is 16-bit.
// Use 1-byte element size if the tbuffer format is 8-bit.
if (Info)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can getGcnBufferFormatInfo ever fail here? Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will cause some lit tests to fail. I think this is a problem with the test itself, for example, if we run it with gfx900 but the test is meant to target gfx10, the format won't be found and Info will be null. This will trigger failures, like in gfx10_tbuffer_load_x_xyz when run on gfx900.
So I think it makes sense to keep the null check for compatibility. right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really we ought to split the tests into separate files for gfx9 and gfx10.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I separate this and create a new NFC PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think separating merge-tbuffer.mir makes sense, but we still have these lit tests that need to be separated:

Failed Tests (6):
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.store.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.tbuffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.struct.tbuffer.load.ll

Should I continue to do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to change this pass so that if !Info then the instruction is not added to the list of mergeable instructions. I.e. either change setMI so that it can return failure, or check that getGcnBufferFormatInfo succeeds before calling setMI. I think this is safer than treating unknown formats as if they are 32-bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jay, I tried changing setMI to return a bool, but it causes some DS_Load tests to fail. Also, if we want to check getGcnBufferFormatInfo before calling setMI, we would need to check whether the instruction is a tbuffer load/store first , otherwise it could also affect DS_Load behavior.
I believe the current approach is still safe. Even if Info is null and the instruction is added to the mergeable list, we still check whether the format info is valid before merging. If it's null, the instruction won't be merged.
So from a correctness standpoint, I think the current logic is safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jay, I’ve already updated the patch. I now check that getGcnBufferFormatInfo succeeds before calling setMI. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now you don't need the if (Info) check here, right?

And there are similar checks in offsetsCanBeCombined that can be removed.

EltSize = Info->BitsPerComp / 8;
}

Width = getOpcodeWidth(*I, *LSO.TII);

Expand Down Expand Up @@ -1060,13 +1068,46 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
Info0->NumFormat != Info1->NumFormat)
return false;

// TODO: Should be possible to support more formats, but if format loads
// are not dword-aligned, the merged load might not be valid.
Comment on lines -1063 to -1064
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will your patch merge two 16-bit loads at offsets 2 and 4, into a single 32-bit load at offset 2?

If it does that then the merged load is not dword aligned. Is that allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I understand your concern. I’ve verified section 9.5 “Alignment” of the RDNA 3 Shader Instruction Set Architecture manual, which states:

Formatted ops such as BUFFER_LOAD_FORMAT_* must be aligned as follows:
• 1-byte formats → 1-byte alignment
• 2-byte formats → 2-byte alignment
• 4-byte and larger formats → 4-byte alignment

I’ve therefore added an explicit alignment check and a new Lit test, gfx11_tbuffer_load_x_off2_off4_16bit_no_merge.

Reference:
https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf

if (Info0->BitsPerComp != 32)
// For 8-bit or 16-bit formats there is no 3-component variant.
// If NumCombinedComponents is 3, try the 4-component format and use XYZ.
// Example:
// tbuffer_load_format_x + tbuffer_load_format_x + tbuffer_load_format_x
// ==> tbuffer_load_format_xyz with format:[BUF_FMT_16_16_16_16_SNORM]
unsigned NumCombinedComponents = CI.Width + Paired.Width;
unsigned CombinedBufferFormat =
getBufferFormatWithCompCount(CI.Format, NumCombinedComponents, STI);
if (CombinedBufferFormat == 0 && NumCombinedComponents == 3) {
if (Info0->BitsPerComp == 8 || Info0->BitsPerComp == 16) {
unsigned TryFormat = getBufferFormatWithCompCount(CI.Format, 4, STI);
if (!TryFormat)
return false;
CombinedBufferFormat = TryFormat;
NumCombinedComponents = 4;
}
}

if (CombinedBufferFormat == 0)
return false;

// Merge only when the two access ranges are strictly back-to-back,
// any gap or overlap can over-write data or leave holes.
unsigned BytePerComp = Info0->BitsPerComp / 8;
unsigned ElemIndex0 = CI.Offset / BytePerComp;
unsigned ElemIndex1 = Paired.Offset / BytePerComp;
if (ElemIndex0 + CI.Width != ElemIndex1 &&
ElemIndex1 + Paired.Width != ElemIndex0)
return false;

if (getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, STI) == 0)
// 1-byte formats require 1-byte alignment.
// 2-byte formats require 2-byte alignment.
// 4-byte and larger formats require 4-byte alignment.
unsigned MergedBytes = BytePerComp * NumCombinedComponents;
unsigned RequiredAlign = std::min(MergedBytes, 4u);
unsigned MinOff = std::min(CI.Offset, Paired.Offset);
if (MinOff % RequiredAlign != 0)
return false;

return true;
}

uint32_t EltOffset0 = CI.Offset / CI.EltSize;
Expand Down Expand Up @@ -1596,8 +1637,14 @@ MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferLoadPair(
if (Regs.VAddr)
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));

// For 8-bit or 16-bit tbuffer formats there is no 3-component encoding.
// If the combined count is 3 (e.g. X+X+X or XY+X), promote to 4 components
// and use XYZ of XYZW to enable the merge.
unsigned NumCombinedComponents = CI.Width + Paired.Width;
if (NumCombinedComponents == 3 && (CI.EltSize == 1 || CI.EltSize == 2))
NumCombinedComponents = 4;
unsigned JoinedFormat =
getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, *STM);
getBufferFormatWithCompCount(CI.Format, NumCombinedComponents, *STM);

// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()
Expand Down Expand Up @@ -1639,8 +1686,14 @@ MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferStorePair(
if (Regs.VAddr)
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));

// For 8-bit or 16-bit tbuffer formats there is no 3-component encoding.
// If the combined count is 3 (e.g. X+X+X or XY+X), promote to 4 components
// and use XYZ of XYZW to enable the merge.
unsigned NumCombinedComponents = CI.Width + Paired.Width;
if (NumCombinedComponents == 3 && (CI.EltSize == 1 || CI.EltSize == 2))
NumCombinedComponents = 4;
unsigned JoinedFormat =
getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, *STM);
getBufferFormatWithCompCount(CI.Format, NumCombinedComponents, *STM);

// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()
Expand Down Expand Up @@ -2353,6 +2406,19 @@ SILoadStoreOptimizer::collectMergeableInsts(
if (Swizzled != -1 && MI.getOperand(Swizzled).getImm())
continue;

if (InstClass == TBUFFER_LOAD || InstClass == TBUFFER_STORE) {
const MachineOperand *Fmt =
TII->getNamedOperand(MI, AMDGPU::OpName::format);
if (!Fmt) {
LLVM_DEBUG(dbgs() << "Skip tbuffer without format operand: " << MI);
continue;
}
if (!AMDGPU::getGcnBufferFormatInfo(Fmt->getImm(), *STM)) {
LLVM_DEBUG(dbgs() << "Skip tbuffer with unknown format: " << MI);
continue;
}
}

CombineInfo CI;
CI.setMI(MI, *this);
CI.Order = Order++;
Expand Down
Loading