71 changes: 42 additions & 29 deletions llvm/docs/AMDGPUUsage.rst
@@ -758,6 +758,12 @@ For example:
enabled will execute correctly but may be less
performant than code generated for XNACK replay
disabled.

dynamic-vgpr TODO Represents the "Dynamic VGPR" hardware mode, introduced in GFX12.
Waves launched in this mode may allocate or deallocate VGPRs
using dedicated instructions, but may not send the DEALLOC_VGPRS
message.

=============== ============================ ==================================================

.. _amdgpu-target-id:
@@ -6014,8 +6020,13 @@ Frame Pointer

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required then all uses of the frame
pointer are replaced with immediate ``0`` offsets.
kernel prolog. On GFX12+, when dynamic VGPRs are enabled, the prolog checks
whether the kernel is running on a compute queue; if so, it reserves enough
scratch space for any dynamic VGPRs that the CWSR trap handler might need to
save. In this case, the frame pointer is initialized to a suitably aligned
offset above this reserved area. If a frame pointer is not
required then all uses of the frame pointer are replaced with immediate ``0``
offsets.
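
The sizing of the reserved area can be sketched as follows. This is an
illustrative computation only, mirroring ``SIFrameLowering``; the register
count, allocation granule, and alignment below are assumed values, not
numbers queried from a real subtarget:

constexpr unsigned alignTo(unsigned Value, unsigned Align) {
  return (Value + Align - 1) / Align * Align;
}

constexpr unsigned AddressableVGPRs = 256; // assumed, GFX12-like target
constexpr unsigned AllocGranule = 16; // assumed; 32 with dynamic-vgpr-block-size-32
constexpr unsigned MaxAlign = 16;     // assumed frame alignment in bytes

// The first VGPR block is saved separately, so reserve room for the rest,
// at 4 bytes per lane per VGPR; the frame pointer starts just above this.
constexpr unsigned Reserved =
    alignTo((AddressableVGPRs - AllocGranule) * 4, MaxAlign);
static_assert(Reserved == 960, "worked example: (256 - 16) * 4 = 960 bytes");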

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

@@ -17127,33 +17138,35 @@ within a map that has been added by the same *vendor-name*.
.. table:: AMDPAL Code Object Hardware Stage Metadata Map
:name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table

========================== ============== ========= ===============================================================
String Key Value Type Required? Description
========================== ============== ========= ===============================================================
".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
".scratch_memory_size" integer Scratch memory size in bytes.
".lds_size" integer Local Data Share size in bytes.
".perf_data_buffer_size" integer Performance data buffer size in bytes.
".vgpr_count" integer Number of VGPRs used.
".agpr_count" integer Number of AGPRs used.
".sgpr_count" integer Number of SGPRs used.
".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
directive to instruct the compiler to limit the VGPR usage to
be less than or equal to the specified value (only set if
different from HW default).
".sgpr_limit" integer SGPR count upper limit (only set if different from HW
default).
".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
3 integers
".wavefront_size" integer Wavefront size (only set if different from HW default).
".uses_uavs" boolean The shader reads or writes UAVs.
".uses_rovs" boolean The shader reads or writes ROVs.
".writes_uavs" boolean The shader writes to one or more UAVs.
".writes_depth" boolean The shader writes out a depth value.
".uses_append_consume" boolean The shader uses append and/or consume operations, either
memory or GDS.
".uses_prim_id" boolean The shader uses PrimID.
========================== ============== ========= ===============================================================
=========================== ============== ========= ===============================================================
String Key Value Type Required? Description
=========================== ============== ========= ===============================================================
".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
".scratch_memory_size" integer Scratch memory size in bytes.
".lds_size" integer Local Data Share size in bytes.
".perf_data_buffer_size" integer Performance data buffer size in bytes.
".vgpr_count" integer Number of VGPRs used.
".agpr_count" integer Number of AGPRs used.
".sgpr_count" integer Number of SGPRs used.
".dynamic_vgpr_saved_count" integer No Number of dynamic VGPRs that can be stored in scratch by the
CWSR trap handler. Only used on GFX12+.
".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
directive to instruct the compiler to limit the VGPR usage to
be less than or equal to the specified value (only set if
different from HW default).
".sgpr_limit" integer SGPR count upper limit (only set if different from HW
default).
".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
3 integers
".wavefront_size" integer Wavefront size (only set if different from HW default).
".uses_uavs" boolean The shader reads or writes UAVs.
".uses_rovs" boolean The shader reads or writes ROVs.
".writes_uavs" boolean The shader writes to one or more UAVs.
".writes_depth" boolean The shader writes out a depth value.
".uses_append_consume" boolean The shader uses append and/or consume operations, either
memory or GDS.
".uses_prim_id" boolean The shader uses PrimID.
=========================== ============== ========= ===============================================================

..

12 changes: 12 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1239,6 +1239,18 @@ def FeatureXF32Insts : SubtargetFeature<"xf32-insts",
"v_mfma_f32_16x16x8_xf32 and v_mfma_f32_32x32x4_xf32"
>;

def FeatureDynamicVGPR : SubtargetFeature <"dynamic-vgpr",
"DynamicVGPR",
"true",
"Enable dynamic VGPR mode"
>;

def FeatureDynamicVGPRBlockSize32 : SubtargetFeature<"dynamic-vgpr-block-size-32",
"DynamicVGPRBlockSize32",
"true",
"Use a block size of 32 for dynamic VGPR allocation (default is 16)"
>;
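
How the block size plays out can be sketched with a small illustrative
helper; this is not an LLVM API, and the rounding is an assumption about
what "block size" means for allocation:

// Illustrative only: dynamic VGPRs are granted in whole blocks of 16
// registers by default, or 32 when dynamic-vgpr-block-size-32 is enabled.
constexpr unsigned dynamicVGPRBlocks(unsigned NumVGPRs, bool BlockSize32) {
  unsigned BlockSize = BlockSize32 ? 32u : 16u;
  return (NumVGPRs + BlockSize - 1) / BlockSize; // round up
}
static_assert(dynamicVGPRBlocks(49, false) == 4, "blocks of 16");
static_assert(dynamicVGPRBlocks(49, true) == 2, "blocks of 32");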

// Dummy feature used to disable assembler instructions.
def FeatureDisable : SubtargetFeature<"",
"FeatureDisable","true",
12 changes: 11 additions & 1 deletion llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -1414,6 +1414,9 @@ static void EmitPALMetadataCommon(AMDGPUPALMetadata *MD,
MD->setHwStage(CC, ".trap_present",
(bool)CurrentProgramInfo.TrapHandlerEnable);
MD->setHwStage(CC, ".excp_en", CurrentProgramInfo.EXCPEnable);

if (ST.isDynamicVGPREnabled())
MD->setComputeRegisters(".dynamic_vgpr_en", true);
}

MD->setHwStage(CC, ".lds_size",
@@ -1436,8 +1439,15 @@ void AMDGPUAsmPrinter::EmitPALMetadata(const MachineFunction &MF,
MD->setEntryPoint(CC, MF.getFunction().getName());
MD->setNumUsedVgprs(CC, CurrentProgramInfo.NumVGPRsForWavesPerEU, Ctx);

// Only set AGPRs for supported devices
// For targets that support dynamic VGPRs, set the number of saved dynamic
// VGPRs (if any) in the PAL metadata.
const GCNSubtarget &STM = MF.getSubtarget<GCNSubtarget>();
if (STM.isDynamicVGPREnabled() &&
MFI->getScratchReservedForDynamicVGPRs() > 0)
MD->setHwStage(CC, ".dynamic_vgpr_saved_count",
MFI->getScratchReservedForDynamicVGPRs() / 4);

// Only set AGPRs for supported devices
if (STM.hasMAIInsts()) {
MD->setNumUsedAgprs(CC, CurrentProgramInfo.NumAccVGPR);
}
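
For concreteness, getScratchReservedForDynamicVGPRs() returns a per-lane
byte count and each VGPR holds 4 bytes per lane, hence the division by 4.
A hedged worked example with an assumed reservation:

// Illustrative only: 960 bytes of per-lane scratch reserved for CWSR means
// 240 dynamic VGPRs can be saved, which is the value that lands in the
// ".dynamic_vgpr_saved_count" PAL metadata entry.
constexpr unsigned savedDynamicVGPRs(unsigned ScratchReservedBytes) {
  return ScratchReservedBytes / 4; // 4 bytes per lane per VGPR
}
static_assert(savedDynamicVGPRs(960) == 240, "worked example");
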
10 changes: 10 additions & 0 deletions llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -1452,6 +1452,16 @@ bool GCNSchedStage::shouldRevertScheduling(unsigned WavesAfter) {
if (WavesAfter < DAG.MinOccupancy)
return true;

// For dynamic VGPR mode, we don't want to waste any VGPR blocks.
if (ST.isDynamicVGPREnabled()) {
unsigned BlocksBefore = AMDGPU::IsaInfo::getAllocatedNumVGPRBlocks(
&ST, PressureBefore.getVGPRNum(false));
unsigned BlocksAfter = AMDGPU::IsaInfo::getAllocatedNumVGPRBlocks(
&ST, PressureAfter.getVGPRNum(false));
if (BlocksAfter > BlocksBefore)
return true;
}

return false;
}
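
A hedged worked example of the check above, assuming a block size of 16: if
a scheduling stage raises VGPR pressure from 48 to 49 registers, the
allocation grows from 3 blocks to 4, so the schedule is reverted even though
occupancy is unchanged.

// Illustrative stand-in for getAllocatedNumVGPRBlocks, assuming blocks of 16.
constexpr unsigned allocatedBlocks(unsigned VGPRs) { return (VGPRs + 15) / 16; }

// Mirrors the revert condition: any growth in allocated blocks is wasteful.
constexpr bool wastesABlock(unsigned Before, unsigned After) {
  return allocatedBlocks(After) > allocatedBlocks(Before);
}
static_assert(wastesABlock(48, 49), "3 blocks -> 4 blocks: revert");
static_assert(!wastesABlock(33, 47), "both fit in 3 blocks: keep");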

4 changes: 4 additions & 0 deletions llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -190,6 +190,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
/// indicates a lack of S_CLAUSE support.
unsigned MaxHardClauseLength = 0;
bool SupportsSRAMECC = false;
bool DynamicVGPR = false;
bool DynamicVGPRBlockSize32 = false;

// This should not be used directly. 'TargetID' tracks the dynamic settings
// for SRAMECC.
@@ -1647,6 +1649,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
return true;
}

bool isDynamicVGPREnabled() const { return DynamicVGPR; }

bool requiresDisjointEarlyClobberAndUndef() const override {
// AMDGPU doesn't care if early-clobber and undef operands are allocated
// to the same register.
1 change: 1 addition & 0 deletions llvm/lib/Target/AMDGPU/SIDefines.h
@@ -552,6 +552,7 @@ enum Id { // HwRegCode, (6) [5:0]

enum Offset : unsigned { // Offset, (5) [10:6]
OFFSET_MEM_VIOL = 8,
OFFSET_ME_ID = 8,
};

enum ModeRegisterMasks : uint32_t {
66 changes: 59 additions & 7 deletions llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -691,17 +691,61 @@ void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,
}
assert(ScratchWaveOffsetReg || !PreloadedScratchWaveOffsetReg);

if (hasFP(MF)) {
unsigned Offset = FrameInfo.getStackSize() * getScratchScaleFactor(ST);
if (!mayReserveScratchForCWSR(MF)) {
if (hasFP(MF)) {
Register FPReg = MFI->getFrameOffsetReg();
assert(FPReg != AMDGPU::FP_REG);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), FPReg).addImm(0);
}

if (requiresStackPointerReference(MF)) {
Register SPReg = MFI->getStackPtrOffsetReg();
assert(SPReg != AMDGPU::SP_REG);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg).addImm(Offset);
}
} else {
// We need to check if we're on a compute queue - if we are, then the CWSR
// trap handler may need to store some VGPRs on the stack. The first VGPR
// block is saved separately, so we only need to allocate space for any
// additional VGPR blocks used. For now, we will make sure there's enough
// room for the theoretical maximum number of VGPRs that can be allocated.
// FIXME: Figure out if the shader uses fewer VGPRs in practice.
assert(hasFP(MF));
Register FPReg = MFI->getFrameOffsetReg();
assert(FPReg != AMDGPU::FP_REG);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), FPReg).addImm(0);
}

if (requiresStackPointerReference(MF)) {
Register SPReg = MFI->getStackPtrOffsetReg();
assert(SPReg != AMDGPU::SP_REG);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg)
.addImm(FrameInfo.getStackSize() * getScratchScaleFactor(ST));
unsigned VGPRSize =
llvm::alignTo((ST.getAddressableNumVGPRs() -
AMDGPU::IsaInfo::getVGPRAllocGranule(&ST)) *
4,
FrameInfo.getMaxAlign());
MFI->setScratchReservedForDynamicVGPRs(VGPRSize);

BuildMI(MBB, I, DL, TII->get(AMDGPU::S_GETREG_B32), FPReg)
.addImm(AMDGPU::Hwreg::HwregEncoding::encode(
AMDGPU::Hwreg::ID_HW_ID2, AMDGPU::Hwreg::OFFSET_ME_ID, 2));
// The MicroEngine ID is 0 for the graphics queue, and 1 or 2 for compute
// (3 is unused, so we ignore it). Unfortunately, S_GETREG doesn't set
// SCC, so we need to check for 0 manually.
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMP_LG_U32)).addImm(0).addReg(FPReg);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMOVK_I32), FPReg).addImm(VGPRSize);
if (requiresStackPointerReference(MF)) {
// If at least one of the constants can be inlined, then we can use
// s_cselect. Otherwise, use a mov and cmovk.
if (AMDGPU::isInlinableLiteral32(Offset, ST.hasInv2PiInlineImm()) ||
AMDGPU::isInlinableLiteral32(Offset + VGPRSize,
ST.hasInv2PiInlineImm())) {
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CSELECT_B32), SPReg)
.addImm(Offset + VGPRSize)
.addImm(Offset);
} else {
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg).addImm(Offset);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMOVK_I32), SPReg)
.addImm(Offset + VGPRSize);
}
}
}
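
As background for the S_GETREG_B32 above: the hwreg() immediate packs the
register id into bits [5:0], the bit offset into [10:6], and the field width
minus one into [15:11]. A hedged sketch of that packing (the concrete id
value of HW_ID2 is deliberately left symbolic here):

// Illustrative re-implementation of the hwreg() immediate encoding.
constexpr unsigned encodeHwreg(unsigned Id, unsigned Offset, unsigned Width) {
  return Id | (Offset << 6) | ((Width - 1) << 11);
}
// ME_ID occupies two bits starting at bit 8 of HW_ID2, so the prolog reads
// it with the equivalent of encodeHwreg(IdHwId2, /*Offset=*/8, /*Width=*/2).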

bool NeedsFlatScratchInit =
@@ -1831,9 +1875,17 @@ bool SIFrameLowering::hasFPImpl(const MachineFunction &MF) const {
return frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() ||
MF.getSubtarget<GCNSubtarget>().getRegisterInfo()->hasStackRealignment(
MF) ||
mayReserveScratchForCWSR(MF) ||
MF.getTarget().Options.DisableFramePointerElim(MF);
}

bool SIFrameLowering::mayReserveScratchForCWSR(
const MachineFunction &MF) const {
return MF.getSubtarget<GCNSubtarget>().isDynamicVGPREnabled() &&
AMDGPU::isEntryFunctionCC(MF.getFunction().getCallingConv()) &&
AMDGPU::isCompute(MF.getFunction().getCallingConv());
}

// This is essentially a reduced version of hasFP for entry functions. Since the
// stack pointer is known 0 on entry to kernels, we never really need an FP
// register. We may need to initialize the stack pointer depending on the frame
4 changes: 4 additions & 0 deletions llvm/lib/Target/AMDGPU/SIFrameLowering.h
@@ -86,6 +86,10 @@ class SIFrameLowering final : public AMDGPUFrameLowering {

public:
bool requiresStackPointerReference(const MachineFunction &MF) const;

// Returns true if the function may need to reserve space on the stack for the
// CWSR trap handler.
bool mayReserveScratchForCWSR(const MachineFunction &MF) const;
};

} // end namespace llvm
60 changes: 37 additions & 23 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -1647,17 +1647,21 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
(MI.isReturn() && MI.isCall() && !callWaitsOnFunctionEntry(MI))) {
Wait = Wait.combined(WCG->getAllZeroWaitcnt(/*IncludeVSCnt=*/false));
}
// Identify S_ENDPGM instructions which may have to wait for outstanding VMEM
// stores. In this case it can be useful to send a message to explicitly
// release all VGPRs before the stores have completed, but it is only safe to
// do this if:
// * there are no outstanding scratch stores
// * we are not in Dynamic VGPR mode
// In dynamic VGPR mode, we want to release the VGPRs before the wave exits.
// Technically the hardware will do this on its own if we don't, but that
// might cost extra cycles compared to doing it explicitly.
// When not in dynamic VGPR mode, identify S_ENDPGM instructions which may
// have to wait for outstanding VMEM stores. In this case it can be useful to
// send a message to explicitly release all VGPRs before the stores have
// completed, but it is only safe to do this if there are no outstanding
// scratch stores.
else if (MI.getOpcode() == AMDGPU::S_ENDPGM ||
MI.getOpcode() == AMDGPU::S_ENDPGM_SAVED) {
if (ST->getGeneration() >= AMDGPUSubtarget::GFX11 && !WCG->isOptNone() &&
ScoreBrackets.getScoreRange(STORE_CNT) != 0 &&
!ScoreBrackets.hasPendingEvent(SCRATCH_WRITE_ACCESS))
if (!WCG->isOptNone() &&
(ST->isDynamicVGPREnabled() ||
(ST->getGeneration() >= AMDGPUSubtarget::GFX11 &&
ScoreBrackets.getScoreRange(STORE_CNT) != 0 &&
!ScoreBrackets.hasPendingEvent(SCRATCH_WRITE_ACCESS))))
ReleaseVGPRInsts.insert(&MI);
}
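
The condition above can be distilled into a hedged predicate (illustrative
only; the real code queries WCG, ST, and ScoreBrackets as shown):

// Sketch: when is an S_ENDPGM recorded for VGPR release?
bool mayReleaseVGPRsAtEndpgm(bool OptNone, bool DynamicVGPR, bool Gfx11Plus,
                             bool PendingVMEMStores,
                             bool PendingScratchStores) {
  if (OptNone)
    return false;
  // Dynamic VGPR mode always releases; otherwise only on GFX11+ with
  // outstanding VMEM stores and no outstanding scratch stores.
  return DynamicVGPR ||
         (Gfx11Plus && PendingVMEMStores && !PendingScratchStores);
}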
// Resolve vm waits before gs-done.
@@ -2610,26 +2614,36 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
}
}

// Insert DEALLOC_VGPR messages before previously identified S_ENDPGM
// instructions.
// Deallocate the VGPRs before previously identified S_ENDPGM instructions.
// This is done in different ways depending on how the VGPRs were allocated
// (i.e. whether we're in dynamic VGPR mode or not).
// Skip deallocation if the kernel is waveslot limited rather than VGPR
// limited: a short waveslot-limited kernel runs slower with the deallocation.
if (!ReleaseVGPRInsts.empty() &&
(MF.getFrameInfo().hasCalls() ||
ST->getOccupancyWithNumVGPRs(
TRI->getNumUsedPhysRegs(*MRI, AMDGPU::VGPR_32RegClass)) <
AMDGPU::IsaInfo::getMaxWavesPerEU(ST))) {
if (ST->isDynamicVGPREnabled()) {
for (MachineInstr *MI : ReleaseVGPRInsts) {
if (ST->requiresNopBeforeDeallocVGPRs()) {
BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
TII->get(AMDGPU::S_NOP))
.addImm(0);
}
BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
TII->get(AMDGPU::S_SENDMSG))
.addImm(AMDGPU::SendMsg::ID_DEALLOC_VGPRS_GFX11Plus);
TII->get(AMDGPU::S_ALLOC_VGPR))
.addImm(0);
Modified = true;
}
} else {
if (!ReleaseVGPRInsts.empty() &&
(MF.getFrameInfo().hasCalls() ||
ST->getOccupancyWithNumVGPRs(
TRI->getNumUsedPhysRegs(*MRI, AMDGPU::VGPR_32RegClass)) <
AMDGPU::IsaInfo::getMaxWavesPerEU(ST))) {
for (MachineInstr *MI : ReleaseVGPRInsts) {
if (ST->requiresNopBeforeDeallocVGPRs()) {
BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
TII->get(AMDGPU::S_NOP))
.addImm(0);
}
BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
TII->get(AMDGPU::S_SENDMSG))
.addImm(AMDGPU::SendMsg::ID_DEALLOC_VGPRS_GFX11Plus);
Modified = true;
}
}
}
ReleaseVGPRInsts.clear();
PreheadersToFlush.clear();
13 changes: 13 additions & 0 deletions llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
@@ -455,6 +455,10 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
unsigned NumSpilledSGPRs = 0;
unsigned NumSpilledVGPRs = 0;

// The size of the scratch space reserved for the CWSR trap handler to spill
// some of the dynamic VGPRs.
unsigned ScratchReservedForDynamicVGPRs = 0;

// Tracks information about user SGPRs that will be setup by hardware which
// will apply to all wavefronts of the grid.
GCNUserSGPRUsageInfo UserSGPRInfo;
@@ -780,6 +784,15 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
BytesInStackArgArea = Bytes;
}

// This is only used if we need to save any dynamic VGPRs in scratch.
unsigned getScratchReservedForDynamicVGPRs() const {
return ScratchReservedForDynamicVGPRs;
}

void setScratchReservedForDynamicVGPRs(unsigned Size) {
ScratchReservedForDynamicVGPRs = Size;
}

// Add user SGPRs.
Register addPrivateSegmentBuffer(const SIRegisterInfo &TRI);
Register addDispatchPtr(const SIRegisterInfo &TRI);