[AMDGPU] Reschedule loads in clauses to improve throughput #102595
Conversation
The implementation is not entirely complete; however, I would be interested in feedback, specifically on whether we think this is "OK" from a memory model perspective.

Can you explain what kind of rescheduling you are doing? Why would the regular scheduler not already have put the loads into an optimal order?
Memory operation clustering enforces ordering in two ways. Below is an example of the output I am seeing, in three variants: the default output, reordering while clustering is enabled, and this patch (reordering after clustering+RA).
Ah, yeah, I think that is part of the scheduler's "do no harm" policy, i.e. don't change anything unless you have a good reason to do so. For "This patch (reorder after clustering+RA)", what am I supposed to be looking at? Why is it better than the other examples?

Stepping back a bit, implementing clustering by adding edges to the DAG seems like a bad idea, since DAG edges are directional, so that forces you to pick an ordering up-front. It would be better if clustering only caused the loads to be scheduled adjacently, and all the other scheduler heuristics could still influence which order they go in.
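For context on the mechanism being discussed: in-tree load clustering is implemented as a DAG mutation (BaseMemOpClusterMutation in llvm/lib/CodeGen/MachineScheduler.cpp). The sketch below is a simplified paraphrase, not the upstream code; MemOpInfo and shouldCluster here are stand-ins for the real structures and the target's shouldClusterMemOps hook. The point is that the Cluster edge is directional, so the pairwise order is fixed at the moment the edge is added:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/CodeGen/ScheduleDAG.h"
using namespace llvm;

namespace {
// Simplified stand-in for BaseMemOpClusterMutation::MemOpInfo.
struct MemOpInfo {
  SUnit *SU;
};

// Stand-in for the target's shouldClusterMemOps hook.
bool shouldCluster(const MemOpInfo &A, const MemOpInfo &B);

void clusterNeighboringMemOps(ArrayRef<MemOpInfo> MemOps, ScheduleDAGMI *DAG) {
  for (unsigned I = 1, E = MemOps.size(); I != E; ++I) {
    if (!shouldCluster(MemOps[I - 1], MemOps[I]))
      continue;
    // A Cluster edge is an ordinary directional DAG edge: the first load
    // must now be scheduled before the second, so the relative order is
    // decided here, before the main scheduling heuristics run.
    DAG->addEdge(MemOps[I].SU, SDep(MemOps[I - 1].SU, SDep::Cluster));
  }
}
} // namespace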
I am not against enabling this. I have not tested it extensively, but I have the impression that this kind of reordering can, in some cases, give a performance improvement of a few percentage points.
After clauses are formed, their internal loads can be reordered to facilitate some additional opportunities for overlapping computation. This late-stage rescheduling causes no change in register pressure.
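As a standalone illustration of the ordering key used by the patch (my own minimal sketch with hypothetical types, not the patch itself): each load's sort key packs the first-use distance of its result into the high bits and its original clause position into the low bits, so loads whose results are needed soonest move to the front while ties keep their native order:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Load {
  unsigned UseDistance; // distance (in instructions) to the first use of the def
  unsigned NativeOrder; // original position inside the clause
};

// Same key construction as the patch: use distance in the high bits,
// native order in the low bits as a tie-breaker, so the keys are unique
// and a plain sort preserves the original order among ties.
static std::vector<unsigned> reorder(const std::vector<Load> &Clause) {
  std::vector<std::pair<uint32_t, unsigned>> Keys;
  for (unsigned I = 0; I != Clause.size(); ++I)
    Keys.push_back({(Clause[I].UseDistance << 16) | Clause[I].NativeOrder, I});
  std::sort(Keys.begin(), Keys.end());
  std::vector<unsigned> Order;
  for (const auto &K : Keys)
    Order.push_back(K.second); // clause indices in their new issue order
  return Order;
}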
Cleaned this up and updated to cover all load types, not just MIMG.
@llvm/pr-subscribers-llvm-globalisel

Author: Carl Ritson (perlfu)

Changes

After clauses are formed, their internal loads can be reordered to facilitate some additional opportunities for overlapping computation.

Patch is 1.82 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/102595.diff

70 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIPostRABundler.cpp b/llvm/lib/Target/AMDGPU/SIPostRABundler.cpp
index 5720b978aada0..80cca7bcfde9c 100644
--- a/llvm/lib/Target/AMDGPU/SIPostRABundler.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPostRABundler.cpp
@@ -17,6 +17,7 @@
#include "GCNSubtarget.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
+#include <deque>
using namespace llvm;
@@ -50,6 +51,7 @@ class SIPostRABundler {
bool run(MachineFunction &MF);
private:
+ const SIInstrInfo *TII = nullptr;
const SIRegisterInfo *TRI;
SmallSet<Register, 16> Defs;
@@ -60,6 +62,9 @@ class SIPostRABundler {
bool isBundleCandidate(const MachineInstr &MI) const;
bool isDependentLoad(const MachineInstr &MI) const;
bool canBundle(const MachineInstr &MI, const MachineInstr &NextMI) const;
+ void reorderLoads(MachineBasicBlock &MBB,
+ MachineBasicBlock::instr_iterator &BundleStart,
+ MachineBasicBlock::instr_iterator Next);
};
constexpr uint64_t MemFlags = SIInstrFlags::MTBUF | SIInstrFlags::MUBUF |
@@ -129,6 +134,141 @@ bool SIPostRABundler::canBundle(const MachineInstr &MI,
!isDependentLoad(NextMI));
}
+static Register getDef(MachineInstr &MI) {
+ assert(MI.getNumExplicitDefs() > 0);
+ return MI.defs().begin()->getReg();
+}
+
+void SIPostRABundler::reorderLoads(
+ MachineBasicBlock &MBB, MachineBasicBlock::instr_iterator &BundleStart,
+ MachineBasicBlock::instr_iterator Next) {
+ // Don't reorder ALU, store or scalar clauses.
+ if (!BundleStart->mayLoad() || BundleStart->mayStore() ||
+ SIInstrInfo::isSMRD(*BundleStart) || !BundleStart->getNumExplicitDefs())
+ return;
+
+ // Search to find the usage distance of each defined register in the clause.
+  const unsigned SearchDistance = std::max<size_t>(Defs.size(), 100);
+ SmallDenseMap<Register, unsigned> UseDistance;
+ unsigned MaxDistance = 0;
+ for (MachineBasicBlock::iterator SearchI = Next;
+ SearchI != MBB.end() && MaxDistance < SearchDistance &&
+ UseDistance.size() < Defs.size();
+ ++SearchI, ++MaxDistance) {
+ for (Register Reg : Defs) {
+ if (UseDistance.contains(Reg))
+ continue;
+ if (SearchI->readsRegister(Reg, TRI))
+ UseDistance[Reg] = MaxDistance;
+ }
+ }
+
+ if (UseDistance.empty())
+ return;
+
+ LLVM_DEBUG(dbgs() << "Try bundle reordering\n");
+
+  // Build the schedule based on the use distance of each defined register.
+  // Attempt to preserve the existing order (NativeOrder) where possible.
+ std::deque<std::pair<MachineInstr *, unsigned>> Schedule;
+ unsigned NativeOrder = 0, LastOrder = 0;
+ bool Reordered = false;
+ for (auto II = BundleStart; II != Next; ++II, ++NativeOrder) {
+ // Bail out if we encounter anything that seems risky to reorder.
+ if (!II->getNumExplicitDefs() || II->isKill() ||
+ llvm::any_of(II->memoperands(), [&](const MachineMemOperand *MMO) {
+ return MMO->isAtomic() || MMO->isVolatile();
+ })) {
+ LLVM_DEBUG(dbgs() << " Abort\n");
+ return;
+ }
+
+ Register Reg = getDef(*II);
+ unsigned NewOrder =
+ UseDistance.contains(Reg) ? UseDistance[Reg] : MaxDistance;
+ LLVM_DEBUG(dbgs() << " Order: " << NewOrder << "," << NativeOrder
+ << ", MI: " << *II);
+ unsigned Order = (NewOrder << 16 | NativeOrder);
+ Schedule.emplace_back(&*II, Order);
+ Reordered |= Order < LastOrder;
+ LastOrder = Order;
+ }
+
+ // No reordering found.
+ if (!Reordered) {
+ LLVM_DEBUG(dbgs() << " No changes\n");
+ return;
+ }
+
+ // Apply sort on new ordering.
+ std::sort(Schedule.begin(), Schedule.end(),
+ [](std::pair<MachineInstr *, unsigned> A,
+ std::pair<MachineInstr *, unsigned> B) {
+ return A.second < B.second;
+ });
+
+ // Rebuild clause order.
+  // Schedule holds the ideal order for the load operations; however, each def
+ // can only be scheduled when it will no longer clobber any uses.
+ SmallVector<MachineInstr *> Clause;
+ while (!Schedule.empty()) {
+    // Try to schedule the next instruction in the schedule.
+    // Iterate until we find something that can be placed.
+ auto It = Schedule.begin();
+ while (It != Schedule.end()) {
+ MachineInstr *MI = It->first;
+ LLVM_DEBUG(dbgs() << "Try schedule: " << *MI);
+
+ if (MI->getNumExplicitDefs() == 0) {
+ // No defs, always schedule.
+ LLVM_DEBUG(dbgs() << " Trivially OK\n");
+ break;
+ }
+
+ Register DefReg = getDef(*MI);
+ bool DefRegHasUse = false;
+ for (auto SearchIt = std::next(It);
+ SearchIt != Schedule.end() && !DefRegHasUse; ++SearchIt)
+ DefRegHasUse = SearchIt->first->readsRegister(DefReg, TRI);
+ if (DefRegHasUse) {
+ // A future use would be clobbered; try next instruction in the
+ // schedule.
+ LLVM_DEBUG(dbgs() << " Clobbers uses\n");
+ It++;
+ continue;
+ }
+
+ // Safe to schedule.
+ LLVM_DEBUG(dbgs() << " OK!\n");
+ break;
+ }
+
+    // Place the scheduled instruction into the clause order.
+ assert(It != Schedule.end());
+ MachineInstr *MI = It->first;
+ Schedule.erase(It);
+ Clause.push_back(MI);
+
+ // Clear kill flags for later uses.
+ for (auto &Use : MI->all_uses()) {
+ if (!Use.isReg() || !Use.isKill())
+ continue;
+ Register UseReg = Use.getReg();
+ if (llvm::any_of(Schedule, [&](std::pair<MachineInstr *, unsigned> &SI) {
+ return SI.first->readsRegister(UseReg, TRI);
+ }))
+ Use.setIsKill(false);
+ }
+ }
+
+ // Apply order to instructions.
+ for (MachineInstr *MI : Clause)
+ MI->moveBefore(&*Next);
+
+ // Update start of bundle.
+ BundleStart = Clause[0]->getIterator();
+}
+
bool SIPostRABundlerLegacy::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction()))
return false;
@@ -143,6 +283,8 @@ PreservedAnalyses SIPostRABundlerPass::run(MachineFunction &MF,
bool SIPostRABundler::run(MachineFunction &MF) {
+ const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+ TII = ST.getInstrInfo();
TRI = MF.getSubtarget<GCNSubtarget>().getRegisterInfo();
BitVector BundleUsedRegUnits(TRI->getNumRegUnits());
BitVector KillUsedRegUnits(TRI->getNumRegUnits());
@@ -170,7 +312,7 @@ bool SIPostRABundler::run(MachineFunction &MF) {
assert(Defs.empty());
if (I->getNumExplicitDefs() != 0)
- Defs.insert(I->defs().begin()->getReg());
+ Defs.insert(getDef(*I));
MachineBasicBlock::instr_iterator BundleStart = I;
MachineBasicBlock::instr_iterator BundleEnd = I;
@@ -182,7 +324,7 @@ bool SIPostRABundler::run(MachineFunction &MF) {
if (canBundle(*BundleEnd, *I)) {
BundleEnd = I;
if (I->getNumExplicitDefs() != 0)
- Defs.insert(I->defs().begin()->getReg());
+ Defs.insert(getDef(*I));
++ClauseLength;
} else if (!I->isMetaInstruction() ||
I->getOpcode() == AMDGPU::SCHED_BARRIER) {
@@ -234,6 +376,7 @@ bool SIPostRABundler::run(MachineFunction &MF) {
BundleUsedRegUnits.reset();
}
+ reorderLoads(MBB, BundleStart, Next);
finalizeBundle(MBB, BundleStart, Next);
}
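To restate the rebuild loop in the hunk above in isolation (a minimal sketch with hypothetical Entry and reads helpers; the real code walks MachineInstr operands): repeatedly emit the earliest entry in the desired order whose destination register is not read by any entry still waiting behind it, since after register allocation a later load in the clause may read that same physical register as an address input. The scan always terminates because the last remaining entry has nothing behind it and so trivially qualifies:

#include <vector>

// Hypothetical simplified view of a clause entry: the register a load
// writes and the registers it reads (address operands, etc.).
struct Entry {
  unsigned DefReg;
  std::vector<unsigned> UseRegs;
};

static bool reads(const Entry &E, unsigned Reg) {
  for (unsigned R : E.UseRegs)
    if (R == Reg)
      return true;
  return false;
}

// Selection rule from the rebuild loop: pick the first entry whose def
// would not clobber an input still needed by an entry scheduled later.
static unsigned pickNext(const std::vector<Entry> &Schedule) {
  for (unsigned I = 0; I != Schedule.size(); ++I) {
    bool Clobbers = false;
    for (unsigned J = I + 1; J != Schedule.size() && !Clobbers; ++J)
      Clobbers = reads(Schedule[J], Schedule[I].DefReg);
    if (!Clobbers)
      return I; // safe: no pending entry still reads this def
  }
  // Unreachable: the final entry never clobbers anything behind it,
  // mirroring the assert in the patch that a pick always exists.
  return 0;
}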
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
index b67080bd4798d..c04f86391c44b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
@@ -716,17 +716,17 @@ define void @add_v11i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
; GFX9-LABEL: add_v11i16:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT: global_load_dwordx4 v[6:9], v[0:1], off
; GFX9-NEXT: global_load_ushort v14, v[0:1], off offset:16
; GFX9-NEXT: global_load_ushort v15, v[2:3], off offset:16
+; GFX9-NEXT: global_load_dwordx4 v[6:9], v[0:1], off
; GFX9-NEXT: global_load_dwordx4 v[10:13], v[2:3], off
; GFX9-NEXT: global_load_ushort v16, v[2:3], off offset:20
; GFX9-NEXT: global_load_ushort v17, v[0:1], off offset:20
; GFX9-NEXT: global_load_ushort v18, v[0:1], off offset:18
; GFX9-NEXT: global_load_ushort v19, v[2:3], off offset:18
-; GFX9-NEXT: s_waitcnt vmcnt(6)
+; GFX9-NEXT: s_waitcnt vmcnt(7)
; GFX9-NEXT: v_and_b32_e32 v14, 0xffff, v14
-; GFX9-NEXT: s_waitcnt vmcnt(5)
+; GFX9-NEXT: s_waitcnt vmcnt(6)
; GFX9-NEXT: v_and_b32_e32 v15, 0xffff, v15
; GFX9-NEXT: s_waitcnt vmcnt(4)
; GFX9-NEXT: v_pk_add_u16 v0, v6, v10
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-fma-mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-fma-mul.ll
index 6ea0a9446ff9d..7fca4d628d023 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-fma-mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-fma-mul.ll
@@ -750,20 +750,20 @@ define <4 x double> @test_f64_add_mul(<4 x double> %a, <4 x double> %b, <4 x dou
; GFX10-CONTRACT: ; %bb.0: ; %.entry
; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-CONTRACT-NEXT: s_clause 0x8
-; GFX10-CONTRACT-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-CONTRACT-NEXT: buffer_load_dword v32, off, s[0:3], s32 offset:4
; GFX10-CONTRACT-NEXT: buffer_load_dword v33, off, s[0:3], s32 offset:8
; GFX10-CONTRACT-NEXT: buffer_load_dword v34, off, s[0:3], s32 offset:12
; GFX10-CONTRACT-NEXT: buffer_load_dword v35, off, s[0:3], s32 offset:16
; GFX10-CONTRACT-NEXT: buffer_load_dword v36, off, s[0:3], s32 offset:20
; GFX10-CONTRACT-NEXT: buffer_load_dword v37, off, s[0:3], s32 offset:24
+; GFX10-CONTRACT-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-CONTRACT-NEXT: buffer_load_dword v38, off, s[0:3], s32 offset:28
; GFX10-CONTRACT-NEXT: buffer_load_dword v39, off, s[0:3], s32 offset:32
-; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(6)
+; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(7)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(4)
+; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(5)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(2)
+; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(3)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(0)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -777,20 +777,20 @@ define <4 x double> @test_f64_add_mul(<4 x double> %a, <4 x double> %b, <4 x dou
; GFX10-DENORM: ; %bb.0: ; %.entry
; GFX10-DENORM-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-DENORM-NEXT: s_clause 0x8
-; GFX10-DENORM-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-DENORM-NEXT: buffer_load_dword v32, off, s[0:3], s32 offset:4
; GFX10-DENORM-NEXT: buffer_load_dword v33, off, s[0:3], s32 offset:8
; GFX10-DENORM-NEXT: buffer_load_dword v34, off, s[0:3], s32 offset:12
; GFX10-DENORM-NEXT: buffer_load_dword v35, off, s[0:3], s32 offset:16
; GFX10-DENORM-NEXT: buffer_load_dword v36, off, s[0:3], s32 offset:20
; GFX10-DENORM-NEXT: buffer_load_dword v37, off, s[0:3], s32 offset:24
+; GFX10-DENORM-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-DENORM-NEXT: buffer_load_dword v38, off, s[0:3], s32 offset:28
; GFX10-DENORM-NEXT: buffer_load_dword v39, off, s[0:3], s32 offset:32
-; GFX10-DENORM-NEXT: s_waitcnt vmcnt(6)
+; GFX10-DENORM-NEXT: s_waitcnt vmcnt(7)
; GFX10-DENORM-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX10-DENORM-NEXT: s_waitcnt vmcnt(4)
+; GFX10-DENORM-NEXT: s_waitcnt vmcnt(5)
; GFX10-DENORM-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX10-DENORM-NEXT: s_waitcnt vmcnt(2)
+; GFX10-DENORM-NEXT: s_waitcnt vmcnt(3)
; GFX10-DENORM-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX10-DENORM-NEXT: s_waitcnt vmcnt(0)
; GFX10-DENORM-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -804,20 +804,20 @@ define <4 x double> @test_f64_add_mul(<4 x double> %a, <4 x double> %b, <4 x dou
; GFX11-CONTRACT: ; %bb.0: ; %.entry
; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-CONTRACT-NEXT: s_clause 0x8
-; GFX11-CONTRACT-NEXT: scratch_load_b32 v31, off, s32
; GFX11-CONTRACT-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-CONTRACT-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-CONTRACT-NEXT: scratch_load_b32 v34, off, s32 offset:12
; GFX11-CONTRACT-NEXT: scratch_load_b32 v35, off, s32 offset:16
; GFX11-CONTRACT-NEXT: scratch_load_b32 v36, off, s32 offset:20
; GFX11-CONTRACT-NEXT: scratch_load_b32 v37, off, s32 offset:24
+; GFX11-CONTRACT-NEXT: scratch_load_b32 v31, off, s32
; GFX11-CONTRACT-NEXT: scratch_load_b32 v38, off, s32 offset:28
; GFX11-CONTRACT-NEXT: scratch_load_b32 v39, off, s32 offset:32
-; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(6)
+; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(7)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(4)
+; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(5)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(2)
+; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(3)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(0)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -833,20 +833,20 @@ define <4 x double> @test_f64_add_mul(<4 x double> %a, <4 x double> %b, <4 x dou
; GFX11-DENORM: ; %bb.0: ; %.entry
; GFX11-DENORM-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-DENORM-NEXT: s_clause 0x8
-; GFX11-DENORM-NEXT: scratch_load_b32 v31, off, s32
; GFX11-DENORM-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-DENORM-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-DENORM-NEXT: scratch_load_b32 v34, off, s32 offset:12
; GFX11-DENORM-NEXT: scratch_load_b32 v35, off, s32 offset:16
; GFX11-DENORM-NEXT: scratch_load_b32 v36, off, s32 offset:20
; GFX11-DENORM-NEXT: scratch_load_b32 v37, off, s32 offset:24
+; GFX11-DENORM-NEXT: scratch_load_b32 v31, off, s32
; GFX11-DENORM-NEXT: scratch_load_b32 v38, off, s32 offset:28
; GFX11-DENORM-NEXT: scratch_load_b32 v39, off, s32 offset:32
-; GFX11-DENORM-NEXT: s_waitcnt vmcnt(6)
+; GFX11-DENORM-NEXT: s_waitcnt vmcnt(7)
; GFX11-DENORM-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX11-DENORM-NEXT: s_waitcnt vmcnt(4)
+; GFX11-DENORM-NEXT: s_waitcnt vmcnt(5)
; GFX11-DENORM-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX11-DENORM-NEXT: s_waitcnt vmcnt(2)
+; GFX11-DENORM-NEXT: s_waitcnt vmcnt(3)
; GFX11-DENORM-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX11-DENORM-NEXT: s_waitcnt vmcnt(0)
; GFX11-DENORM-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -921,20 +921,20 @@ define <4 x double> @test_f64_add_mul_rhs(<4 x double> %a, <4 x double> %b, <4 x
; GFX10-CONTRACT: ; %bb.0: ; %.entry
; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-CONTRACT-NEXT: s_clause 0x8
-; GFX10-CONTRACT-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-CONTRACT-NEXT: buffer_load_dword v32, off, s[0:3], s32 offset:4
; GFX10-CONTRACT-NEXT: buffer_load_dword v33, off, s[0:3], s32 offset:8
; GFX10-CONTRACT-NEXT: buffer_load_dword v34, off, s[0:3], s32 offset:12
; GFX10-CONTRACT-NEXT: buffer_load_dword v35, off, s[0:3], s32 offset:16
; GFX10-CONTRACT-NEXT: buffer_load_dword v36, off, s[0:3], s32 offset:20
; GFX10-CONTRACT-NEXT: buffer_load_dword v37, off, s[0:3], s32 offset:24
+; GFX10-CONTRACT-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-CONTRACT-NEXT: buffer_load_dword v38, off, s[0:3], s32 offset:28
; GFX10-CONTRACT-NEXT: buffer_load_dword v39, off, s[0:3], s32 offset:32
-; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(6)
+; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(7)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(4)
+; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(5)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(2)
+; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(3)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX10-CONTRACT-NEXT: s_waitcnt vmcnt(0)
; GFX10-CONTRACT-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -948,20 +948,20 @@ define <4 x double> @test_f64_add_mul_rhs(<4 x double> %a, <4 x double> %b, <4 x
; GFX10-DENORM: ; %bb.0: ; %.entry
; GFX10-DENORM-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-DENORM-NEXT: s_clause 0x8
-; GFX10-DENORM-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-DENORM-NEXT: buffer_load_dword v32, off, s[0:3], s32 offset:4
; GFX10-DENORM-NEXT: buffer_load_dword v33, off, s[0:3], s32 offset:8
; GFX10-DENORM-NEXT: buffer_load_dword v34, off, s[0:3], s32 offset:12
; GFX10-DENORM-NEXT: buffer_load_dword v35, off, s[0:3], s32 offset:16
; GFX10-DENORM-NEXT: buffer_load_dword v36, off, s[0:3], s32 offset:20
; GFX10-DENORM-NEXT: buffer_load_dword v37, off, s[0:3], s32 offset:24
+; GFX10-DENORM-NEXT: buffer_load_dword v31, off, s[0:3], s32
; GFX10-DENORM-NEXT: buffer_load_dword v38, off, s[0:3], s32 offset:28
; GFX10-DENORM-NEXT: buffer_load_dword v39, off, s[0:3], s32 offset:32
-; GFX10-DENORM-NEXT: s_waitcnt vmcnt(6)
+; GFX10-DENORM-NEXT: s_waitcnt vmcnt(7)
; GFX10-DENORM-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX10-DENORM-NEXT: s_waitcnt vmcnt(4)
+; GFX10-DENORM-NEXT: s_waitcnt vmcnt(5)
; GFX10-DENORM-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX10-DENORM-NEXT: s_waitcnt vmcnt(2)
+; GFX10-DENORM-NEXT: s_waitcnt vmcnt(3)
; GFX10-DENORM-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX10-DENORM-NEXT: s_waitcnt vmcnt(0)
; GFX10-DENORM-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -975,20 +975,20 @@ define <4 x double> @test_f64_add_mul_rhs(<4 x double> %a, <4 x double> %b, <4 x
; GFX11-CONTRACT: ; %bb.0: ; %.entry
; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-CONTRACT-NEXT: s_clause 0x8
-; GFX11-CONTRACT-NEXT: scratch_load_b32 v31, off, s32
; GFX11-CONTRACT-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-CONTRACT-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-CONTRACT-NEXT: scratch_load_b32 v34, off, s32 offset:12
; GFX11-CONTRACT-NEXT: scratch_load_b32 v35, off, s32 offset:16
; GFX11-CONTRACT-NEXT: scratch_load_b32 v36, off, s32 offset:20
; GFX11-CONTRACT-NEXT: scratch_load_b32 v37, off, s32 offset:24
+; GFX11-CONTRACT-NEXT: scratch_load_b32 v31, off, s32
; GFX11-CONTRACT-NEXT: scratch_load_b32 v38, off, s32 offset:28
; GFX11-CONTRACT-NEXT: scratch_load_b32 v39, off, s32 offset:32
-; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(6)
+; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(7)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[16:17], v[16:17], v[24:25], v[32:33]
-; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(4)
+; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(5)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[18:19], v[18:19], v[26:27], v[34:35]
-; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(2)
+; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(3)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[20:21], v[20:21], v[28:29], v[36:37]
; GFX11-CONTRACT-NEXT: s_waitcnt vmcnt(0)
; GFX11-CONTRACT-NEXT: v_fma_f64 v[22:23], v[22:23], v[30:31], v[38:39]
@@ -1004,20 +1004,20 @@ define <4 x double> @test_f64_add_mul_rhs(<4 x double> %a, <4 x double> %b, <4 x
; GFX11-DENORM: ; %bb.0: ; %.entry
; GFX11-DENORM-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-DENORM-NEXT: s_clause 0x8
-; GFX11-DENORM-NEXT: scratch_load_b32 v31, off, s32
; GFX11-DENORM-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-DENORM-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-DENORM-NEXT: scratch_load_b32 v34, off, s32 offse...
[truncated]
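The s_waitcnt changes in the tests above follow from how vmcnt works: s_waitcnt vmcnt(N) blocks until at most N vector-memory loads are still outstanding, and results within a clause arrive in issue order. Issuing the first-needed load first lets its consumer wait with a higher vmcnt threshold, leaving more of the clause in flight while computation proceeds. A tiny model of that arithmetic (my own illustration, assuming in-order completion):

#include <cstdio>
#include <vector>

// Model: a clause issues IssueOrder.size() loads back to back, and logical
// load 0 is the one whose result is consumed first.  With in-order
// completion, a consumer of the load issued k-th must wait for
// vmcnt(N - 1 - k); a smaller remaining count means a longer stall.
static void report(const char *Name, const std::vector<unsigned> &IssueOrder) {
  unsigned N = IssueOrder.size();
  for (unsigned K = 0; K != N; ++K)
    if (IssueOrder[K] == 0)
      printf("%s: first consumer waits for vmcnt(%u) of %u loads in flight\n",
             Name, N - 1 - K, N);
}

int main() {
  report("first-used load issued last ", {1, 2, 3, 0}); // waits for vmcnt(0)
  report("first-used load issued first", {0, 1, 2, 3}); // waits for vmcnt(3)
}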