MTM: improve operand latency when missing sched info #101389
Conversation
Do you have any data to support this claim?

I've tried variations of this patch in the past, but I think there's a good idea hidden here.

I should add that this is probably due to the fact that SPEC 2017 is polluted with short-running tests.

I knew that I'd messed something up, because such a crazy speed up was unexpected. The actual speed up with the revised version of the patch (and better benchmarking) is of the order of 1%, and this is across all targets. Full results can be viewed below.

I've updated the CodeGen tests for the RISC-V target as an example to help with the review, and unless I'm mistaken, I do see improvements mixed with changes that have no impact.

Gentle ping. I think the test changes can be summarized as:

I've updated the X86 tests now, and I think the patch is ready for review.
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-globalisel

Author: Ramkumar Ramachandra (artagnon)

Changes

TargetSchedModel::computeOperandLatency is supposed to return the exact latency between two MIs, although it is observed that InstrSchedModel and InstrItineraries are often unavailable in many real-world scenarios. When these two pieces of information are not available, the function returns an estimate that is much too conservative: the default def latency. MachineTraceMetrics is one of the callers affected quite badly by these conservative estimates. To improve the estimate, and let callers of MTM generate better code, offset the default def latency by the estimated cycles elapsed between the def MI and the use MI. Since we're trying to improve codegen in the case when no scheduling information is available, it is impossible to determine the exact number of cycles elapsed between the two MIs, so we use the distance between them, accounting for issue width, as a crude approximation. In practice, this improvement of one crude estimate by offsetting it with another crude estimate leads to better codegen on average, and yields non-trivial gains on standard benchmarks.

Patch is 281.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/101389.diff

77 Files Affected:
diff --git a/llvm/lib/CodeGen/MachineTraceMetrics.cpp b/llvm/lib/CodeGen/MachineTraceMetrics.cpp
index bf3add010574b8..c3afba23628130 100644
--- a/llvm/lib/CodeGen/MachineTraceMetrics.cpp
+++ b/llvm/lib/CodeGen/MachineTraceMetrics.cpp
@@ -20,6 +20,7 @@
#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/CodeGen/TargetSchedule.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
@@ -761,6 +762,64 @@ static void updatePhysDepsDownwards(const MachineInstr *UseMI,
}
}
+/// Estimates the number of cycles elapsed between DefMI and UseMI if they're
+/// non-null and in the same BasicBlock. Returns std::nullopt when UseMI is in a
+/// different MBB than DefMI, or when it is a dangling MI.
+static std::optional<unsigned>
+estimateDefUseCycles(const TargetSchedModel &Sched, const MachineInstr *DefMI,
+ const MachineInstr *UseMI) {
+ if (!DefMI || !UseMI || DefMI == UseMI)
+ return 0;
+ const MachineBasicBlock *ParentBB = DefMI->getParent();
+ if (ParentBB != UseMI->getParent())
+ return std::nullopt;
+
+ const auto DefIt =
+ llvm::find_if(ParentBB->instrs(),
+ [DefMI](const MachineInstr &MI) { return DefMI == &MI; });
+ const auto UseIt =
+ llvm::find_if(ParentBB->instrs(),
+ [UseMI](const MachineInstr &MI) { return UseMI == &MI; });
+ assert(std::distance(DefIt, UseIt) > 0 &&
+ "Def expected to appear before use");
+ unsigned NumMicroOps = 0;
+ for (auto It = DefIt; It != UseIt; ++It) {
+ // In some cases, UseMI is a dangling MI beyond the end of the MBB.
+ if (It.isEnd())
+ return std::nullopt;
+
+ NumMicroOps += Sched.getNumMicroOps(&*It);
+ }
+ return NumMicroOps / Sched.getIssueWidth();
+}
+
+/// Wraps Sched.computeOperandLatency, accounting for the case when
+/// InstrSchedModel and InstrItineraries are not available: in this case,
+/// Sched.computeOperandLatency returns DefaultDefLatency, which is a very rough
+/// approximate; to improve this approximate, offset it by the approximate
+/// cycles elapsed from DefMI to UseMI (since the MIs could be re-ordered by the
+/// scheduler, and we don't have this information, this cannot be known
+/// exactly). When scheduling information is available,
+/// Sched.computeOperandLatency returns a much better estimate (especially if
+/// UseMI is non-null), so we just return that.
+static unsigned computeOperandLatency(const TargetSchedModel &Sched,
+ const MachineInstr *DefMI,
+ unsigned DefOperIdx,
+ const MachineInstr *UseMI,
+ unsigned UseOperIdx) {
+ assert(DefMI && "Non-null DefMI expected");
+ if (!Sched.hasInstrSchedModel() && !Sched.hasInstrItineraries()) {
+ unsigned DefaultDefLatency = Sched.getInstrInfo()->defaultDefLatency(
+ *Sched.getMCSchedModel(), *DefMI);
+ std::optional<unsigned> DefUseCycles =
+ estimateDefUseCycles(Sched, DefMI, UseMI);
+ if (!DefUseCycles || DefaultDefLatency <= DefUseCycles)
+ return 0;
+ return DefaultDefLatency - *DefUseCycles;
+ }
+ return Sched.computeOperandLatency(DefMI, DefOperIdx, UseMI, UseOperIdx);
+}
+
/// The length of the critical path through a trace is the maximum of two path
/// lengths:
///
@@ -813,8 +872,8 @@ updateDepth(MachineTraceMetrics::TraceBlockInfo &TBI, const MachineInstr &UseMI,
unsigned DepCycle = Cycles.lookup(Dep.DefMI).Depth;
// Add latency if DefMI is a real instruction. Transients get latency 0.
if (!Dep.DefMI->isTransient())
- DepCycle += MTM.SchedModel
- .computeOperandLatency(Dep.DefMI, Dep.DefOp, &UseMI, Dep.UseOp);
+ DepCycle += computeOperandLatency(MTM.SchedModel, Dep.DefMI, Dep.DefOp,
+ &UseMI, Dep.UseOp);
Cycle = std::max(Cycle, DepCycle);
}
// Remember the instruction depth.
@@ -929,8 +988,8 @@ static unsigned updatePhysDepsUpwards(const MachineInstr &MI, unsigned Height,
if (!MI.isTransient()) {
// We may not know the UseMI of this dependency, if it came from the
// live-in list. SchedModel can handle a NULL UseMI.
- DepHeight += SchedModel.computeOperandLatency(&MI, MO.getOperandNo(),
- I->MI, I->Op);
+ DepHeight += computeOperandLatency(SchedModel, &MI, MO.getOperandNo(),
+ I->MI, I->Op);
}
Height = std::max(Height, DepHeight);
// This regunit is dead above MI.
@@ -963,10 +1022,9 @@ static bool pushDepHeight(const DataDep &Dep, const MachineInstr &UseMI,
unsigned UseHeight, MIHeightMap &Heights,
const TargetSchedModel &SchedModel,
const TargetInstrInfo *TII) {
- // Adjust height by Dep.DefMI latency.
if (!Dep.DefMI->isTransient())
- UseHeight += SchedModel.computeOperandLatency(Dep.DefMI, Dep.DefOp, &UseMI,
- Dep.UseOp);
+ UseHeight += computeOperandLatency(SchedModel, Dep.DefMI, Dep.DefOp, &UseMI,
+ Dep.UseOp);
// Update Heights[DefMI] to be the maximum height seen.
MIHeightMap::iterator I;
@@ -1192,8 +1250,8 @@ MachineTraceMetrics::Trace::getPHIDepth(const MachineInstr &PHI) const {
unsigned DepCycle = getInstrCycles(*Dep.DefMI).Depth;
// Add latency if DefMI is a real instruction. Transients get latency 0.
if (!Dep.DefMI->isTransient())
- DepCycle += TE.MTM.SchedModel.computeOperandLatency(Dep.DefMI, Dep.DefOp,
- &PHI, Dep.UseOp);
+ DepCycle += computeOperandLatency(TE.MTM.SchedModel, Dep.DefMI, Dep.DefOp,
+ &PHI, Dep.UseOp);
return DepCycle;
}
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/bitmanip.ll b/llvm/test/CodeGen/RISCV/GlobalISel/bitmanip.ll
index 5c42fefb95b39f..69261126cd8b0e 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/bitmanip.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/bitmanip.ll
@@ -94,15 +94,15 @@ define i7 @bitreverse_i7(i7 %x) {
; RV32-NEXT: or a1, a1, a2
; RV32-NEXT: slli a2, a0, 2
; RV32-NEXT: andi a2, a2, 16
+; RV32-NEXT: or a1, a1, a2
; RV32-NEXT: andi a0, a0, 127
-; RV32-NEXT: andi a3, a0, 8
-; RV32-NEXT: or a2, a2, a3
+; RV32-NEXT: andi a2, a0, 8
; RV32-NEXT: or a1, a1, a2
; RV32-NEXT: srli a2, a0, 2
; RV32-NEXT: andi a2, a2, 4
-; RV32-NEXT: srli a3, a0, 4
-; RV32-NEXT: andi a3, a3, 2
-; RV32-NEXT: or a2, a2, a3
+; RV32-NEXT: or a1, a1, a2
+; RV32-NEXT: srli a2, a0, 4
+; RV32-NEXT: andi a2, a2, 2
; RV32-NEXT: or a1, a1, a2
; RV32-NEXT: srli a0, a0, 6
; RV32-NEXT: or a0, a1, a0
@@ -117,15 +117,15 @@ define i7 @bitreverse_i7(i7 %x) {
; RV64-NEXT: or a1, a1, a2
; RV64-NEXT: slli a2, a0, 2
; RV64-NEXT: andi a2, a2, 16
+; RV64-NEXT: or a1, a1, a2
; RV64-NEXT: andi a0, a0, 127
-; RV64-NEXT: andi a3, a0, 8
-; RV64-NEXT: or a2, a2, a3
+; RV64-NEXT: andi a2, a0, 8
; RV64-NEXT: or a1, a1, a2
; RV64-NEXT: srliw a2, a0, 2
; RV64-NEXT: andi a2, a2, 4
-; RV64-NEXT: srliw a3, a0, 4
-; RV64-NEXT: andi a3, a3, 2
-; RV64-NEXT: or a2, a2, a3
+; RV64-NEXT: or a1, a1, a2
+; RV64-NEXT: srliw a2, a0, 4
+; RV64-NEXT: andi a2, a2, 2
; RV64-NEXT: or a1, a1, a2
; RV64-NEXT: srliw a0, a0, 6
; RV64-NEXT: or a0, a1, a0
@@ -145,24 +145,24 @@ define i24 @bitreverse_i24(i24 %x) {
; RV32-NEXT: or a0, a0, a1
; RV32-NEXT: lui a1, 1048335
; RV32-NEXT: addi a1, a1, 240
-; RV32-NEXT: and a3, a1, a2
-; RV32-NEXT: and a3, a0, a3
+; RV32-NEXT: and a3, a0, a1
+; RV32-NEXT: and a3, a3, a2
; RV32-NEXT: srli a3, a3, 4
; RV32-NEXT: slli a0, a0, 4
; RV32-NEXT: and a0, a0, a1
; RV32-NEXT: or a0, a3, a0
; RV32-NEXT: lui a1, 1047757
; RV32-NEXT: addi a1, a1, -820
-; RV32-NEXT: and a3, a1, a2
-; RV32-NEXT: and a3, a0, a3
+; RV32-NEXT: and a3, a0, a1
+; RV32-NEXT: and a3, a3, a2
; RV32-NEXT: srli a3, a3, 2
; RV32-NEXT: slli a0, a0, 2
; RV32-NEXT: and a0, a0, a1
; RV32-NEXT: or a0, a3, a0
; RV32-NEXT: lui a1, 1047211
; RV32-NEXT: addi a1, a1, -1366
-; RV32-NEXT: and a2, a1, a2
-; RV32-NEXT: and a2, a0, a2
+; RV32-NEXT: and a3, a0, a1
+; RV32-NEXT: and a2, a3, a2
; RV32-NEXT: srli a2, a2, 1
; RV32-NEXT: slli a0, a0, 1
; RV32-NEXT: and a0, a0, a1
@@ -179,24 +179,24 @@ define i24 @bitreverse_i24(i24 %x) {
; RV64-NEXT: or a0, a0, a1
; RV64-NEXT: lui a1, 1048335
; RV64-NEXT: addi a1, a1, 240
-; RV64-NEXT: and a3, a1, a2
-; RV64-NEXT: and a3, a0, a3
+; RV64-NEXT: and a3, a0, a1
+; RV64-NEXT: and a3, a3, a2
; RV64-NEXT: srliw a3, a3, 4
; RV64-NEXT: slli a0, a0, 4
; RV64-NEXT: and a0, a0, a1
; RV64-NEXT: or a0, a3, a0
; RV64-NEXT: lui a1, 1047757
; RV64-NEXT: addi a1, a1, -820
-; RV64-NEXT: and a3, a1, a2
-; RV64-NEXT: and a3, a0, a3
+; RV64-NEXT: and a3, a0, a1
+; RV64-NEXT: and a3, a3, a2
; RV64-NEXT: srliw a3, a3, 2
; RV64-NEXT: slli a0, a0, 2
; RV64-NEXT: and a0, a0, a1
; RV64-NEXT: or a0, a3, a0
; RV64-NEXT: lui a1, 1047211
; RV64-NEXT: addiw a1, a1, -1366
-; RV64-NEXT: and a2, a1, a2
-; RV64-NEXT: and a2, a0, a2
+; RV64-NEXT: and a3, a0, a1
+; RV64-NEXT: and a2, a3, a2
; RV64-NEXT: srliw a2, a2, 1
; RV64-NEXT: slliw a0, a0, 1
; RV64-NEXT: and a0, a0, a1
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/vararg.ll b/llvm/test/CodeGen/RISCV/GlobalISel/vararg.ll
index d55adf371119b5..5723c4b9197a6a 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/vararg.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/vararg.ll
@@ -1266,8 +1266,8 @@ define i32 @va4_va_copy(i32 %argno, ...) nounwind {
; RV32-NEXT: sw a3, 4(sp)
; RV32-NEXT: lw a2, 0(a2)
; RV32-NEXT: add a0, a0, s0
-; RV32-NEXT: add a1, a1, a2
; RV32-NEXT: add a0, a0, a1
+; RV32-NEXT: add a0, a0, a2
; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32-NEXT: lw s0, 8(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 48
@@ -1319,8 +1319,8 @@ define i32 @va4_va_copy(i32 %argno, ...) nounwind {
; RV64-NEXT: sd a3, 8(sp)
; RV64-NEXT: lw a2, 0(a2)
; RV64-NEXT: add a0, a0, s0
-; RV64-NEXT: add a1, a1, a2
-; RV64-NEXT: addw a0, a0, a1
+; RV64-NEXT: add a0, a0, a1
+; RV64-NEXT: addw a0, a0, a2
; RV64-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
; RV64-NEXT: ld s0, 16(sp) # 8-byte Folded Reload
; RV64-NEXT: addi sp, sp, 96
@@ -1371,8 +1371,8 @@ define i32 @va4_va_copy(i32 %argno, ...) nounwind {
; RV32-WITHFP-NEXT: sw a3, -16(s0)
; RV32-WITHFP-NEXT: lw a2, 0(a2)
; RV32-WITHFP-NEXT: add a0, a0, s1
-; RV32-WITHFP-NEXT: add a1, a1, a2
; RV32-WITHFP-NEXT: add a0, a0, a1
+; RV32-WITHFP-NEXT: add a0, a0, a2
; RV32-WITHFP-NEXT: lw ra, 28(sp) # 4-byte Folded Reload
; RV32-WITHFP-NEXT: lw s0, 24(sp) # 4-byte Folded Reload
; RV32-WITHFP-NEXT: lw s1, 20(sp) # 4-byte Folded Reload
@@ -1427,8 +1427,8 @@ define i32 @va4_va_copy(i32 %argno, ...) nounwind {
; RV64-WITHFP-NEXT: sd a3, -32(s0)
; RV64-WITHFP-NEXT: lw a2, 0(a2)
; RV64-WITHFP-NEXT: add a0, a0, s1
-; RV64-WITHFP-NEXT: add a1, a1, a2
-; RV64-WITHFP-NEXT: addw a0, a0, a1
+; RV64-WITHFP-NEXT: add a0, a0, a1
+; RV64-WITHFP-NEXT: addw a0, a0, a2
; RV64-WITHFP-NEXT: ld ra, 40(sp) # 8-byte Folded Reload
; RV64-WITHFP-NEXT: ld s0, 32(sp) # 8-byte Folded Reload
; RV64-WITHFP-NEXT: ld s1, 24(sp) # 8-byte Folded Reload
diff --git a/llvm/test/CodeGen/RISCV/addcarry.ll b/llvm/test/CodeGen/RISCV/addcarry.ll
index 3a4163a8bb50f9..053b98755417b2 100644
--- a/llvm/test/CodeGen/RISCV/addcarry.ll
+++ b/llvm/test/CodeGen/RISCV/addcarry.ll
@@ -18,9 +18,9 @@ define i64 @addcarry(i64 %x, i64 %y) nounwind {
; RISCV32-NEXT: sltu a7, a4, a6
; RISCV32-NEXT: sltu a5, a6, a5
; RISCV32-NEXT: mulhu a6, a0, a3
-; RISCV32-NEXT: mulhu t0, a1, a2
-; RISCV32-NEXT: add a6, a6, t0
; RISCV32-NEXT: add a5, a6, a5
+; RISCV32-NEXT: mulhu a6, a1, a2
+; RISCV32-NEXT: add a5, a5, a6
; RISCV32-NEXT: add a5, a5, a7
; RISCV32-NEXT: mul a6, a1, a3
; RISCV32-NEXT: add a5, a5, a6
diff --git a/llvm/test/CodeGen/RISCV/atomicrmw-uinc-udec-wrap.ll b/llvm/test/CodeGen/RISCV/atomicrmw-uinc-udec-wrap.ll
index 634ed45044ee21..672625c182d0b5 100644
--- a/llvm/test/CodeGen/RISCV/atomicrmw-uinc-udec-wrap.ll
+++ b/llvm/test/CodeGen/RISCV/atomicrmw-uinc-udec-wrap.ll
@@ -227,8 +227,8 @@ define i16 @atomicrmw_uinc_wrap_i16(ptr %ptr, i16 %val) {
; RV32IA-NEXT: addi a5, a5, 1
; RV32IA-NEXT: sltu a7, a7, a1
; RV32IA-NEXT: neg a7, a7
-; RV32IA-NEXT: and a5, a5, a3
; RV32IA-NEXT: and a5, a7, a5
+; RV32IA-NEXT: and a5, a5, a3
; RV32IA-NEXT: sll a5, a5, a0
; RV32IA-NEXT: and a7, a6, a4
; RV32IA-NEXT: or a7, a7, a5
@@ -307,8 +307,8 @@ define i16 @atomicrmw_uinc_wrap_i16(ptr %ptr, i16 %val) {
; RV64IA-NEXT: addi a6, a6, 1
; RV64IA-NEXT: sltu t0, t0, a1
; RV64IA-NEXT: negw t0, t0
-; RV64IA-NEXT: and a6, a6, a3
; RV64IA-NEXT: and a6, t0, a6
+; RV64IA-NEXT: and a6, a6, a3
; RV64IA-NEXT: sllw a6, a6, a0
; RV64IA-NEXT: and a4, a4, a5
; RV64IA-NEXT: or a6, a4, a6
diff --git a/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-common.ll b/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-common.ll
index 278187f62cd75e..8bcdb059a95fbc 100644
--- a/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-common.ll
+++ b/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-common.ll
@@ -94,15 +94,15 @@ define i32 @callee_aligned_stack(i32 %a, i32 %b, fp128 %c, i32 %d, i32 %e, i64 %
; RV32I-FPELIM-LABEL: callee_aligned_stack:
; RV32I-FPELIM: # %bb.0:
; RV32I-FPELIM-NEXT: lw a0, 0(a2)
-; RV32I-FPELIM-NEXT: lw a1, 8(sp)
+; RV32I-FPELIM-NEXT: lw a1, 20(sp)
; RV32I-FPELIM-NEXT: lw a2, 0(sp)
-; RV32I-FPELIM-NEXT: lw a3, 20(sp)
+; RV32I-FPELIM-NEXT: lw a3, 8(sp)
; RV32I-FPELIM-NEXT: lw a4, 16(sp)
; RV32I-FPELIM-NEXT: add a0, a0, a7
-; RV32I-FPELIM-NEXT: add a1, a2, a1
-; RV32I-FPELIM-NEXT: add a0, a0, a1
-; RV32I-FPELIM-NEXT: add a3, a4, a3
+; RV32I-FPELIM-NEXT: add a0, a0, a2
; RV32I-FPELIM-NEXT: add a0, a0, a3
+; RV32I-FPELIM-NEXT: add a0, a0, a4
+; RV32I-FPELIM-NEXT: add a0, a0, a1
; RV32I-FPELIM-NEXT: ret
;
; RV32I-WITHFP-LABEL: callee_aligned_stack:
@@ -112,15 +112,15 @@ define i32 @callee_aligned_stack(i32 %a, i32 %b, fp128 %c, i32 %d, i32 %e, i64 %
; RV32I-WITHFP-NEXT: sw s0, 8(sp) # 4-byte Folded Spill
; RV32I-WITHFP-NEXT: addi s0, sp, 16
; RV32I-WITHFP-NEXT: lw a0, 0(a2)
-; RV32I-WITHFP-NEXT: lw a1, 8(s0)
+; RV32I-WITHFP-NEXT: lw a1, 20(s0)
; RV32I-WITHFP-NEXT: lw a2, 0(s0)
-; RV32I-WITHFP-NEXT: lw a3, 20(s0)
+; RV32I-WITHFP-NEXT: lw a3, 8(s0)
; RV32I-WITHFP-NEXT: lw a4, 16(s0)
; RV32I-WITHFP-NEXT: add a0, a0, a7
-; RV32I-WITHFP-NEXT: add a1, a2, a1
-; RV32I-WITHFP-NEXT: add a0, a0, a1
-; RV32I-WITHFP-NEXT: add a3, a4, a3
+; RV32I-WITHFP-NEXT: add a0, a0, a2
; RV32I-WITHFP-NEXT: add a0, a0, a3
+; RV32I-WITHFP-NEXT: add a0, a0, a4
+; RV32I-WITHFP-NEXT: add a0, a0, a1
; RV32I-WITHFP-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32I-WITHFP-NEXT: lw s0, 8(sp) # 4-byte Folded Reload
; RV32I-WITHFP-NEXT: addi sp, sp, 16
diff --git a/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-ilp32d-common.ll b/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-ilp32d-common.ll
index 231ed159ab2061..4906cc8eb73a53 100644
--- a/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-ilp32d-common.ll
+++ b/llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-ilp32d-common.ll
@@ -87,16 +87,16 @@ define i32 @callee_many_scalars(i8 %a, i16 %b, i32 %c, i64 %d, i32 %e, i32 %f, i
; RV32I-FPELIM-NEXT: andi a0, a0, 255
; RV32I-FPELIM-NEXT: slli a1, a1, 16
; RV32I-FPELIM-NEXT: srli a1, a1, 16
-; RV32I-FPELIM-NEXT: add a0, a0, a2
; RV32I-FPELIM-NEXT: add a0, a0, a1
+; RV32I-FPELIM-NEXT: add a0, a0, a2
; RV32I-FPELIM-NEXT: xor a1, a4, t1
; RV32I-FPELIM-NEXT: xor a2, a3, a7
; RV32I-FPELIM-NEXT: or a1, a2, a1
; RV32I-FPELIM-NEXT: seqz a1, a1
+; RV32I-FPELIM-NEXT: add a0, a1, a0
; RV32I-FPELIM-NEXT: add a0, a0, a5
; RV32I-FPELIM-NEXT: add a0, a0, a6
; RV32I-FPELIM-NEXT: add a0, a0, t0
-; RV32I-FPELIM-NEXT: add a0, a1, a0
; RV32I-FPELIM-NEXT: ret
;
; RV32I-WITHFP-LABEL: callee_many_scalars:
@@ -110,16 +110,16 @@ define i32 @callee_many_scalars(i8 %a, i16 %b, i32 %c, i64 %d, i32 %e, i32 %f, i
; RV32I-WITHFP-NEXT: andi a0, a0, 255
; RV32I-WITHFP-NEXT: slli a1, a1, 16
; RV32I-WITHFP-NEXT: srli a1, a1, 16
-; RV32I-WITHFP-NEXT: add a0, a0, a2
; RV32I-WITHFP-NEXT: add a0, a0, a1
+; RV32I-WITHFP-NEXT: add a0, a0, a2
; RV32I-WITHFP-NEXT: xor a1, a4, t1
; RV32I-WITHFP-NEXT: xor a2, a3, a7
; RV32I-WITHFP-NEXT: or a1, a2, a1
; RV32I-WITHFP-NEXT: seqz a1, a1
+; RV32I-WITHFP-NEXT: add a0, a1, a0
; RV32I-WITHFP-NEXT: add a0, a0, a5
; RV32I-WITHFP-NEXT: add a0, a0, a6
; RV32I-WITHFP-NEXT: add a0, a0, t0
-; RV32I-WITHFP-NEXT: add a0, a1, a0
; RV32I-WITHFP-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32I-WITHFP-NEXT: lw s0, 8(sp) # 4-byte Folded Reload
; RV32I-WITHFP-NEXT: addi sp, sp, 16
@@ -614,15 +614,15 @@ define i32 @callee_aligned_stack(i32 %a, i32 %b, fp128 %c, i32 %d, i32 %e, i64 %
; RV32I-FPELIM-LABEL: callee_aligned_stack:
; RV32I-FPELIM: # %bb.0:
; RV32I-FPELIM-NEXT: lw a0, 0(a2)
-; RV32I-FPELIM-NEXT: lw a1, 8(sp)
+; RV32I-FPELIM-NEXT: lw a1, 20(sp)
; RV32I-FPELIM-NEXT: lw a2, 0(sp)
-; RV32I-FPELIM-NEXT: lw a3, 20(sp)
+; RV32I-FPELIM-NEXT: lw a3, 8(sp)
; RV32I-FPELIM-NEXT: lw a4, 16(sp)
; RV32I-FPELIM-NEXT: add a0, a0, a7
-; RV32I-FPELIM-NEXT: add a1, a2, a1
-; RV32I-FPELIM-NEXT: add a0, a0, a1
-; RV32I-FPELIM-NEXT: add a3, a4, a3
+; RV32I-FPELIM-NEXT: add a0, a0, a2
; RV32I-FPELIM-NEXT: add a0, a0, a3
+; RV32I-FPELIM-NEXT: add a0, a0, a4
+; RV32I-FPELIM-NEXT: add a0, a0, a1
; RV32I-FPELIM-NEXT: ret
;
; RV32I-WITHFP-LABEL: callee_aligned_stack:
@@ -632,15 +632,15 @@ define i32 @callee_aligned_stack(i32 %a, i32 %b, fp128 %c, i32 %d, i32 %e, i64 %
; RV32I-WITHFP-NEXT: sw s0, 8(sp) # 4-byte Folded Spill
; RV32I-WITHFP-NEXT: addi s0, sp, 16
; RV32I-WITHFP-NEXT: lw a0, 0(a2)
-; RV32I-WITHFP-NEXT: lw a1, 8(s0)
+; RV32I-WITHFP-NEXT: lw a1, 20(s0)
; RV32I-WITHFP-NEXT: lw a2, 0(s0)
-; RV32I-WITHFP-NEXT: lw a3, 20(s0)
+; RV32I-WITHFP-NEXT: lw a3, 8(s0)
; RV32I-WITHFP-NEXT: lw a4, 16(s0)
; RV32I-WITHFP-NEXT: add a0, a0, a7
-; RV32I-WITHFP-NEXT: add a1, a2, a1
-; RV32I-WITHFP-NEXT: add a0, a0, a1
-; RV32I-WITHFP-NEXT: add a3, a4, a3
+; RV32I-WITHFP-NEXT: add a0, a0, a2
; RV32I-WITHFP-NEXT: add a0, a0, a3
+; RV32I-WITHFP-NEXT: add a0, a0, a4
+; RV32I-WITHFP-NEXT: add a0, a0, a1
; RV32I-WITHFP-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32I-WITHFP-NEXT: lw s0, 8(sp) # 4-byte Folded Reload
; RV32I-WITHFP-NEXT: addi sp, sp, 16
diff --git a/llvm/test/CodeGen/RISCV/calling-conv-ilp32e.ll b/llvm/test/CodeGen/RISCV/calling-conv-ilp32e.ll
index d08cf577b1bdd3..69691869997666 100644
--- a/llvm/test/CodeGen/RISCV/calling-conv-ilp32e.ll
+++ b/llvm/test/CodeGen/RISCV/calling-conv-ilp32e.ll
@@ -529,16 +529,16 @@ define i32 @callee_aligned_stack(i32 %a, i32 %b, fp128 %c, i32 %d, i32 %e, i64 %
; ILP32E-FPELIM-LABEL: callee_aligned_stack:
; ILP32E-FPELIM: # %bb.0:
; ILP32E-FPELIM-NEXT: lw a0, 0(a2)
-; ILP32E-FPELIM-NEXT: lw a1, 12(sp)
+; ILP32E-FPELIM-NEXT: lw a1, 24(sp)
; ILP32E-FPELIM-NEXT: lw a2, 4(sp)
; ILP32E-FPELIM-NEXT: lw a3, 8(sp)
-; ILP32E-FPELIM-NEXT: lw a4, 24(sp)
+; ILP32E-FPELIM-NEXT: lw a4, 12(sp)
; ILP32E-FPELIM-NEXT: lw a5, 20(sp)
; ILP32E-FPELIM-NEXT: add a0, a0, a2
-; ILP32E-FPELIM-NEXT: add a1, a3, a1
-; ILP32E-FPELIM-NEXT: add a0, a0, a1
-; ILP32E-FPELIM-NEXT: add a4, a5, a4
+; ILP32E-FPELIM-NEXT: add a0, a0, a3
; ILP32E-FP...
[truncated]
I wonder if it would be useful to use defaultDefLatency for the instructions we are iterating over here, in case some of those instructions depend on each other. Imagine a scenario:
defmi = ...
a = ...
b = use a
usemi = ...
In this case, what if we checked defaultDefLatency of a and used it to understand the number of cycles elapsed between a and b? For example, if a has a default latency of 10, then b can't really start in the next NumMicroOps / IssueWidth cycles, since it has to wait 10 additional cycles.
Sorry, I'm having trouble understanding your example. Here's what I think I understand:
c = ...
a = ...
b = use a
d = use c
Here, if c has a default def latency of N cycles and a has that of M cycles, b can start after M - 1 cycles, and d can start after N - 3 cycles, assuming issue-width = num-micro-ops = 1. What do you mean by "instructions that depend on each other"? How can one instruction depend on another, which in turn depends on the first? Wouldn't that break basic dominance criteria? Also, if I understand correctly, I think we're in SSA form at this point, so we don't have to worry about re-definitions.
We have a loop that iterates over instructions between DefMI and UseMI:
for (auto It = DefIt; It != UseIt; ++It) {
  // In cases where the UseMI is a PHI at the beginning of the MBB, compute
  // MicroOps until the end of the MBB.
  if (It.isEnd())
    break;
  NumMicroOps += Sched.getNumMicroOps(&*It);
}
I suggest the following scenario:
defmi = ...
a = ...
b = use a
usemi = ...
In this case, we will be looping over instructions a and b and adding their number of micro ops to calculate the number of cycles elapsed between defmi and usemi.
Let's take the assumption that a has default def latency of N cycles and b has default latency of M cycles.
What do you mean by "instructions that depend on each other"?
In my scenario, b uses the result of a. It cannot start until that result is ready (an extra N cycles). If we want to make it concrete, we could imagine that it looks like this:
a = add 3, 2
b = sub a, 2
We cannot start the subtraction until a has finished calculating. This is what I am calling a dependency. We can assume for the sake of simplicity here that a and b are independent of defmi and usemi.
In this scenario, I am suggesting that if the default latency of a is larger than the number of micro-ops, then we must wait at least the default latency of a before starting b. I suggest that we can incorporate this into the estimation.
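For illustration only, here is a rough sketch of the kind of adjustment being suggested. The helper name and structure are hypothetical (not code from this patch), and a real version would also have to verify that the intermediate def is actually consumed by a later instruction before UseMI:

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSchedule.h"
#include <algorithm>
using namespace llvm;

// Hypothetical sketch: clamp the micro-op-based estimate from below by the
// largest default def latency among the intermediate instructions, on the
// assumption that an intermediate result (like 'a' above) must be ready
// before a later intermediate user (like 'b') can start.
static unsigned estimateElapsedCycles(const TargetSchedModel &Sched,
                                      MachineBasicBlock::const_instr_iterator DefIt,
                                      MachineBasicBlock::const_instr_iterator UseIt) {
  unsigned NumMicroOps = 0;
  unsigned MaxIntermediateLatency = 0;
  for (auto It = DefIt; It != UseIt; ++It) {
    NumMicroOps += Sched.getNumMicroOps(&*It);
    // Skip DefIt itself: the caller already accounts for DefMI's own latency.
    if (It == DefIt)
      continue;
    MaxIntermediateLatency = std::max(
        MaxIntermediateLatency,
        Sched.getInstrInfo()->defaultDefLatency(*Sched.getMCSchedModel(), *It));
  }
  return std::max(NumMicroOps / Sched.getIssueWidth(), MaxIntermediateLatency);
}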
Ah, thanks for the explanation! However, I think we're missing the fact that CPUs are usually pipelined, and the defmi-usemi dependency will be in one pipeline, while the a-b dependency will be in another pipeline. Hence, I think the defmi-usemi dependency should be independent of the a-b dependency. When the code is called with DefMI = a, and UseMI = b, it will return the correct answer for that dependency.
Now, we haven't actually modeled any pipelines, but do you think this is feasible?
It is a good point that if there are multiple pipelines and the defmi-usemi pair goes down one while the a-b pair goes down another, then what you have is correct. There is definitely also a scenario where these dependency pairs go down the same pipeline, and in that case what I am suggesting is probably the better model.
Unfortunately we don't have any pipeline information because we don't have the scheduler model, so we don't actually know what we should do.
One argument is to keep what we have now because it's simple and less expensive to compute.
Another argument is to pick the "more common" approach. I'd prefer not to make a blanket statement and say that "it is more likely for independent instructions to go down different pipelines", although I wouldn't be surprised if this was the case.
For these reasons, I am content with the approach you are proposing. Happy to see if anyone else has thoughts on this.
I may have missed something, but the use of micro-ops is worrying me - many arches don't guarantee that uop count and latency are a close match (e.g. alderlake divpd ymm: uops=1, latency=15). And then dividing by issue width makes it feel more like a throughput estimate than a latency estimate?
I see. Will alderlake divpd not dispatch in one cycle, though? We're offsetting DefaultDefLatency by the number of cycles elapsed between the DefMI and UseMI. If there's an alderlake divpd between the DefMI and UseMI, should we be subtracting 15 cycles for it?
@RKSimon do you still have concerns about use of micro-ops and issue width here?
I think it makes sense to use issue width in determining "the number of cycles between two instructions" in the calculation here. For example:
a = def
b = ...
c = use
If the issue width is 1, then a is issued in one cycle, and b the next. But if the issue width is 2, then a and b are issued in the same cycle. In the former case we should estimate 2 cycles between [a, c), and in the latter estimate 1 cycle.
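Concretely, with the micro-op counting in the patch (and assuming a and b are each a single micro-op), the estimate works out as:

  elapsed([a, c)) = (uops(a) + uops(b)) / IssueWidth
                  = 2 / 1 = 2 cycles  with issue width 1
                  = 2 / 2 = 1 cycle   with issue width 2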
Doesn't this make assumptions that the ops can be issued on any pipe? I'm still not convinced your approach makes sense, tbh.
Since we don't have the scheduling model, we could either assume that all ops are issued on the same pipeline, or that independent ops are always issued on different pipelines. I think the latter case is more common. The patch is a first-order improvement over DefaultDefLatency: it is by no means an accurate representation of how the machine functions, but wouldn't you agree that the patch is an improvement over DefaultDefLatency?
Of course, if you have a better idea for how to improve the latency computation in the fallback case, please do suggest it.
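For reference, the fallback being discussed boils down to the following simplified restatement of the code in the patch; the worked numbers in the comment are illustrative only, not measurements from any target:

// Simplified restatement of the patch's fallback: subtract the estimated
// elapsed cycles from the default def latency, clamping at zero.
// Illustrative example: DefaultDefLatency = 4; the loop over [DefMI, UseMI)
// sees three single-micro-op instructions; with issue width 1,
// ElapsedCycles = 3 / 1 = 3, and the adjusted latency is 4 - 3 = 1.
static unsigned adjustDefaultLatency(unsigned DefaultDefLatency,
                                     unsigned ElapsedCycles) {
  return DefaultDefLatency > ElapsedCycles ? DefaultDefLatency - ElapsedCycles
                                           : 0;
}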
force-pushed from 43b0fc2 to e921ec6
Rebase and ping.

@artagnon please can you rebase to fix the merge conflict?
force-pushed from e921ec6 to d10bb30
Sorry for the delay; rebased now.

Gentle ping. Do any of the other reviewers have any comments?

Gentle ping.
force-pushed from 024690e to 0e56b6d
Rebase (yet again), with a change: use
force-pushed from 0e56b6d to 08ea248
Gentle ping.
arsenm left a comment
Is this only useful for incomplete targets?
It is useful when the CPU doesn't have a scheduler descriptor checked into the tree. This happens in the real world on my X86 box, for instance.

what x86 are you missing scheduler support for?

I should have been clearer: I think there are scheduler descriptor files for most (if not all) CPUs in the X86 world in the tree, but since my distro's Clang can't know about my hardware, an
TargetSchedModel::computeOperandLatency is supposed to return the exact latency between two MIs, although it is observed that InstrSchedModel and InstrItineraries are often unavailable in many real-world scenarios. When these two pieces of information are not available, the function returns an estimate that is much too conservative: the default def latency. MachineTraceMetrics is one of the callers affected quite badly by these conservative estimates. To improve the estimate, and let callers of MTM generate better code, offset the default def latency by the estimated cycles elapsed between the def MI and the use MI. Since we're trying to improve codegen in the case when no scheduling information is available, it is impossible to determine the number of cycles elapsed between the two MIs, so we use the distance between them as a crude approximation. In practice, this improvement of one crude estimate by offsetting it with another crude estimate leads to better codegen on average, and yields huge gains on standard benchmarks.
force-pushed from 08ea248 to f043d1c
Rebase (yet again). Could we kindly converge on whether or not this patch is useful?
force-pushed from ab678f8 to 7f39018
                                      const MachineInstr *UseMI,
                                      unsigned UseOperIdx) {
  assert(DefMI && "Non-null DefMI expected");
  if (!Sched.hasInstrSchedModel() && !Sched.hasInstrItineraries()) {
Instead of placing this in this wrapper, could the default implementation of computeOperandLatency handle this?
; X86-NEXT: movl {{[0-9]+}}(%esp), %ebx
; X86-NEXT: movl {{[0-9]+}}(%esp), %ebp
; X86-NEXT: pushl %ebp
; X86-NEXT: movl {{[0-9]+}}(%esp), %ebx
Should fix the missing models that cause any of the test changes?
After #111865, there should be no changes in the X86 tests except in early-ifcvt-remarks.ll, which already specifies an mcpu that doesn't have scheduling information for all the instructions used. I'm not sure what we can do about the RISCV tests though: the only models available in the tree are SiFive's and Syntacore's; in practice, RISCV is very diverse, and there are lots of scheduling models downstream, so I think we'll have to live with those test changes.
I think we weren't able to conclude that this is a good idea. Not pursuing.
TargetSchedModel::computeOperandLatency is supposed to return the exact latency between two MIs, although it is observed that InstrSchedModel and InstrItineraries are often unavailable in many real-world scenarios. When these two pieces of information are not available, the function returns an estimate that is much too conservative: the default def latency. MachineTraceMetrics is one of the callers affected quite badly by these conservative estimates. To improve the estimate, and let callers of MTM generate better code, offset the default def latency by the estimated cycles elapsed between the def MI and the use MI. Since we're trying to improve codegen in the case when no scheduling information is available, it is impossible to determine the exact number of cycles elapsed between the two MIs, so we use the distance between them, accounting for issue width, as an approximation. In practice, this improvement of one estimate by offsetting it with another estimate leads to better codegen on average, and yields non-trivial gains on standard benchmarks.