-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[LICM] Sink unused l-invariant loads in preheader. #157559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Unused loop invariant loads were not sunk from preheader to exit block which increased live range. This commit sinks unused invariant loads from preheader to exit block when the load's memory location is not modified within the loop body.
|
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-transforms Author: Vigneshwar Jayakumar (VigneshwarJ) ChangesUnused loop invariant loads were not sunk from preheader to exit block which increased live range. This commit sinks unused invariant loads from preheader to exit block when the load's memory location is not modified within the loop body. Full diff: https://github.com/llvm/llvm-project/pull/157559.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
index c32731185afd0..61138e379607c 100644
--- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
+++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
@@ -130,6 +130,7 @@ class IndVarSimplify {
const DataLayout &DL;
TargetLibraryInfo *TLI;
const TargetTransformInfo *TTI;
+ AliasAnalysis *AA;
std::unique_ptr<MemorySSAUpdater> MSSAU;
SmallVector<WeakTrackingVH, 16> DeadInsts;
@@ -162,8 +163,9 @@ class IndVarSimplify {
public:
IndVarSimplify(LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,
const DataLayout &DL, TargetLibraryInfo *TLI,
- TargetTransformInfo *TTI, MemorySSA *MSSA, bool WidenIndVars)
- : LI(LI), SE(SE), DT(DT), DL(DL), TLI(TLI), TTI(TTI),
+ TargetTransformInfo *TTI, AliasAnalysis *AA, MemorySSA *MSSA,
+ bool WidenIndVars)
+ : LI(LI), SE(SE), DT(DT), DL(DL), TLI(TLI), TTI(TTI), AA(AA),
WidenIndVars(WidenIndVars) {
if (MSSA)
MSSAU = std::make_unique<MemorySSAUpdater>(MSSA);
@@ -1089,6 +1091,16 @@ bool IndVarSimplify::sinkUnusedInvariants(Loop *L) {
if (!Preheader) return false;
bool MadeAnyChanges = false;
+
+ // Collect all store instructions that may modify memory in the loop.
+ SmallPtrSet<Instruction *, 8> Stores;
+ for (BasicBlock *BB : L->blocks()) {
+ for (Instruction &I : *BB) {
+ if (I.mayWriteToMemory())
+ Stores.insert(&I);
+ }
+ }
+
for (Instruction &I : llvm::make_early_inc_range(llvm::reverse(*Preheader))) {
// Skip BB Terminator.
@@ -1100,14 +1112,32 @@ bool IndVarSimplify::sinkUnusedInvariants(Loop *L) {
break;
// Don't move instructions which might have side effects, since the side
- // effects need to complete before instructions inside the loop. Also don't
- // move instructions which might read memory, since the loop may modify
- // memory. Note that it's okay if the instruction might have undefined
- // behavior: LoopSimplify guarantees that the preheader dominates the exit
- // block.
- if (I.mayHaveSideEffects() || I.mayReadFromMemory())
+ // effects need to complete before instructions inside the loop. Note that
+ // it's okay if the instruction might have undefined behavior: LoopSimplify
+ // guarantees that the preheader dominates the exit block.
+ if (I.mayHaveSideEffects())
continue;
+ // Don't sink read instruction which's memory might be modified in the loop.
+ if (I.mayReadFromMemory()) {
+ if (LoadInst *Load = dyn_cast<LoadInst>(&I)) {
+ MemoryLocation Loc = MemoryLocation::get(Load);
+ bool isModified = false;
+
+ // Check if any store instruction in the loop modifies the loaded memory
+ // location.
+ for (Instruction *S : Stores) {
+ if (isModSet(AA->getModRefInfo(S, Loc))) {
+ isModified = true;
+ break;
+ }
+ }
+ if (isModified)
+ continue;
+ } else {
+ continue;
+ }
+ }
// Skip debug or pseudo instructions.
if (I.isDebugOrPseudoInst())
continue;
@@ -2042,8 +2072,8 @@ PreservedAnalyses IndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &AM,
Function *F = L.getHeader()->getParent();
const DataLayout &DL = F->getDataLayout();
- IndVarSimplify IVS(&AR.LI, &AR.SE, &AR.DT, DL, &AR.TLI, &AR.TTI, AR.MSSA,
- WidenIndVars && AllowIVWidening);
+ IndVarSimplify IVS(&AR.LI, &AR.SE, &AR.DT, DL, &AR.TLI, &AR.TTI, &AR.AA,
+ AR.MSSA, WidenIndVars && AllowIVWidening);
if (!IVS.run(&L))
return PreservedAnalyses::all();
diff --git a/llvm/test/Transforms/IndVarSimplify/sink-from-preheader.ll b/llvm/test/Transforms/IndVarSimplify/sink-from-preheader.ll
index 89583f9131518..8d714c775bde5 100644
--- a/llvm/test/Transforms/IndVarSimplify/sink-from-preheader.ll
+++ b/llvm/test/Transforms/IndVarSimplify/sink-from-preheader.ll
@@ -30,3 +30,64 @@ loop:
exit:
ret i32 %add
}
+
+define i32 @test_with_unused_load(i32 %a, ptr %b, i32 %N) {
+; CHECK-LABEL: @test_with_unused_load(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: br label [[LOOP:%.*]]
+; CHECK: loop:
+; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[IV_NEXT]], [[N:%.*]]
+; CHECK-NEXT: br i1 [[CMP]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK: exit:
+; CHECK-NEXT: [[LOAD:%.*]] = load i32, ptr [[B:%.*]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add i32 [[A:%.*]], [[LOAD]]
+; CHECK-NEXT: ret i32 [[ADD]]
+;
+entry:
+ %load = load i32, ptr %b
+ %add = add i32 %a, %load
+ br label %loop
+
+loop:
+ %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
+ %iv.next = add i32 %iv, 1
+ %cmp = icmp slt i32 %iv.next, %N
+ br i1 %cmp, label %loop, label %exit
+
+exit:
+ ret i32 %add
+}
+
+; load's memory location is modified inside the loop, don't sink load.
+define i32 @test_with_unused_load_modified_store(i32 %a, ptr %b, i32 %N) {
+; CHECK-LABEL: @test_with_unused_load_modified_store(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[LOAD:%.*]] = load i32, ptr [[B:%.*]], align 4
+; CHECK-NEXT: br label [[LOOP:%.*]]
+; CHECK: loop:
+; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
+; CHECK-NEXT: store i32 [[IV_NEXT]], ptr [[B]], align 4
+; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[IV_NEXT]], [[N:%.*]]
+; CHECK-NEXT: br i1 [[CMP]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK: exit:
+; CHECK-NEXT: [[ADD:%.*]] = add i32 [[A:%.*]], [[LOAD]]
+; CHECK-NEXT: ret i32 [[ADD]]
+;
+entry:
+ %load = load i32, ptr %b
+ %add = add i32 %a, %load
+ br label %loop
+
+loop:
+ %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
+ %iv.next = add i32 %iv, 1
+ store i32 %iv.next, ptr %b
+ %cmp = icmp slt i32 %iv.next, %N
+ br i1 %cmp, label %loop, label %exit
+
+exit:
+ ret i32 %add
+}
diff --git a/llvm/test/Transforms/LoopSimplify/ashr-crash.ll b/llvm/test/Transforms/LoopSimplify/ashr-crash.ll
index 85198eb19aad6..ae4f9adf8e16c 100644
--- a/llvm/test/Transforms/LoopSimplify/ashr-crash.ll
+++ b/llvm/test/Transforms/LoopSimplify/ashr-crash.ll
@@ -29,11 +29,11 @@ define void @foo() {
; CHECK-LABEL: define void @foo() {
; CHECK-NEXT: entry:
; CHECK-NEXT: store i32 0, ptr @d, align 4
-; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr @c, align 4
; CHECK-NEXT: br label [[FOR_COND1_PREHEADER:%.*]]
; CHECK: for.cond1.preheader:
; CHECK-NEXT: br label [[FOR_BODY3:%.*]]
; CHECK: for.body3:
+; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr @c, align 4
; CHECK-NEXT: store i32 1, ptr @a, align 4
; CHECK-NEXT: store i32 1, ptr @d, align 4
; CHECK-NEXT: [[CMP4_LE_LE_INV:%.*]] = icmp sgt i32 [[TMP0]], 0
|
nikic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think IndVarSimplify is a good place to perform this transform. We could probably do this in LICM, which has access to MemorySSA. (LICM already sinks instructions from the loop to the exits, so doing it fore the preheader would be a somewhat natural extension of the scope of the pass.)
It looks like the current patch miscompiles clang (stage2 build crashes on llvm-test-suite).
| } | ||
| if (isModified) | ||
| continue; | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment for clarity.
| exit: | ||
| ret i32 %add | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add at least one (two is better) test with noalias & alias_info metadata for loads and stores; and show whether load will sink or not depending on metadata.
I would also write these tests on a separate .ll file.
I agree that the whole logic of sinkUnusedInvariants can be removed from here and moved. I will try doing that in LICM, but LICM also hoists/sinks instructions from loop to preheader/exit block. So I have to add it at the end. I want to know if LoopSink.cpp would be a better place to move the logic to (it also has access to MemorySSA). It already sinks from preheader back to loop blocks. We could do the sinking to exit block there as well. |
LoopSink might be a better place, as the original intent of this code is to reduce the register pressure, and I believe moving it to LoopSink would solve that purpose. Having said that, it might have some (a few) performance impacts on the applications out there. |
|
LoopSink is a PGO pass. Assuming you don't want to drive this with profiling information, it's probably not a good fit for that reason. |
llvm/lib/Transforms/Scalar/LICM.cpp
Outdated
| if (auto *PN = dyn_cast<PHINode>(UserI)) { | ||
| unsigned OpIdx = | ||
| PHINode::getIncomingValueNumForOperand(U.getOperandNo()); | ||
| UseBB = PN->getIncomingBlock(OpIdx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can directly pass U, no need to call getIncomingValueNumForOperand().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If @VigneshwarJ is to do that, then he should have two patches;
first, for cleaning up what is already here like what you are mentioning.
second, move this sub pass to an appropriate location.
llvm/lib/Transforms/Scalar/LICM.cpp
Outdated
| // sink pre-header defs that are unused in-loop into the unique exit to reduce | ||
| // pressure. | ||
| Changed |= | ||
| sinkUnusedInvariantsFromPreheaderToExit(L, AA, &SafetyInfo, MSSAU, SE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this above forgetLoopDispositions() and the verification calls?
What live range? If the result is unused, how can it contribute to register pressure in the loop body? |
I believe "unused" here refers to "unused inside the loop, but used after the loop". |
|
Crash reproducer: ; RUN: opt -S -passes="loop-mssa(loop-simplifycfg,licm<no-allowspeculation>,loop-rotate,simple-loop-unswitch)" < %s
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
define void @_ZN4ncnnL8rnn_int8ERKNS_3MatERS0_iS2_PKfS2_S2_S5_S3_RKNS_6OptionE.omp_outlined(ptr %0) {
%2 = alloca i32, align 4
call void @llvm.lifetime.start.p0(ptr %2)
br label %3
3: ; preds = %5, %1
%4 = load i32, ptr null, align 4
br label %.preheader
.preheader: ; preds = %7, %3
%.049 = phi i32 [ %8, %7 ], [ 0, %3 ]
br i1 false, label %7, label %5
5: ; preds = %.preheader
%6 = load ptr, ptr %0, align 8
store float 0.000000e+00, ptr %6, align 4
br label %3
7: ; preds = %.preheader
%8 = add i32 0, 0
br label %.preheader
}
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(argmem: readwrite)
declare void @llvm.lifetime.start.p0(ptr captures(none)) #0
attributes #0 = { nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) } |
(edit) I see the crash now once I updated with main branch |
|
@nikic , That was due to a bad MSSA update, fixed that issue. |
|
Ping for review |
|
@nikic ping for review |
srpande
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some nitpick. Otherwise LGTM
|
@zyw-bot csmith-fuzz |
nikic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Worth noting that this kind of sinking can also increase register pressure in the loop. E.g. if you sink an instruction with two operands, and the operand instructions cannot be sunk. Then you'll need two registers instead of one in the loop.
Though from what I can tell, this does seem to be beneficial on average.
| ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py | ||
| ; RUN: opt < %s -passes=licm -verify-memoryssa -S | FileCheck %s | ||
| target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128" | ||
| target triple = "i386-apple-darwin10.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omit the triple if it's not needed. Otherwise the test may need REQUIRES.
| target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128" | ||
| target triple = "i386-apple-darwin10.0" | ||
|
|
||
| ; We make sinking here, Changed flag should be set properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ; We make sinking here, Changed flag should be set properly. | |
| ; We perform sinking here, Changed flag should be set properly. |
Unused loop invariant loads were not sunk from the preheader to the exit block, increasing live range. This commit moves the sinkUnusedInvariant logic from indvarsimplify to LICM also adds functionality to sink unused load that's not clobbered by the loop body.
Unused loop invariant loads were not sunk from the preheader to the exit block, increasing live range. This commit moves the sinkUnusedInvariant logic from indvarsimplify to LICM also adds functionality to sink unused load that's not clobbered by the loop body.
Unused loop invariant loads were not sunk from the preheader to the exit block, increasing live range. This commit moves the sinkUnusedInvariant logic from indvarsimplify to LICM also adds functionality to sink unused load that's not clobbered by the loop body.
Unused loop invariant loads were not sunk from the preheader to the exit block, increasing live range. This commit moves the sinkUnusedInvariant logic from indvarsimplify to LICM also adds functionality to sink unused load that's not clobbered by the loop body.
|
Hi @VigneshwarJ, I'm working on the downstream project and found the GPU performance for some GEMMs were regressed after this PR. I've added some useful information here #166671. Could you help take a look? |
|
@yzhang93 sure, I will look at the issue. |
Unused loop invariant loads were not sunk from preheader to exit block, increasing live range.
This commit moves the sinkUnusedInvariant logic from indvarsimplify to LICM and also adds functionality to sinks unused load that's not clobbered by loop body.