
Conversation

@adam-bzowski
Contributor

Implements type narrowing for LShr. The treatment is analogous to the type narrowing of UDiv. Since LShr is a relatively cheap instruction, the narrowing occurs only if all the users of the LShr instruction are already TruncInst, and the narrowing is carried out to the width of the largest TruncInst following the LShr instruction. Additionally, the function optimizes the cases where the result of the LShr instruction is guaranteed to be zero or poison.

@adam-bzowski requested a review from nikic as a code owner on December 11, 2024 at 16:13
@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Dec 11, 2024

@llvm/pr-subscribers-llvm-transforms

Author: None (adam-bzowski)

Changes

Implements type narrowing for LShr. The treatment is analogous to the type narrowing of UDiv. Since LShr is a relatively cheap instruction, the narrowing occurs only if all the users of the LShr instruction are already TruncInst, and the narrowing is carried out to the width of the largest TruncInst following the LShr instruction. Additionally, the function optimizes the cases where the result of the LShr instruction is guaranteed to be zero or poison.


Full diff: https://github.com/llvm/llvm-project/pull/119577.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp (+133)
  • (added) llvm/test/Transforms/CorrelatedValuePropagation/lshr-plus-instcombine.ll (+114)
  • (added) llvm/test/Transforms/CorrelatedValuePropagation/lshr.ll (+241)
diff --git a/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp b/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
index 8e74b8645fad9a..9b81adf77bdf52 100644
--- a/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
+++ b/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
@@ -40,6 +40,7 @@
 #include "llvm/Support/Casting.h"
 #include "llvm/Transforms/Utils/Local.h"
 #include <cassert>
+#include <limits>
 #include <optional>
 #include <utility>
 
@@ -60,6 +61,8 @@ STATISTIC(NumUDivURemsNarrowed,
           "Number of udivs/urems whose width was decreased");
 STATISTIC(NumAShrsConverted, "Number of ashr converted to lshr");
 STATISTIC(NumAShrsRemoved, "Number of ashr removed");
+STATISTIC(NumLShrsRemoved, "Number of lshr removed");
+STATISTIC(NumLShrsNarrowed, "Number of lshrs whose width was decreased");
 STATISTIC(NumSRems,     "Number of srem converted to urem");
 STATISTIC(NumSExt,      "Number of sext converted to zext");
 STATISTIC(NumSIToFP,    "Number of sitofp converted to uitofp");
@@ -93,6 +96,10 @@ STATISTIC(NumUDivURemsNarrowedExpanded,
           "Number of bound udiv's/urem's expanded");
 STATISTIC(NumNNeg, "Number of zext/uitofp non-negative deductions");
 
+static cl::opt<bool>
+    NarrowLShr("correlated-propagation-narrow-lshr", cl::init(true), cl::Hidden,
+               cl::desc("Enable narrowing of LShr instructions."));
+
 static Constant *getConstantAt(Value *V, Instruction *At, LazyValueInfo *LVI) {
   if (Constant *C = LVI->getConstant(V, At))
     return C;
@@ -1067,6 +1074,124 @@ static bool processSDivOrSRem(BinaryOperator *Instr, LazyValueInfo *LVI) {
   return narrowSDivOrSRem(Instr, LCR, RCR);
 }
 
+/**
+ * @brief Narrows type of the LShr instruction if the range of the possible
+ * values fits into a smaller type. Since LShr is a relatively cheap
+ * instruction, the narrowing should not happen too frequently. Performance
+ * testing and compatibility with other passes indicate that the narrowing is
+ * beneficial under the following circumstances:
+ *
+ * i) the narrowing occurs only if all the users of the LShr instruction are
+ * already TruncInst;
+ *
+ * ii) the narrowing is carried out to the largest TruncInst following the LShr
+ * instruction.
+ *
+ * Additionally, the function optimizes the cases where the result of the LShr
+ * instruction is guaranteed to vanish or be equal to poison.
+ */
+static bool narrowLShr(BinaryOperator *LShr, LazyValueInfo *LVI) {
+
+  IntegerType *RetTy = dyn_cast<IntegerType>(LShr->getType());
+  if (!RetTy)
+    return false;
+
+  ConstantRange ArgRange = LVI->getConstantRangeAtUse(LShr->getOperandUse(0),
+                                                      /*UndefAllowed*/ false);
+  ConstantRange ShiftRange = LVI->getConstantRangeAtUse(LShr->getOperandUse(1),
+                                                        /*UndefAllowed*/ false);
+
+  unsigned OrigWidth = RetTy->getScalarSizeInBits();
+  unsigned MaxActiveBitsInArg = ArgRange.getActiveBits();
+  uint64_t MinShiftValue64 = ShiftRange.getUnsignedMin().getZExtValue();
+  unsigned MinShiftValue =
+      MinShiftValue64 < std::numeric_limits<unsigned>::max()
+          ? static_cast<unsigned>(MinShiftValue64)
+          : std::numeric_limits<unsigned>::max();
+
+  // First we deal with the cases where the result is guaranteed to vanish or be
+  // equal to poison.
+
+  auto replaceWith = [&](Value *V) -> void {
+    LShr->replaceAllUsesWith(V);
+    LShr->eraseFromParent();
+    ++NumLShrsRemoved;
+  };
+
+  // If the shift is larger or equal to the bit width of the argument,
+  // the instruction returns a poison value.
+  if (MinShiftValue >= OrigWidth) {
+    replaceWith(PoisonValue::get(RetTy));
+    return true;
+  }
+
+  // If we are guaranteed to shift away all bits,
+  // we replace the shift by the null value.
+  // We should not apply the optimization if LShr is exact,
+  // as the result may be poison.
+  if (!LShr->isExact() && MinShiftValue >= MaxActiveBitsInArg) {
+    replaceWith(Constant::getNullValue(RetTy));
+    return true;
+  }
+
+  // That's how many bits we need.
+  unsigned MaxActiveBits =
+      std::max(MaxActiveBitsInArg, ShiftRange.getActiveBits());
+
+  // We could do better, but is it worth it?
+  // With the first argument being the n-bit integer, we may limit the value of
+  // the second argument to be less than n, as larger shifts would lead to a
+  // vanishing result or poison. Thus the number of bits in the second argument
+  // is limited by Log2(n). Unfortunately, this would require an introduction of
+  // a select instruction (or llvm.min) to make sure every argument larger than
+  // n is mapped to n and not just truncated. We do not implement it here.
+
+  // What is the smallest bit width that can accommodate the entire value ranges
+  // of both of the operands? Don't shrink below 8 bits wide.
+  unsigned NewWidth = std::max<unsigned>(PowerOf2Ceil(MaxActiveBits), 8);
+
+  // NewWidth might be greater than OrigWidth if OrigWidth is not a power of
+  // two.
+  if (NewWidth >= OrigWidth)
+    return false;
+
+  // This is the time to check if all the users are TruncInst
+  // and to figure out what the largest user is.
+  for (User *U : LShr->users()) {
+    if (TruncInst *TI = dyn_cast<TruncInst>(U)) {
+      NewWidth = std::max(NewWidth, TI->getDestTy()->getScalarSizeInBits());
+    } else {
+      return false;
+    }
+  }
+
+  // We are ready to truncate.
+  IRBuilder<> B(LShr);
+  Type *TruncTy = RetTy->getWithNewBitWidth(NewWidth);
+  Value *ArgTrunc = B.CreateTruncOrBitCast(LShr->getOperand(0), TruncTy,
+                                           LShr->getName() + ".arg.trunc");
+  Value *ShiftTrunc = B.CreateTruncOrBitCast(LShr->getOperand(1), TruncTy,
+                                             LShr->getName() + ".shift.trunc");
+  Value *LShrTrunc =
+      B.CreateBinOp(LShr->getOpcode(), ArgTrunc, ShiftTrunc, LShr->getName());
+  Value *Zext = B.CreateZExt(LShrTrunc, RetTy, LShr->getName() + ".zext");
+
+  // Should always cast, but better safe than sorry.
+  if (BinaryOperator *LShrTruncBO = dyn_cast<BinaryOperator>(LShrTrunc)) {
+    LShrTruncBO->setDebugLoc(LShr->getDebugLoc());
+    LShrTruncBO->setIsExact(LShr->isExact());
+  }
+  LShr->replaceAllUsesWith(Zext);
+  LShr->eraseFromParent();
+
+  ++NumLShrsNarrowed;
+  return true;
+}
+
+static bool processLShr(BinaryOperator *SDI, LazyValueInfo *LVI) {
+  return NarrowLShr ? narrowLShr(SDI, LVI) : false;
+}
+
 static bool processAShr(BinaryOperator *SDI, LazyValueInfo *LVI) {
   ConstantRange LRange =
       LVI->getConstantRangeAtUse(SDI->getOperandUse(0), /*UndefAllowed*/ false);
@@ -1093,6 +1218,11 @@ static bool processAShr(BinaryOperator *SDI, LazyValueInfo *LVI) {
   SDI->replaceAllUsesWith(BO);
   SDI->eraseFromParent();
 
+  // Check if the new LShr can be narrowed.
+  if (NarrowLShr) {
+    narrowLShr(BO, LVI);
+  }
+
   return true;
 }
 
@@ -1254,6 +1384,9 @@ static bool runImpl(Function &F, LazyValueInfo *LVI, DominatorTree *DT,
       case Instruction::AShr:
         BBChanged |= processAShr(cast<BinaryOperator>(&II), LVI);
         break;
+      case Instruction::LShr:
+        BBChanged |= processLShr(cast<BinaryOperator>(&II), LVI);
+        break;
       case Instruction::SExt:
         BBChanged |= processSExt(cast<SExtInst>(&II), LVI);
         break;
diff --git a/llvm/test/Transforms/CorrelatedValuePropagation/lshr-plus-instcombine.ll b/llvm/test/Transforms/CorrelatedValuePropagation/lshr-plus-instcombine.ll
new file mode 100644
index 00000000000000..3646ff9aee4783
--- /dev/null
+++ b/llvm/test/Transforms/CorrelatedValuePropagation/lshr-plus-instcombine.ll
@@ -0,0 +1,114 @@
+; RUN: opt < %s -passes="correlated-propagation,instcombine" -S | FileCheck %s
+
+; The tests below are the same as in lshr.ll
+; Here we test whether the CorrelatedValuePropagation pass 
+; composed with InstCombinePass produces the expected optimizations.
+
+; CHECK-LABEL: @trunc_test1
+; CHECK-NEXT: [[A1:%.*]] = lshr i32 [[A:%.*]], 16
+; CHECK-NEXT: [[CARG:%.*]] = trunc nuw i32 [[A1]] to i16
+; CHECK-NEXT: [[CSHIFT:%.*]] = trunc i32 [[B:%.*]] to i16
+; CHECK-NEXT: [[C1:%.*]] = lshr i16 [[CARG]], [[CSHIFT]]
+; CHECK-NEXT: ret i16 [[C1]]
+
+define i16 @trunc_test1(i32 %a, i32 %b) {
+  %a.eff.trunc = lshr i32 %a, 16
+  %b.eff.trunc = and i32 %b, 65535
+  %c = lshr i32 %a.eff.trunc, %b.eff.trunc
+  %c.trunc = trunc i32 %c to i16
+  ret i16 %c.trunc
+}
+
+; CHECK-LABEL: @trunc_test2
+; CHECK-NEXT: [[C1:%.*]] = lshr i16 [[A:%.*]], 2
+; CHECK-NEXT: ret i16 [[C1]]
+
+define i16 @trunc_test2(i16 %a) {
+  %a.ext = zext i16 %a to i32
+  %c = lshr i32 %a.ext, 2
+  %c.trunc = trunc i32 %c to i16
+  ret i16 %c.trunc
+}
+
+; CHECK-LABEL: @trunc_test3
+; CHECK-NEXT: [[B:%.*]] = lshr i16 [[A:%.*]], 2
+; CHECK-NEXT: [[C:%.*]] = add nuw nsw i16 [[B]], 123
+; CHECK-NEXT: ret i16 [[C]]
+
+define i16 @trunc_test3(i16 %a) {
+  %a.ext = zext i16 %a to i32
+  %b = lshr i32 %a.ext, 2
+  %c = add i32 %b, 123
+  %c.trunc = trunc i32 %c to i16
+  ret i16 %c.trunc
+}
+
+; CHECK-LABEL: @trunc_test4
+; CHECK-NEXT: [[A1:%.*]] = udiv i32 [[A:%.*]], 17000000
+; CHECK-NEXT: [[B:%.*]] = trunc nuw nsw i32 [[A1]] to i16
+; CHECK-NEXT: [[B1:%.*]] = lshr i16 [[B]], 2
+; CHECK-NEXT: ret i16 [[B1]]
+
+define i16 @trunc_test4(i32 %a) {
+  %a.eff.trunc = udiv i32 %a, 17000000  ; larger than 2^24
+  %b = lshr i32 %a.eff.trunc, 2 
+  %b.trunc.1 = trunc i32 %b to i16
+  %b.trunc.2 = trunc i32 %b to i8
+  ret i16 %b.trunc.1
+}
+
+; CHECK-LABEL: @trunc_test5
+; CHECK-NEXT: [[A1:%.*]] = udiv i32 [[A:%.*]], 17000000
+; CHECK-NEXT: [[B:%.*]] = lshr i32 [[A1]], 2
+; CHECK-NEXT: [[C:%.*]] = add nuw nsw i32 [[B]], 123
+; CHECK-NEXT: ret i32 [[C]]
+
+define i32 @trunc_test5(i32 %a) {
+  %a.eff.trunc = udiv i32 %a, 17000000  ; larger than 2^24
+  %b = lshr i32 %a.eff.trunc, 2 
+  %b.trunc.1 = trunc i32 %b to i16
+  %b.trunc.2 = trunc i32 %b to i8
+  %c = add i32 %b, 123
+  ret i32 %c
+}
+
+; CHECK-LABEL: @zero_test1
+; CHECK-NEXT: ret i32 poison
+  
+define i32 @zero_test1(i32 %a) {
+  %b = lshr i32 %a, 32
+  %c = add i32 %b, 123
+  ret i32 %c
+}
+
+; CHECK-LABEL: @zero_test2
+; CHECK-NEXT: ret i32 poison
+
+define i32 @zero_test2(i32 %a, i32 %b) {
+  %b.large = add nuw nsw i32 %b, 50
+  %c = lshr i32 %a, %b.large
+  %d = add i32 %c, 123
+  ret i32 %d
+}
+
+; CHECK-LABEL: @zero_test3
+; CHECK-NEXT: ret i32 123
+
+define i32 @zero_test3(i32 %a, i32 %b) {
+  %a.small = lshr i32 %a, 16
+  %b.large = add nuw nsw i32 %b, 20
+  %c = lshr i32 %a.small, %b.large
+  %d = add i32 %c, 123
+  ret i32 %d
+}
+
+; CHECK-LABEL: @zero_test4
+; CHECK-NEXT: ret i32 123
+
+define i32 @zero_test4(i32 %a, i32 %b) {
+  %a.small = lshr i32 %a, 16
+  %b.large = add nuw nsw i32 %b, 20
+  %c = lshr exact i32 %a.small, %b.large
+  %d = add i32 %c, 123
+  ret i32 %d
+}
diff --git a/llvm/test/Transforms/CorrelatedValuePropagation/lshr.ll b/llvm/test/Transforms/CorrelatedValuePropagation/lshr.ll
new file mode 100644
index 00000000000000..3f129a9bd2e411
--- /dev/null
+++ b/llvm/test/Transforms/CorrelatedValuePropagation/lshr.ll
@@ -0,0 +1,241 @@
+; RUN: opt < %s -passes=correlated-propagation -S | FileCheck %s
+
+; Tests: test_nop and tests 1 through 6 are taken from udiv.ll
+; with udiv replaced by lshr (plus some tweaks).
+; In those tests the lshr instruction has no users.
+
+; CHECK-LABEL: @test_nop
+define void @test_nop(i32 %n) {
+; CHECK: lshr i32 %n, 2
+  %shr = lshr i32 %n, 2
+  ret void
+}
+
+; CHECK-LABEL: @test1(
+define void @test1(i32 %n) {
+entry:
+  %cmp = icmp ule i32 %n, 65535
+  br i1 %cmp, label %bb, label %exit
+
+bb:
+; CHECK: lshr i16
+  %shr = lshr i32 %n, 2
+  br label %exit
+
+exit:
+  ret void
+}
+
+; CHECK-LABEL: @test2(
+define void @test2(i32 %n) {
+entry:
+  %cmp = icmp ule i32 %n, 65536
+  br i1 %cmp, label %bb, label %exit
+
+bb:
+; CHECK: lshr i32
+  %shr = lshr i32 %n, 2
+  br label %exit
+
+exit:
+  ret void
+}
+
+; CHECK-LABEL: @test3(
+define void @test3(i32 %m, i32 %n) {
+entry:
+  %cmp1 = icmp ult i32 %m, 65535
+  %cmp2 = icmp ult i32 %n, 65535
+  %cmp = and i1 %cmp1, %cmp2
+  br i1 %cmp, label %bb, label %exit
+
+bb:
+; CHECK: lshr i16
+  %shr = lshr i32 %m, %n
+  br label %exit
+
+exit:
+  ret void
+}
+
+; CHECK-LABEL: @test4(
+define void @test4(i32 %m, i32 %n) {
+entry:
+  %cmp1 = icmp ult i32 %m, 65535
+  %cmp2 = icmp ule i32 %n, 65536
+  %cmp = and i1 %cmp1, %cmp2
+  br i1 %cmp, label %bb, label %exit
+
+bb:
+; CHECK: lshr i32
+  %shr = lshr i32 %m, %n
+  br label %exit
+
+exit:
+  ret void
+}
+
+; CHECK-LABEL: @test5
+define void @test5(i32 %n) {
+  %trunc = and i32 %n, 65535
+  ; CHECK: lshr i16
+  %shr = lshr i32 %trunc, 2
+  ret void
+}
+
+; CHECK-LABEL: @test6
+define void @test6(i32 %n) {
+entry:
+  %cmp = icmp ule i32 %n, 255
+  br i1 %cmp, label %bb, label %exit
+
+bb:
+; CHECK: lshr i8
+  %shr = lshr i32 %n, 2
+  br label %exit
+
+exit:
+  ret void
+}
+
+; The tests below check whether the narrowing occurs only if the appropriate
+; `trunc` instructions follow.
+;
+; Just as in udiv.ll, additional zext and trunc instructions appear. 
+; They are eventually recombined by InstCombinePass 
+; that follows in the pipeline.
+
+; CHECK-LABEL: @trunc_test1
+; CHECK-NEXT: [[A1:%.*]] = lshr i32 [[A:%.*]], 16
+; CHECK-NEXT: [[B1:%.*]] = and i32 [[B:%.*]], 65535
+; CHECK-NEXT: [[A2:%.*]] = trunc i32 [[A1]] to i16
+; CHECK-NEXT: [[B2:%.*]] = trunc i32 [[B1]] to i16
+; CHECK-NEXT: [[C1:%.*]] = lshr i16 [[A2]], [[B2]]
+; CHECK-NEXT: [[C2:%.*]] = zext i16 [[C1]] to i32
+; CHECK-NEXT: [[C3:%.*]] = trunc i32 [[C2]] to i16
+; CHECK-NEXT: ret i16 [[C3]]
+
+define i16 @trunc_test1(i32 %a, i32 %b) {
+  %a.eff.trunc = lshr i32 %a, 16
+  %b.eff.trunc = and i32 %b, 65535
+  %c = lshr i32 %a.eff.trunc, %b.eff.trunc
+  %c.trunc = trunc i32 %c to i16
+  ret i16 %c.trunc
+}
+
+; CHECK-LABEL: @trunc_test2
+; CHECK-NEXT: [[A1:%.*]] = zext i16 [[A:%.*]] to i32
+; CHECK-NEXT: [[A2:%.*]] = trunc i32 [[A1]] to i16
+; CHECK-NEXT: [[C1:%.*]] = lshr i16 [[A2]], 2
+; CHECK-NEXT: [[C2:%.*]] = zext i16 [[C1]] to i32
+; CHECK-NEXT: [[C3:%.*]] = trunc i32 [[C2]] to i16
+; CHECK-NEXT: ret i16 [[C3]]
+
+define i16 @trunc_test2(i16 %a) {
+  %a.ext = zext i16 %a to i32
+  %c = lshr i32 %a.ext, 2
+  %c.trunc = trunc i32 %c to i16
+  ret i16 %c.trunc
+}
+
+; CHECK-LABEL: @trunc_test3
+; CHECK-NEXT: [[A1:%.*]] = zext i16 [[A:%.*]] to i32
+; CHECK-NEXT: [[B:%.*]] = lshr i32 [[A1]], 2
+; CHECK-NEXT: [[C0:%.*]] = add nuw nsw i32 [[B]], 123
+; CHECK-NEXT: [[C1:%.*]] = trunc i32 [[C0]] to i16
+; CHECK-NEXT: ret i16 [[C1]]
+
+define i16 @trunc_test3(i16 %a) {
+  %a.ext = zext i16 %a to i32
+  %b = lshr i32 %a.ext, 2
+  %c = add i32 %b, 123
+  %c.trunc = trunc i32 %c to i16
+  ret i16 %c.trunc
+}
+
+; CHECK-LABEL: @trunc_test4
+; CHECK-NEXT: [[A1:%.*]] = udiv i32 [[A:%.*]], 17000000
+; CHECK-NEXT: [[B0:%.*]] = trunc i32 [[A1]] to i16
+; CHECK-NEXT: [[B1:%.*]] = lshr i16 [[B0]], 2
+; CHECK-NEXT: [[B2:%.*]] = zext i16 [[B1]] to i32
+; CHECK-NEXT: [[C1:%.*]] = trunc i32 [[B2]] to i16
+; CHECK-NEXT: [[C2:%.*]] = trunc i32 [[B2]] to i8
+; CHECK-NEXT: ret i16 [[C1]]
+
+define i16 @trunc_test4(i32 %a) {
+  %a.eff.trunc = udiv i32 %a, 17000000  ; larger than 2^24
+  %b = lshr i32 %a.eff.trunc, 2 
+  %b.trunc.1 = trunc i32 %b to i16
+  %b.trunc.2 = trunc i32 %b to i8
+  ret i16 %b.trunc.1
+}
+
+; CHECK-LABEL: @trunc_test5
+; CHECK-NEXT: [[A1:%.*]] = udiv i32 [[A:%.*]], 17000000
+; CHECK-NEXT: [[B:%.*]] = lshr i32 [[A1]], 2
+; CHECK-NEXT: [[B1:%.*]] = trunc i32 [[B]] to i16
+; CHECK-NEXT: [[B2:%.*]] = trunc i32 [[B]] to i8
+; CHECK-NEXT: [[C:%.*]] = add nuw nsw i32 [[B]], 123
+; CHECK-NEXT: ret i32 [[C]]
+
+define i32 @trunc_test5(i32 %a) {
+  %a.eff.trunc = udiv i32 %a, 17000000  ; larger than 2^24
+  %b = lshr i32 %a.eff.trunc, 2 
+  %b.trunc.1 = trunc i32 %b to i16
+  %b.trunc.2 = trunc i32 %b to i8
+  %c = add i32 %b, 123
+  ret i32 %c
+}
+
+; Test cases where lshr simplifies to zero or poison.
+
+; CHECK-LABEL: @zero_test1
+; CHECK-NEXT: [[C:%.*]] = add i32 poison, 123
+; CHECK-NEXT: ret i32 [[C]]
+  
+define i32 @zero_test1(i32 %a) {
+  %b = lshr i32 %a, 32
+  %c = add i32 %b, 123
+  ret i32 %c
+}
+
+; CHECK-LABEL: @zero_test2
+; CHECK-NEXT: [[B1:%.*]] = add nuw nsw i32 [[B:%.*]], 50
+; CHECK-NEXT: [[D:%.*]] = add i32 poison, 123
+; CHECK-NEXT: ret i32 [[D]]
+
+define i32 @zero_test2(i32 %a, i32 %b) {
+  %b.large = add nuw nsw i32 %b, 50
+  %c = lshr i32 %a, %b.large
+  %d = add i32 %c, 123
+  ret i32 %d
+}
+
+; CHECK-LABEL: @zero_test3
+; CHECK-NEXT: [[A1:%.*]] = lshr i32 [[A:%.*]], 16
+; CHECK-NEXT: [[B1:%.*]] = add nuw nsw i32 [[B:%.*]], 20
+; CHECK-NEXT: [[D:%.*]] = add nuw nsw i32 0, 123
+; CHECK-NEXT: ret i32 123
+
+define i32 @zero_test3(i32 %a, i32 %b) {
+  %a.small = lshr i32 %a, 16
+  %b.large = add nuw nsw i32 %b, 20
+  %c = lshr i32 %a.small, %b.large
+  %d = add i32 %c, 123
+  ret i32 %d
+}
+
+; CHECK-LABEL: @zero_test4
+; CHECK-NEXT: [[A1:%.*]] = lshr i32 [[A:%.*]], 16
+; CHECK-NEXT: [[B1:%.*]] = add nuw nsw i32 [[B:%.*]], 20
+; CHECK-NEXT: [[C:%.*]] = lshr exact i32 [[A1]], [[B1]]
+; CHECK-NEXT: [[D:%.*]] = add nuw nsw i32 [[C]], 123
+; CHECK-NEXT: ret i32 123
+
+define i32 @zero_test4(i32 %a, i32 %b) {
+  %a.small = lshr i32 %a, 16
+  %b.large = add nuw nsw i32 %b, 20
+  %c = lshr exact i32 %a.small, %b.large
+  %d = add i32 %c, 123
+  ret i32 %d
+}

Member

@dtcxzyw left a comment

It is incorrect. You should check that the shift amount is always smaller than NewWidth.


// What is the smallest bit width that can accommodate the entire value ranges
// of both of the operands? Don't shrink below 8 bits wide.
unsigned NewWidth = std::max<unsigned>(PowerOf2Ceil(MaxActiveBits), 8);
Member

We should avoid narrowing the type to an illegal type. See also InstCombinerImpl::shouldChangeType.

Contributor Author

This part is taken from narrowUDivOrURem and narrowSDivOrSRem. It does look suspicious, though. I can correct it in all three places or leave it.

Contributor Author

I fixed the types in narrowUDivOrURem, narrowSDivOrSRem, and narrowLShr. Now the narrowing can result only in i8, i16, or i32.
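
For illustration, here is a minimal sketch of the kind of width check being discussed, assuming a DataLayout-based approach in the spirit of InstCombinerImpl::shouldChangeType; the helper name and its placement are hypothetical, not part of the patch:

// Hypothetical helper: only narrow to a width that DataLayout declares a
// legal integer type for the target, and never below 8 bits. This mirrors
// the intent of InstCombinerImpl::shouldChangeType without reproducing it.
static bool isDesirableNarrowWidth(const DataLayout &DL, unsigned NewWidth,
                                   unsigned OrigWidth) {
  if (NewWidth >= OrigWidth)
    return false; // Not a narrowing at all.
  if (NewWidth < 8)
    return false; // Avoid sub-byte integer types.
  // Accept only widths the target lists as legal (e.g. 8/16/32/64 on x86-64).
  return DL.isLegalInteger(NewWidth);
}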

; CHECK-NEXT: [[C1:%.*]] = lshr i16 [[CARG]], [[CSHIFT]]
; CHECK-NEXT: ret i16 [[C1]]

define i16 @trunc_test1(i32 %a, i32 %b) {
Member

Can you demonstrate that this patch will improve the codegen? https://godbolt.org/z/hbP9qrsxv

BTW this transformation is incorrect: https://alive2.llvm.org/ce/z/j2w5Wa

Contributor Author

  1. The original motivation is a subtle difference between division and right shift. Due to the lack of narrowing for LShr, UDiv may produce better code; here's an example:

https://godbolt.org/z/7EPnj9a8d

With the proposed patch the two functions generate identical (better) code.

  2. The incorrect transformation was fixed by checking that the maximal shift is smaller than the new bit width; a sketch of the check is shown below.
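
A minimal sketch of what such a check could look like inside narrowLShr, assuming ShiftRange is the ConstantRange already computed for the shift operand (illustrative only; the exact code in the updated patch may differ):

// In the narrowed type, a shift amount >= NewWidth would turn a well-defined
// result (zero at the original width) into poison, so bail out unless the
// shift amount is provably smaller than the new width.
APInt MaxShiftValue = ShiftRange.getUnsignedMax();
if (MaxShiftValue.uge(NewWidth))
  return false;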

; CHECK-NEXT: [[C1:%.*]] = lshr i16 [[A:%.*]], 2
; CHECK-NEXT: ret i16 [[C1]]

define i16 @trunc_test2(i16 %a) {
Member

It has been handled by InstCombine (SimplifyDemandedBits?): https://godbolt.org/z/143fs31s4

Contributor

@nikic left a comment

I don't think CVP should be doing this transform. This looks like the job of TruncInstCombine in AggressiveInstCombine?

@adam-bzowski
Contributor Author

It is incorrect. You should check that the shift amount is always smaller than NewWidth.

Good point. Fixed that and added some LIT tests to check.

@adam-bzowski
Contributor Author

I don't think CVP should be doing this transform. This looks like the job of TruncInstCombine in AggressiveInstCombine?

I don't think it would work. TruncInstCombine doesn't use LazyValueInfo, which is essential here. That's why it fits better in CVP, which already deals with similar narrowings.

@nikic
Contributor

nikic commented Dec 16, 2024

It doesn't use LVI, but it does use KnownBits. Why do you need LVI?

@nikic
Contributor

nikic commented Dec 16, 2024

Generally, can you share what your actual motivating case here is?

@dtcxzyw
Member

dtcxzyw commented Dec 16, 2024

Generally, can you share what your actual motivating case here is?

CVP narrows UDiv but not LShr.

@adam-bzowski
Contributor Author

Generally, can you share what your actual motivating case here is?

Hi, thanks for the comments. The original motivation is as follows. Consider the simple code here:
https://godbolt.org/z/GzGsjbaG4
In principle, one would expect that both sides result in the same IR, as the division by 128 is exchanged for the right shift by 7. But, as you can see, the division ends up as an lshr on i16, while the original lshr stays on i32 plus unnecessary zext and trunc. This may not have much impact for such simple code, but when the code is wrapped in a loop, the vectorizer produces worse code, as you can see here:
https://godbolt.org/z/7EPnj9a8d

I was looking at this issue and realized that the essential difference between the two cases is how CVP treats udiv and lshr. Since lshr is a relatively cheap instruction, we do not want to narrow it too aggressively. We want to avoid the situation where the gain from the narrowing is overshadowed by the inserted zexts and truncs. That's why I added the condition that lshr must be followed by truncs. With this constraint the resulting code after the narrowing does seem to be slightly faster. Please let me know what you think!
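
For context, a minimal C++ reconstruction of the kind of pattern being discussed; the exact code behind the godbolt links is not reproduced here, so the function names and the clamp bound are illustrative assumptions:

#include <algorithm>
#include <cstdint>

// The value is clamped, so the optimizer can prove it fits in 16 bits.
uint32_t by_div(int32_t x) {
  uint32_t v = std::clamp(x, 0, 65535);
  return v / 128; // Division: CVP narrows the udiv, ending in 16-bit ops.
}

uint32_t by_shift(int32_t x) {
  uint32_t v = std::clamp(x, 0, 65535);
  return v >> 7; // Equivalent shift: without this patch the lshr stays i32.
}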

@nikic
Contributor

nikic commented Dec 16, 2024

Thanks for the example. The important distinction here is that we aggressively narrow integer divs because on many CPUs the latency of integer divisions can be significantly smaller for smaller bit widths. This is not the case for lshr. For lshr, this optimization only really makes sense insofar as it removes unnecessary ext+trunc patterns. And this is not specific to lshr, you can narrow most integer ops with various constraints.

We already have various optimizations that do this, e.g. see canEvaluateTruncated in InstCombine and TruncInstCombine in AggressiveInstCombine. I think the reason they don't work for your sample is that computeKnownBits probably fails to determine that the top bits are zero in your sample. Probably it's just a matter of adding support for isSignedMinMaxIntrinsicClamp in computeKnownBits. (If we didn't fold the ashr to lshr it would probably also work via ComputeNumSignBits, which does handle the clamp pattern.)
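
As a rough illustration of the KnownBits route described above (this is not the actual TruncInstCombine or computeKnownBits change; the helper name and usage are assumptions), a narrowing decision in that framework is driven by a query of this shape:

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/Support/KnownBits.h"

// Sketch: a value can be evaluated in an N-bit type if ValueTracking can
// prove that all bits above the low N are zero.
static bool fitsInNBits(const llvm::Value *V, unsigned N,
                        const llvm::DataLayout &DL) {
  llvm::KnownBits Known = llvm::computeKnownBits(V, DL);
  // Fits iff at least (BitWidth - N) leading bits are known to be zero.
  return Known.countMinLeadingZeros() + N >= Known.getBitWidth();
}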

@adam-bzowski
Contributor Author

We already have various optimizations that do this, e.g. see canEvaluateTruncated in InstCombine and TruncInstCombine in AggressiveInstCombine. I think the reason they don't work for your sample is that computeKnownBits probably fails to determine that the top bits are zero in your sample. Probably it's just a matter of adding support for isSignedMinMaxIntrinsicClamp in computeKnownBits. (If we didn't fold the ashr to lshr it would probably also work via ComputeNumSignBits, which does handle the clamp pattern.)

I'll have a look there tomorrow and see if I can implement the narrowing in TruncInstCombine. I agree that it's not specific to lshr. Thanks!

@adam-bzowski
Contributor Author

I looked at TruncInstCombine as suggested and, indeed, I can make very small changes there (6 lines of code added), which are beneficial even in a wider context than discussed here. I opened a new PR, here:
#120576
I would appreciate any comments and thanks for your help.
