Conversation

@artagnon
Contributor

@artagnon artagnon commented Jul 19, 2024

Add a dry-run routine that computes a conservative estimate of the number of rows and columns that the transform will require, and fail early if the estimates exceed the upper bounds. This patch has almost no overhead, and improves compile-time on some benchmarks significantly.

-- 8< --
Results from LLVM Compile Time Tracker: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=50c2e95b9a66395d5f6b62a99e67fec1a874f8f5&stat=instructions%3Au.

Add a dry-run routine that computes a conservative estimate of the
number of rows and columns that the transform will require, and fail
early if the estimates exceed the upper bounds. This patch has a small
overhead, but improves compile-time on one benchmark significantly. The
overhead will be compensated for in a follow-up patch, where
ConstraintSystem is ported to use a Matrix data structure, performing
the full allocation ahead-of-time using these estimates.
@llvmbot
Member

llvmbot commented Jul 19, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

Add a dry-run routine that computes a conservative estimate of the number of rows and columns that the transform will require, and fail early if the estimates exceed the upper bounds. This patch has a small overhead, but improves compile-time on one benchmark significantly.

-- 8< --
Results from LLVM Compile Time Tracker: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=a4b9f80e85d0fdd0e0e6390eb4548c88aa3fd35e&stat=instructions:u.


Full diff: https://github.com/llvm/llvm-project/pull/99670.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/ConstraintElimination.cpp (+165-10)
  • (renamed) llvm/test/Transforms/ConstraintElimination/max-row-column-limit.ll (+9-8)
diff --git a/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp b/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
index c31173879af1e..369a6daa4d970 100644
--- a/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
+++ b/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
@@ -40,7 +40,6 @@
 #include "llvm/Transforms/Utils/Cloning.h"
 #include "llvm/Transforms/Utils/ValueMapper.h"
 
-#include <cmath>
 #include <optional>
 #include <string>
 
@@ -57,6 +56,10 @@ static cl::opt<unsigned>
     MaxRows("constraint-elimination-max-rows", cl::init(500), cl::Hidden,
             cl::desc("Maximum number of rows to keep in constraint system"));
 
+static cl::opt<unsigned> MaxColumns(
+    "constraint-elimination-max-cols", cl::init(50), cl::Hidden,
+    cl::desc("Maximum number of columns to keep in constraint system"));
+
 static cl::opt<bool> DumpReproducers(
     "constraint-elimination-dump-reproducers", cl::init(false), cl::Hidden,
     cl::desc("Dump IR to reproduce successful transformations."));
@@ -303,6 +306,7 @@ class ConstraintInfo {
   void popLastNVariables(bool Signed, unsigned N) {
     getCS(Signed).popLastNVariables(N);
   }
+  const DataLayout &getDataLayout() const { return DL; }
 
   bool doesHold(CmpInst::Predicate Pred, Value *A, Value *B) const;
 
@@ -1491,7 +1495,7 @@ removeEntryFromStack(const StackEntry &E, ConstraintInfo &Info,
 /// Check if either the first condition of an AND or OR is implied by the
 /// (negated in case of OR) second condition or vice versa.
 static bool checkOrAndOpImpliedByOther(
-    FactOrCheck &CB, ConstraintInfo &Info, Module *ReproducerModule,
+    const FactOrCheck &CB, ConstraintInfo &Info, Module *ReproducerModule,
     SmallVectorImpl<ReproducerEntry> &ReproducerCondStack,
     SmallVectorImpl<StackEntry> &DFSInStack) {
 
@@ -1671,18 +1675,91 @@ tryToSimplifyOverflowMath(IntrinsicInst *II, ConstraintInfo &Info,
   return Changed;
 }
 
-static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,
-                                 ScalarEvolution &SE,
-                                 OptimizationRemarkEmitter &ORE) {
-  bool Changed = false;
+/// Performs a dry run of AddFact, computing a conservative estimate of the
+/// number of new variables introduced.
+static void dryRunAddFact(CmpInst::Predicate Pred, Value *A, Value *B,
+                          const ConstraintInfo &Info, unsigned &EstimatedRowsA,
+                          unsigned &EstimatedRowsB,
+                          unsigned &EstimatedColumns) {
+  auto UpdateEstimate = [&Info, &EstimatedRowsA, &EstimatedRowsB,
+                         &EstimatedColumns](CmpInst::Predicate Pred, Value *A,
+                                            Value *B) {
+    SmallVector<Value *> NewVars;
+    auto R = Info.getConstraint(Pred, A, B, NewVars);
+
+    // We offset it by 1 due to logic in addFact.
+    unsigned NewEstimate =
+        count_if(R.Coefficients, [](int64_t C) { return C != 0; }) + 1;
+
+    EstimatedColumns = std::max(EstimatedColumns, NewEstimate);
+    if (R.IsSigned)
+      ++EstimatedRowsA;
+    else
+      ++EstimatedRowsB;
+  };
+
+  UpdateEstimate(Pred, A, B);
+
+  // What follows is a dry-run of transferToOtherSystem.
+  auto IsKnownNonNegative = [&Info](Value *V) {
+    return Info.doesHold(CmpInst::ICMP_SGE, V,
+                         ConstantInt::get(V->getType(), 0)) ||
+           isKnownNonNegative(V, Info.getDataLayout(),
+                              MaxAnalysisRecursionDepth - 1);
+  };
+
+  if (!A->getType()->isIntegerTy())
+    return;
+
+  switch (Pred) {
+  default:
+    break;
+  case CmpInst::ICMP_ULT:
+  case CmpInst::ICMP_ULE:
+    if (IsKnownNonNegative(B)) {
+      UpdateEstimate(CmpInst::ICMP_SGE, A, ConstantInt::get(B->getType(), 0));
+      UpdateEstimate(CmpInst::getSignedPredicate(Pred), A, B);
+    }
+    break;
+  case CmpInst::ICMP_UGE:
+  case CmpInst::ICMP_UGT:
+    if (IsKnownNonNegative(A)) {
+      UpdateEstimate(CmpInst::ICMP_SGE, B, ConstantInt::get(B->getType(), 0));
+      UpdateEstimate(CmpInst::getSignedPredicate(Pred), A, B);
+    }
+    break;
+  case CmpInst::ICMP_SLT:
+    if (IsKnownNonNegative(A))
+      UpdateEstimate(CmpInst::ICMP_ULT, A, B);
+    break;
+  case CmpInst::ICMP_SGT:
+    if (Info.doesHold(CmpInst::ICMP_SGE, B, ConstantInt::get(B->getType(), -1)))
+      UpdateEstimate(CmpInst::ICMP_UGE, A, ConstantInt::get(B->getType(), 0));
+    if (IsKnownNonNegative(B))
+      UpdateEstimate(CmpInst::ICMP_UGT, A, B);
+    break;
+  case CmpInst::ICMP_SGE:
+    if (IsKnownNonNegative(B))
+      UpdateEstimate(CmpInst::ICMP_UGE, A, B);
+    break;
+  }
+}
+
+/// Performs a dry run of the transform, computing a conservative estimate of
+/// the total number of columns we need in the underlying storage.
+static std::tuple<State, unsigned, unsigned>
+dryRun(Function &F, DominatorTree &DT, LoopInfo &LI, ScalarEvolution &SE) {
   DT.updateDFSNumbers();
   SmallVector<Value *> FunctionArgs;
   for (Value &Arg : F.args())
     FunctionArgs.push_back(&Arg);
-  ConstraintInfo Info(F.getDataLayout(), FunctionArgs);
   State S(DT, LI, SE);
-  std::unique_ptr<Module> ReproducerModule(
-      DumpReproducers ? new Module(F.getName(), F.getContext()) : nullptr);
+  unsigned EstimatedColumns = FunctionArgs.size() + 1;
+
+  // EstimatedRowsA corresponds to SignedCS, and EstimatedRowsB corresponds to
+  // UnsignedCS.
+  unsigned EstimatedRowsA = 0, EstimatedRowsB = 1;
+  ConstraintInfo Info(F.getDataLayout(), FunctionArgs);
 
   // First, collect conditions implied by branches and blocks with their
   // Dominator DFS in and out numbers.
@@ -1725,12 +1802,90 @@ static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,
     return A.NumIn < B.NumIn;
   });
 
+  for (const FactOrCheck &CB : S.WorkList) {
+    ICmpInst::Predicate Pred;
+    Value *A, *B;
+    if (CB.isCheck()) {
+      // What follows is a dry-run of checkOrAndOpImpliedByOther, without
+      // assuming that instructions have been simplified, as they would have
+      // during the course of normal operation.
+      auto *ContextInst = CB.getContextInst();
+      if (auto *Cmp =
+              dyn_cast_or_null<ICmpInst>(CB.getInstructionToSimplify())) {
+        unsigned OtherOpIdx = ContextInst->getOperand(0) == Cmp ? 1 : 0;
+        if (match(ContextInst, m_LogicalOp()) &&
+            match(ContextInst->getOperand(OtherOpIdx),
+                  m_ICmp(Pred, m_Value(A), m_Value(B)))) {
+          if (match(ContextInst, m_LogicalOr()))
+            Pred = CmpInst::getInversePredicate(Pred);
+          dryRunAddFact(Pred, A, B, Info, EstimatedRowsA, EstimatedRowsB,
+                        EstimatedColumns);
+        }
+      }
+      continue;
+    }
+    if (!CB.isConditionFact()) {
+      Value *X;
+      if (match(CB.Inst, m_Intrinsic<Intrinsic::abs>(m_Value(X)))) {
+        if (cast<ConstantInt>(CB.Inst->getOperand(1))->isOne())
+          dryRunAddFact(CmpInst::ICMP_SGE, CB.Inst,
+                        ConstantInt::get(CB.Inst->getType(), 0), Info,
+                        EstimatedRowsA, EstimatedRowsB, EstimatedColumns);
+        dryRunAddFact(CmpInst::ICMP_SGE, CB.Inst, X, Info, EstimatedRowsA,
+                      EstimatedRowsB, EstimatedColumns);
+        continue;
+      }
+
+      if (auto *MinMax = dyn_cast<MinMaxIntrinsic>(CB.Inst)) {
+        Pred = ICmpInst::getNonStrictPredicate(MinMax->getPredicate());
+        dryRunAddFact(Pred, MinMax, MinMax->getLHS(), Info, EstimatedRowsA,
+                      EstimatedRowsB, EstimatedColumns);
+        dryRunAddFact(Pred, MinMax, MinMax->getRHS(), Info, EstimatedRowsA,
+                      EstimatedRowsB, EstimatedColumns);
+        continue;
+      }
+    }
+
+    if (CB.isConditionFact()) {
+      Pred = CB.Cond.Pred;
+      A = CB.Cond.Op0;
+      B = CB.Cond.Op1;
+    } else {
+      bool Matched = match(CB.Inst, m_Intrinsic<Intrinsic::assume>(
+                                        m_ICmp(Pred, m_Value(A), m_Value(B))));
+      (void)Matched;
+      assert(Matched && "Must have an assume intrinsic with a icmp operand");
+    }
+    dryRunAddFact(Pred, A, B, Info, EstimatedRowsA, EstimatedRowsB,
+                  EstimatedColumns);
+  }
+  return {S, std::max(EstimatedRowsA, EstimatedRowsB), EstimatedColumns};
+}
+
+static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,
+                                 ScalarEvolution &SE,
+                                 OptimizationRemarkEmitter &ORE) {
+  bool Changed = false;
+  const auto &[S, EstimatedRows, EstimatedColumns] = dryRun(F, DT, LI, SE);
+
+  // Fail early if estimates exceed limits. Row estimate could be off by up to
+  // 40%.
+  if (EstimatedRows > 1.4 * MaxRows || EstimatedColumns > MaxColumns)
+    return false;
+
+  SmallVector<Value *> FunctionArgs;
+  for (Value &Arg : F.args())
+    FunctionArgs.push_back(&Arg);
+  ConstraintInfo Info(F.getDataLayout(), FunctionArgs);
+  std::unique_ptr<Module> ReproducerModule(
+      DumpReproducers ? new Module(F.getName(), F.getContext()) : nullptr);
+
   SmallVector<Instruction *> ToRemove;
 
   // Finally, process ordered worklist and eliminate implied conditions.
   SmallVector<StackEntry, 16> DFSInStack;
   SmallVector<ReproducerEntry> ReproducerCondStack;
-  for (FactOrCheck &CB : S.WorkList) {
+  for (const FactOrCheck &CB : S.WorkList) {
     // First, pop entries from the stack that are out-of-scope for CB. Remove
     // the corresponding entry from the constraint system.
     while (!DFSInStack.empty()) {
diff --git a/llvm/test/Transforms/ConstraintElimination/max-row-limit.ll b/llvm/test/Transforms/ConstraintElimination/max-row-column-limit.ll
similarity index 81%
rename from llvm/test/Transforms/ConstraintElimination/max-row-limit.ll
rename to llvm/test/Transforms/ConstraintElimination/max-row-column-limit.ll
index 0e078109ed663..2f3b62dc5dab7 100644
--- a/llvm/test/Transforms/ConstraintElimination/max-row-limit.ll
+++ b/llvm/test/Transforms/ConstraintElimination/max-row-column-limit.ll
@@ -1,7 +1,9 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt -passes=constraint-elimination -S %s | FileCheck --check-prefixes=COMMON,SIMP %s
-; RUN: opt -passes=constraint-elimination -constraint-elimination-max-rows=9 -S %s | FileCheck --check-prefixes=COMMON,SIMP %s
-; RUN: opt -passes=constraint-elimination -constraint-elimination-max-rows=8 -S %s | FileCheck --check-prefixes=COMMON,NOSIMP %s
+; RUN: opt -passes=constraint-elimination -S %s | FileCheck --check-prefixes=SIMP %s
+; RUN: opt -passes=constraint-elimination -constraint-elimination-max-rows=8 -S %s | FileCheck --check-prefixes=SIMP %s
+; RUN: opt -passes=constraint-elimination -constraint-elimination-max-cols=6 -S %s | FileCheck --check-prefixes=SIMP %s
+; RUN: opt -passes=constraint-elimination -constraint-elimination-max-rows=7 -S %s | FileCheck --check-prefixes=NOSIMP %s
+; RUN: opt -passes=constraint-elimination -constraint-elimination-max-cols=5 -S %s | FileCheck --check-prefixes=NOSIMP %s
 
 
 define i1 @test_max_row_limit(i32 %l0, i32 %l1, i32 %l2, i32 %l3, i32 %l4) {
@@ -22,7 +24,8 @@ define i1 @test_max_row_limit(i32 %l0, i32 %l1, i32 %l2, i32 %l3, i32 %l4) {
 ; SIMP-NEXT:    [[C4:%.*]] = icmp uge i32 [[L4:%.*]], 100
 ; SIMP-NEXT:    br i1 [[C4]], label [[BB5:%.*]], label [[EXIT]]
 ; SIMP:       bb5:
-; SIMP-NEXT:    ret i1 true
+; SIMP-NEXT:    [[C5:%.*]] = icmp sge i32 [[L4:%.*]], 100
+; SIMP-NEXT:    ret i1 [[C5]]
 ; SIMP:       exit:
 ; SIMP-NEXT:    ret i1 false
 ;
@@ -43,7 +46,7 @@ define i1 @test_max_row_limit(i32 %l0, i32 %l1, i32 %l2, i32 %l3, i32 %l4) {
 ; NOSIMP-NEXT:    [[C4:%.*]] = icmp uge i32 [[L4:%.*]], 100
 ; NOSIMP-NEXT:    br i1 [[C4]], label [[BB5:%.*]], label [[EXIT]]
 ; NOSIMP:       bb5:
-; NOSIMP-NEXT:    [[C5:%.*]] = icmp uge i32 [[L4]], 100
+; NOSIMP-NEXT:    [[C5:%.*]] = icmp sge i32 [[L4]], 100
 ; NOSIMP-NEXT:    ret i1 [[C5]]
 ; NOSIMP:       exit:
 ; NOSIMP-NEXT:    ret i1 false
@@ -69,11 +72,9 @@ bb4:
   br i1 %c4, label %bb5, label %exit
 
 bb5:
-  %c5 = icmp uge i32 %l4, 100
+  %c5 = icmp sge i32 %l4, 100
   ret i1 %c5
 
 exit:
   ret i1 false
 }
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; COMMON: {{.*}}

@dtcxzyw
Member

dtcxzyw commented Jul 19, 2024

Compile-time impact on my dataset:
baseline: e8fbefe

Top 5 improvements:
  darktable/FujiDecompressor.cpp.ll 3874209532 3010886363 -22.28%
  faiss/IndexIVFAdditiveQuantizerFastScan.cpp.ll 874177979 821419686 -6.04%
  cmake/cmPolicies.cxx.ll 2196055101 2089770103 -4.84%
  php/metaphone.ll 3643053889 3480158718 -4.47%
  faiss/IndexHNSW.cpp.ll 2712011139 2615367204 -3.56%
Top 5 regressions:
  faiss/IndexFlat.cpp.ll 608107033 638979523 +5.08%
  faiss/IndexIVFFastScan.cpp.ll 7470109132 7751418757 +3.77%
  faiss/utils.cpp.ll 1280148979 1317007093 +2.88%
  lightgbm/dataset_loader.cpp.ll 7403066309 7608114932 +2.77%
  gromacs/freeenergydispatch.cpp.ll 724512915 744495719 +2.76%
Overall: 0.14623721%

Contributor

@nikic nikic left a comment


This does not look particularly compelling to me. It's a regression on average, and also duplicates a lot of logic.

@artagnon
Contributor Author

This does not look particularly compelling to me. It's a regression on average, and also duplicates a lot of logic.

Perhaps it is possible to strip down the complicated logic to obtain a cheaper estimate and reduce the logic duplication. I will benchmark against the top regressions and see what we can strip out.

@artagnon artagnon force-pushed the perf/constraintelim-dryrun branch from 7d91373 to 50c2e95 Compare July 20, 2024 12:34

@dtcxzyw
Member

dtcxzyw commented Jul 20, 2024

New results look good: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=50c2e95b9a66395d5f6b62a99e67fec1a874f8f5&stat=instructions%3Au.

@dtcxzyw Could you kindly check the latest commit against your data set?

Top 5 improvements:
  darktable/FujiDecompressor.cpp.ll 3874209532 3001864479 -22.52%
  openjdk/ad_x86_gen.ll 20827921447 18376594487 -11.77%
  cmake/cmPolicies.cxx.ll 2196055101 2081555278 -5.21%
  php/metaphone.ll 3643053889 3457411868 -5.10%
  faiss/IndexIVFAdditiveQuantizerFastScan.cpp.ll 874177979 835608401 -4.41%
Top 5 regressions:
  lightgbm/linear_tree_learner.cpp.ll 9935337605 10245318195 +3.12%
  lightgbm/sample_strategy.cpp.ll 1000916300 1027777356 +2.68%
  gromacs/propagator.cpp.ll 7940003968 8151320784 +2.66%
  meshlab/arap.cpp.ll 16215819406 16630930530 +2.56%
  gromacs/statepropagatordata.cpp.ll 3171332547 3251119016 +2.52%
Overall: 0.06266579%

@artagnon artagnon force-pushed the perf/constraintelim-dryrun branch from a1c6274 to da823b7 Compare July 21, 2024 16:20
@github-actions

github-actions bot commented Jul 21, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@artagnon artagnon force-pushed the perf/constraintelim-dryrun branch from da823b7 to 11b306f Compare July 21, 2024 16:25
@artagnon
Contributor Author

I've made one more change which, according to the Compile Time Tracker, is a small optimization: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=11b306fae6c365a152d8fb9b61846ca733ac87c1&stat=instructions%3Au. I'm running out of ideas to optimize the dry-run routine further.

@dtcxzyw Could you kindly re-check with the latest commit?

@dtcxzyw
Member

dtcxzyw commented Jul 22, 2024

I've made one more change which, according to the Compile Time Tracker, is a small optimization: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=11b306fae6c365a152d8fb9b61846ca733ac87c1&stat=instructions%3Au. I'm running out of ideas to optimize the dry-run routine further.

@dtcxzyw Could you kindly re-check with the latest commit?

Top 5 improvements:
  darktable/FujiDecompressor.cpp.ll 3874209532 3011988604 -22.26%
  openjdk/ad_x86_gen.ll 20827921447 18378859574 -11.76%
  faiss/IndexIVFAdditiveQuantizerFastScan.cpp.ll 874177979 810966258 -7.23%
  php/metaphone.ll 3643053889 3456185042 -5.13%
  cmake/cmPolicies.cxx.ll 2196055101 2085039126 -5.06%
Top 5 regressions:
  faiss/utils.cpp.ll 1280148979 1316090722 +2.81%
  gromacs/pull.cpp.ll 3521897039 3617100314 +2.70%
  faiss/IndexBinaryHNSW.cpp.ll 1079192410 1105841398 +2.47%
  faiss/IndexIVFFlat.cpp.ll 1772139516 1815060287 +2.42%
  faiss/IndexAdditiveQuantizerFastScan.cpp.ll 412085733 421718533 +2.34%
Overall: 0.03707713%

@artagnon
Contributor Author

Okay, so it's a near-neutral change overall, with huge improvements in some cases. I don't think it's possible to optimize this further without duplicating decompose(), and that is probably undesirable.

Although the benefits of this patch on its own are questionable, it is a prerequisite for another neutral change: the move to the Matrix data structure. That change, in turn, is a prerequisite for building more complicated things on top of ConstraintSystem: I'm planning to write a solver-based dependence analysis.

@artagnon artagnon force-pushed the perf/constraintelim-dryrun branch from 9537b21 to f0d674c Compare July 22, 2024 16:14
@artagnon
Contributor Author

Managed to eke out another small optimization: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=f0d674c9f5aa601c0e5472f208ff96f3d855d9b6&stat=instructions%3Au.

@dtcxzyw Could you kindly re-check against the latest commit?

@artagnon artagnon force-pushed the perf/constraintelim-dryrun branch from 76831e7 to f0d674c Compare July 22, 2024 17:23
@antoniofrighetto
Contributor

Managed to eke out another small optimization: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=f0d674c9f5aa601c0e5472f208ff96f3d855d9b6&stat=instructions%3Au.

@dtcxzyw Could you kindly re-check against the latest commit?

I also wonder if this suffices to establish whether the early conservative estimation of rows/columns outweighs the somewhat noticeable code duplication; maybe it is worth running SPEC with this before further reviewing?

@artagnon
Contributor Author

I also wonder if this suffices to establish whether the early conservative estimation of rows/columns outweighs the somewhat noticeable code duplication; maybe it is worth running SPEC with this before further reviewing?

Certainly. Unfortunately, I don't have access to a machine with SPEC set up. Perhaps someone else has access to one?

I think it might be possible to get rid of at least some of the code duplication using a template parameter (a sketch of the idea follows). Also, I think the regressions are coming from tiny compilation units, for which we could short-circuit the dry run using a rough estimate, such as the number of items in the worklist.
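
For illustration, a minimal sketch of the template idea, under the assumption that the dry-run and transform paths can share one driver; processFact and Estimates are hypothetical names that do not exist in ConstraintElimination.cpp:

// Hypothetical sketch only: share one body between the dry run and the real
// transform via a compile-time flag.
struct Estimates {
  // RowsUnsigned starts at 1, and Columns would be seeded from the argument
  // count before use, mirroring the initial values in dryRun.
  unsigned RowsSigned = 0, RowsUnsigned = 1, Columns = 0;
};

template <bool DryRun>
static void processFact(CmpInst::Predicate Pred, Value *A, Value *B,
                        ConstraintInfo &Info, Estimates &E) {
  if constexpr (DryRun) {
    // Estimation path: only count the rows/columns this fact would add.
    dryRunAddFact(Pred, A, B, Info, E.RowsSigned, E.RowsUnsigned, E.Columns);
    return;
  }
  // Transform path: mirror the existing addFact logic here.
}

The worklist short-circuit could then sit in front of the dry run, e.g. skipping estimation entirely when S.WorkList.size() is already well under MaxRows.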


@fhahn
Contributor

fhahn commented Jul 22, 2024

Managed to eke out another small optimization: http://llvm-compile-time-tracker.com/compare.php?from=c5ee3c05ca61f3fae11337c5853aee7b450a9dc6&to=f0d674c9f5aa601c0e5472f208ff96f3d855d9b6&stat=instructions%3Au.
@dtcxzyw Could you kindly re-check against the latest commit?

I also wonder if this suffices to establish whether the early conservative estimation of rows/columns outweighs the somewhat noticeable code duplication; maybe it is worth running SPEC with this before further reviewing?

I can check that this is NFC from a code-gen perspective for SPEC and some other larger workloads in the next few days. Please ping me if you don't hear from me by Thursday :)

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request Jul 23, 2024
@artagnon
Contributor Author

I can check that this is NFC from a code-gen perspective for SPEC and some other larger workloads in the next few days. Please ping me if you don't hear from me by Thursday :)

Ping :)

@artagnon
Contributor Author

I managed to get access to a copy of SPEC2006. I'm happy to report that there are no binary changes, and that the compile time has improved by 0.9%, as measured with perf.

Program                                       compile_time
                                              baseline     patch   diff
External/SPEC/CFP2006/470.lbm/470.lbm            6.70         6.94  3.5%
External/SPEC/CFP2006/444.namd/444.namd        277.31       279.18  0.7%
External/S.../CFP2006/453.povray/453.povray    737.11       741.86  0.6%
External/S...NT2006/471.omnetpp/471.omnetpp   2325.30      2340.06  0.6%
External/SPEC/CFP2006/433.milc/433.milc        153.72       154.29  0.4%
External/S...C/CINT2006/445.gobmk/445.gobmk    897.01       899.44  0.3%
Bitcode/Be...hmarks/Halide/blur/halide_blur      1.43         1.43  0.0%
External/S...FP2006/482.sphinx3/482.sphinx3     98.46        98.14 -0.3%
External/S...06/483.xalancbmk/483.xalancbmk   7426.60      7397.76 -0.4%
External/S...C/CINT2006/456.hmmer/456.hmmer    339.66       337.53 -0.6%
External/S...NT2006/464.h264ref/464.h264ref    841.65       836.31 -0.6%
External/S...C/CINT2006/401.bzip2/401.bzip2     98.84        98.13 -0.7%
External/SPEC/CINT2006/403.gcc/403.gcc        2470.16      2442.94 -1.1%
External/S...C/CINT2006/473.astar/473.astar     72.93        72.03 -1.2%
External/S.../CFP2006/450.soplex/450.soplex    775.14       759.94 -2.0%
                           Geomean difference                      -0.9%

@fhahn
Contributor

fhahn commented Jul 27, 2024

I had a chance to check SPEC2017 and there are a few cases where fewer conditions are simplified (with -O3)

Program                                       constraint-elimination.NumCondsRemoved
                                              base                                   patch   diff
External/S...NT2017rate/502.gcc_r/502.gcc_r   1123.00                                1098.00  -2.2%
External/S...rate/510.parest_r/510.parest_r    273.00                                 262.00  -4.0%
External/S...00.perlbench_r/500.perlbench_r     97.00                                  88.00  -9.3%
External/S...te/538.imagick_r/538.imagick_r     30.00                                  24.00 -20.0%
External/S...te/526.blender_r/526.blender_r    599.00                                 344.00 -42.6%

In SQLite from the test suite, there are also a few cases where we have fewer simplifications (but no binary changes, probably because they get simplified later).

I am wondering if we should aim for a check like the one below, to surface more cases where the behavior changes (with that change in place, I also tried bootstrapping Clang, but couldn't get very far):

diff --git a/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp b/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
index 40d117d49489..916f90c1fa0f 100644
--- a/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
+++ b/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
@@ -1898,8 +1898,8 @@ static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,

   // Fail early if estimates exceed limits. Row estimate could be off by up to
   // 40%.
-  if (EstimatedRows > 1.4 * MaxRows || EstimatedColumns > MaxColumns)
-    return false;
+  // if (EstimatedRows > 1.4 * MaxRows || EstimatedColumns > MaxColumns)
+  //  return false;

   SmallVector<Value *> FunctionArgs;
   for (Value &Arg : F.args())
@@ -2058,6 +2058,9 @@ static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,

   for (Instruction *I : ToRemove)
     I->eraseFromParent();
+
+  assert(!Changed ||
+         (EstimatedRows <= 1.4 * MaxRows && EstimatedColumns <= MaxColumns));
   return Changed;
 }

@fhahn
Contributor

fhahn commented Jul 27, 2024

Presumably some of those issues would go away with a larger default for MaxColumns?

@artagnon
Contributor Author

I had a chance to check SPEC2017 and there are a few cases where fewer conditions are simplified (with -O3)

Program                                       constraint-elimination.NumCondsRemoved
                                              base                                   patch   diff
External/S...NT2017rate/502.gcc_r/502.gcc_r   1123.00                                1098.00  -2.2%
External/S...rate/510.parest_r/510.parest_r    273.00                                 262.00  -4.0%
External/S...00.perlbench_r/500.perlbench_r     97.00                                  88.00  -9.3%
External/S...te/538.imagick_r/538.imagick_r     30.00                                  24.00 -20.0%
External/S...te/526.blender_r/526.blender_r    599.00                                 344.00 -42.6%

In SQLite from the test suite, there are also a few cases where we have fewer simplifications (but no binary changes, probably because they get simplified later).

I am wondering if we should aim for a check like the one below, to surface more cases where the behavior changes (with that change in place, I also tried bootstrapping Clang, but couldn't get very far):

diff --git a/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp b/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
index 40d117d49489..916f90c1fa0f 100644
--- a/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
+++ b/llvm/lib/Transforms/Scalar/ConstraintElimination.cpp
@@ -1898,8 +1898,8 @@ static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,

   // Fail early if estimates exceed limits. Row estimate could be off by up to
   // 40%.
-  if (EstimatedRows > 1.4 * MaxRows || EstimatedColumns > MaxColumns)
-    return false;
+  // if (EstimatedRows > 1.4 * MaxRows || EstimatedColumns > MaxColumns)
+  //  return false;

   SmallVector<Value *> FunctionArgs;
   for (Value &Arg : F.args())
@@ -2058,6 +2058,9 @@ static bool eliminateConstraints(Function &F, DominatorTree &DT, LoopInfo &LI,

   for (Instruction *I : ToRemove)
     I->eraseFromParent();
+
+  assert(!Changed ||
+         (EstimatedRows <= 1.4 * MaxRows && EstimatedColumns <= MaxColumns));
   return Changed;
 }

My gut reaction is that this diff will not be helpful: we will get a lot of assertion failures in the case of binary changes, and those assertions will carry no information with which to meaningfully update the estimates. The dry-run estimation can't be used as a logically precise condition: it is a cheap estimate that "works in practice". I wish we had some kind of formula for the upper bound that we could just lift from a paper (as is the case for many algorithms, like Fourier-Motzkin), but this is not possible for ConstraintElimination.

Presumably some of those issues would go away with a larger default for MaxColumns?

My intuition from writing the dry-run estimation routine is that the 1.4 factor is the more problematic issue, not MaxColumns. We should bump the factor to 1.6 and try again: the reason I used 1.4 is that I wanted a tight upper bound, so we don't lose the benefits of the patch; 1.6 could work just as well, but a factor that is too high might not yield such great overall results. I think the row estimation is off by a constant factor (modulo minor error), so multiplying by a constant factor should be sound.

I've observed that the column-estimation is pretty accurate, although it is possible that MaxColumns would need to be bumped up to 64 if the row-estimation times the constant factor is correct.
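
If the factor keeps needing tuning, one hypothetical option (not part of this patch, which hard-codes the factor in the early-exit check) would be a hidden flag next to MaxRows and MaxColumns, so experiments like 1.4 versus 1.6 don't require a rebuild:

// Hypothetical option; the name and the 1.6 default are illustrative.
static cl::opt<double> RowEstimateSlack(
    "constraint-elimination-row-estimate-slack", cl::init(1.6), cl::Hidden,
    cl::desc("Slack factor applied to the dry-run row estimate"));

// The early exit in eliminateConstraints would then read:
//   if (EstimatedRows > RowEstimateSlack * MaxRows ||
//       EstimatedColumns > MaxColumns)
//     return false;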

@fhahn
Contributor

fhahn commented Jul 29, 2024

My gut reaction is that this diff will not be helpful: we will get a lot of assertion failures in the case of binary changes, and those assertions will carry no information with which to meaningfully update the estimates. The dry-run estimation can't be used as a logically precise condition: it is a cheap estimate that "works in practice". I wish we had some kind of formula for the upper bound that we could just lift from a paper (as is the case for many algorithms, like Fourier-Motzkin), but this is not possible for ConstraintElimination.

Presumably some of those issues would go away with a larger default for MaxColumns?

My intuition from writing the dry-run estimation routine is that the 1.4 factor is the more problematic issue, not MaxColumns. We should bump the factor to 1.6 and try again: the reason I used 1.4 is that I wanted a tight upper bound, so we don't lose the benefits of the patch; 1.6 could work just as well, but a factor that is too high might not yield such great overall results. I think the row estimation is off by a constant factor (modulo minor error), so multiplying by a constant factor should be sound.

I've observed that the column-estimation is pretty accurate, although it is possible that MaxColumns would need to be bumped up to 64 if the row-estimation times the constant factor is correct.

Hm, I think we should aim for a path forward that doesn't miss any simplification cases, if possible. IIRC it also changes the behavior of the max-column limit: previously, on hitting the limit, we just skipped facts until rows were removed again, and then continued.

The current patch means we don't even try to simplify anything if the max is going to be above the threshold, which doesn't seem ideal.

Thinking also in the context of going forward with the new matrix data structure, some other alternatives come to mind:

  • using the estimate from the dry run for the initial matrix size, without the tight upper bound (possibly with one that is much larger to avoid degenerate cases)
  • skip adding new facts when doing so would exceed the available columns in the matrix
  • resize the matrix data structure on demand, even if costly, if the cost would be amortized by more efficient ops on average (a rough sketch of this follows below)
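
As a toy illustration of the resize-on-demand alternative (not the Matrix implementation under development; all names are hypothetical), the column count could grow geometrically so widening amortizes to O(1) per column, at the price of some over-allocation:

#include <algorithm>
#include <cstdint>
#include <vector>

// Toy row-major matrix that widens its column count on demand.
class ToyMatrix {
  unsigned NumCols;
  std::vector<std::vector<int64_t>> Rows;

public:
  // Seed the width with the dry-run estimate (or any initial guess).
  explicit ToyMatrix(unsigned EstimatedCols) : NumCols(EstimatedCols) {}

  int64_t &at(unsigned Row, unsigned Col) { return Rows[Row][Col]; }

  // Widen storage only when a fact introduces more variables than estimated;
  // doubling keeps repeated widening cheap in the amortized sense.
  void ensureColumns(unsigned N) {
    if (N <= NumCols)
      return;
    NumCols = std::max(N, 2 * NumCols);
    for (auto &Row : Rows)
      Row.resize(NumCols, 0);
  }

  // Append a row of coefficients, padding it to the current width.
  void addRow(std::vector<int64_t> Coeffs) {
    ensureColumns(static_cast<unsigned>(Coeffs.size()));
    Coeffs.resize(NumCols, 0);
    Rows.push_back(std::move(Coeffs));
  }
};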

I don't think we need to aim for the current patch having a positive compile-time impact on its own if it comes at the cost of lost simplifications; having a positive impact in combination with the new matrix data type should be sufficient IMO.

@artagnon
Contributor Author

Hm, I think we should aim for a path forward that doesn't miss any simplification cases, if possible. IIRC it also changes the behavior of the max-column limit: previously, on hitting the limit, we just skipped facts until rows were removed again, and then continued.

The current patch means we don't even try to simplify anything if the max is going to be above the threshold, which doesn't seem ideal.

I agree with your reasoning fully, since it is impossible to know whether there will be cases with fewer simplifications. Even if we manage to get SPEC 2017 to show no binary changes, there could be degenerate cases out there that miss some simplifications because of the dry-run cutoff. There are, however, fine trade-offs we need to make, which I will explain shortly.

Thinking also in the context of going forward with the new matrix data structure, some other alternatives that come to mind

* using the estimate from the dry run for the initial matrix size, without the tight upper bound (possibly with one that is much larger to avoid degenerate cases)

Okay, so the current patch is a strict over-estimate: it doesn't under-estimate under any circumstance. The only reason for MaxColumns to exist is for the test.

I don't think we need to aim for the current patch having a positive compile-time impact on its own if it comes at the cost of lost simplifications; having a positive impact in combination with the new matrix data type should be sufficient IMO.

The crux of the issue is that the dry-run estimation adds a very small overhead, while the Matrix data structure is just a zero-cost abstraction (the benefit is so small that the benchmarks classify it as noise). We would therefore see a small negative impact overall after applying both patches. Although I'm not sure, I think removing the dry-run routine and column-resizing the Matrix at runtime would make it a non-zero-cost abstraction.

The reason I developed the Matrix data structure is so that we can take slices of the Matrix and operate on a slice as if it were first-class (a rough sketch of the idea follows). This would allow us to parse dependence information into a larger Matrix, and pass a sub-Matrix to the solver: dependence information has a "domain" and a "range", which are parsed into separate column ranges, but solvers cannot operate on "mappings"; you need to extract a "set" out of the mapping and operate on that. The way Presburger solves this problem is by copying pieces of "mappings" into "sets", which is far from ideal. Another benefit is that it would allow us to experiment with implementing different ILP algorithms, like Simplex-based methods, which are supposedly faster (although we'd have to experiment to find out whether they really are faster in our case). Although it is technically possible to implement Simplex-based methods with a vector-of-vectors, the most natural data structure is a "tableau" (or Matrix).
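
To illustrate the slicing idea, here is a hypothetical non-owning view over a column range, reusing the ToyMatrix sketched in an earlier comment rather than the actual Matrix under development:

#include <cassert>

// A solver handed a ToyMatrixView sees only the "set" columns of a "mapping",
// with no copying; the view covers [FirstCol, FirstCol + Cols).
class ToyMatrixView {
  ToyMatrix &M;
  unsigned FirstCol, Cols;

public:
  ToyMatrixView(ToyMatrix &M, unsigned FirstCol, unsigned Cols)
      : M(M), FirstCol(FirstCol), Cols(Cols) {}

  int64_t &at(unsigned Row, unsigned Col) {
    assert(Col < Cols && "column outside the view");
    return M.at(Row, FirstCol + Col);
  }
  unsigned columns() const { return Cols; }
};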

Now, it wouldn't be ideal to land a patch that has negative impact overall, with the promise that it could be addressed in the future. The current patch is in its current form because that was the trade-off calculation I performed in my head.

@artagnon
Contributor Author

Tweaking the patch was pretty straightforward, actually. There are no binary changes on SPEC2017 after bumping the factor to 1.6, and the compile-time improvement is 0.6%. As I mentioned, there is no way to absolutely guarantee that there won't be binary changes in a theoretical degenerate case, but I think the trade-off I've chosen is practical and will allow us to move forward.

Program                                       compile_time
                                              baseline     patch    diff
External/S...96.specrand_fs/996.specrand_fs       1.62         1.72   6.0%
External/S...99.specrand_ir/999.specrand_ir       1.67         1.76   5.6%
External/S...P2017speed/644.nab_s/644.nab_s     102.36       108.04   5.5%
External/S...NT2017rate/505.mcf_r/505.mcf_r      22.42        23.25   3.7%
External/S...T2017speed/605.mcf_s/605.mcf_s      22.29        22.98   3.1%
External/S...017speed/625.x264_s/625.x264_s     365.08       373.04   2.2%
External/S...te/520.omnetpp_r/520.omnetpp_r    3009.42      3027.51   0.6%
External/S...17rate/541.leela_r/541.leela_r     355.39       357.28   0.5%
External/S...te/526.blender_r/526.blender_r    7542.34      7568.07   0.3%
External/S...rate/510.parest_r/510.parest_r   10715.75     10749.72   0.3%
External/S...rate/511.povray_r/511.povray_r     802.93       805.45   0.3%
External/S...ed/638.imagick_s/638.imagick_s    1035.18      1038.16   0.3%
External/S...ed/620.omnetpp_s/620.omnetpp_s    3026.72      3032.78   0.2%
External/S...23.xalancbmk_r/523.xalancbmk_r    6407.85      6414.59   0.1%
Bitcode/Be...hmarks/Halide/blur/halide_blur       1.43         1.43   0.0%
External/S...T2017speed/602.gcc_s/602.gcc_s    6570.48      6569.85  -0.0%
External/S...23.xalancbmk_s/623.xalancbmk_s    6397.61      6396.53  -0.0%
External/S...INT2017speed/657.xz_s/657.xz_s     129.01       128.97  -0.0%
External/S...2017rate/525.x264_r/525.x264_r     369.01       368.63  -0.1%
External/S...00.perlbench_r/500.perlbench_r    1074.48      1071.83  -0.2%
External/S...NT2017rate/502.gcc_r/502.gcc_r    6543.43      6523.43  -0.3%
External/S...te/538.imagick_r/538.imagick_r    1136.65      1132.24  -0.4%
External/S...2017rate/508.namd_r/508.namd_r     539.89       537.10  -0.5%
External/S...7speed/641.leela_s/641.leela_s     353.56       350.23  -0.9%
External/S...98.specrand_is/998.specrand_is       1.77         1.75  -1.0%
External/S...CINT2017rate/557.xz_r/557.xz_r     131.92       129.86  -1.6%
External/S...00.perlbench_s/600.perlbench_s    1194.66      1174.82  -1.7%
External/S...FP2017rate/544.nab_r/544.nab_r     113.08       111.06  -1.8%
External/S...31.deepsjeng_s/631.deepsjeng_s      76.04        74.23  -2.4%
External/S...31.deepsjeng_r/531.deepsjeng_r      76.10        71.79  -5.7%
External/S...FP2017rate/519.lbm_r/519.lbm_r       8.87         8.23  -7.2%
External/S...97.specrand_fr/997.specrand_fr       1.65         1.53  -7.3%
External/S...P2017speed/619.lbm_s/619.lbm_s       9.61         8.05 -16.3%
                           Geomean difference                        -0.6%

@artagnon
Contributor Author

artagnon commented Aug 7, 2024

So, I had some more time to think about this, and there is one other, slightly sub-optimal, way forward: I experimented with dropping the dry-run routine entirely and hard-coding 50 columns in the Matrix. If we feel that the small compile-time regression this causes is acceptable, and that the dry-run routine is not, this would be one potential way forward: http://llvm-compile-time-tracker.com/compare.php?from=76e2307e695330912b89e51f65f75ca0f4c8bd24&to=1be3ca30a469ff4b2a45a2eb5bba3db230afe95e&stat=instructions:u.

Otherwise, I can also experiment with column-resizing the Matrix dynamically, although I'm less optimistic about that.

@artagnon artagnon closed this Nov 18, 2024
@artagnon artagnon deleted the perf/constraintelim-dryrun branch November 18, 2024 10:25