Skip to content

Conversation

@pcc
Copy link
Contributor

@pcc pcc commented Apr 23, 2025

By calling this function after every link, we introduce quadratic
behavior during full LTO links that primarily affects links involving
large numbers of constant arrays, such as the full LTO part of a
ThinLTO link which will involve large numbers of vtable constants.
Removing this call resulted in a 1.3x speedup in the indexing phase
of a large internal Google binary.

This call was originally introduced to reduce memory consumption during
full LTO (see commit dab999d), but it
doesn't seem to be the case that this helps any more. I ran 3 stage2
links of clang with full LTO and ThinLTO with and without this change
and measured the max RSS. The results were as follows (all in KB):

FullLTO before 22362512 22383524 22387456
         after 22383496 22365356 22364352
ThinLTO before  4391404  4478192  4383468
         after  4399220  4363100  4417688

As you can see, any max RSS differences are in the noise.

Created using spr 1.3.6-beta.1
@llvmbot llvmbot added LTO Link time optimization (regular/full LTO or ThinLTO) llvm:ir labels Apr 23, 2025
@pcc pcc requested a review from teresajohnson April 23, 2025 23:15
@llvmbot
Copy link
Member

llvmbot commented Apr 23, 2025

@llvm/pr-subscribers-lto

Author: Peter Collingbourne (pcc)

Changes

By calling this function after every link, we introduce quadratic
behavior during full LTO links that primarily affects links involving
large numbers of constant arrays, such as the full LTO part of a
ThinLTO link which will involve large numbers of vtable constants.
Removing this call resulted in a 1.3x speedup in the indexing phase
of a large internal Google binary.

This call was originally introduced to reduce memory consumption during
full LTO (see commit dab999d), but it
doesn't seem to be the case that this helps any more. I ran 3 stage2
links of clang with full LTO and ThinLTO with and without this change
and measured the max RSS. The results were as follows (all in KB):

FullLTO before 22362512 22383524 22387456
         after 22383496 22365356 22364352
ThinLTO before  4391404  4478192  4383468
         after  4399220  4363100  4417688

As you can see, any max RSS differences are in the noise.


Full diff: https://github.com/llvm/llvm-project/pull/137081.diff

4 Files Affected:

  • (modified) llvm/include/llvm/IR/Module.h (-9)
  • (modified) llvm/lib/IR/LLVMContextImpl.cpp (-27)
  • (modified) llvm/lib/IR/LLVMContextImpl.h (-3)
  • (modified) llvm/lib/Linker/IRMover.cpp (+1-3)
diff --git a/llvm/include/llvm/IR/Module.h b/llvm/include/llvm/IR/Module.h
index 91ccd76c41e07..6d06c12a77759 100644
--- a/llvm/include/llvm/IR/Module.h
+++ b/llvm/include/llvm/IR/Module.h
@@ -877,15 +877,6 @@ class LLVM_ABI Module {
   }
 /// @}
 
-  /// Destroy ConstantArrays in LLVMContext if they are not used.
-  /// ConstantArrays constructed during linking can cause quadratic memory
-  /// explosion. Releasing all unused constants can cause a 20% LTO compile-time
-  /// slowdown for a large application.
-  ///
-  /// NOTE: Constants are currently owned by LLVMContext. This can then only
-  /// be called where all uses of the LLVMContext are understood.
-  void dropTriviallyDeadConstantArrays();
-
 /// @name Utility functions for printing and dumping Module objects
 /// @{
 
diff --git a/llvm/lib/IR/LLVMContextImpl.cpp b/llvm/lib/IR/LLVMContextImpl.cpp
index d8cdbf8370984..dfeabb4cbc7c1 100644
--- a/llvm/lib/IR/LLVMContextImpl.cpp
+++ b/llvm/lib/IR/LLVMContextImpl.cpp
@@ -147,33 +147,6 @@ LLVMContextImpl::~LLVMContextImpl() {
     delete Pair.second;
 }
 
-void LLVMContextImpl::dropTriviallyDeadConstantArrays() {
-  SmallSetVector<ConstantArray *, 4> WorkList;
-
-  // When ArrayConstants are of substantial size and only a few in them are
-  // dead, starting WorkList with all elements of ArrayConstants can be
-  // wasteful. Instead, starting WorkList with only elements that have empty
-  // uses.
-  for (ConstantArray *C : ArrayConstants)
-    if (C->use_empty())
-      WorkList.insert(C);
-
-  while (!WorkList.empty()) {
-    ConstantArray *C = WorkList.pop_back_val();
-    if (C->use_empty()) {
-      for (const Use &Op : C->operands()) {
-        if (auto *COp = dyn_cast<ConstantArray>(Op))
-          WorkList.insert(COp);
-      }
-      C->destroyConstant();
-    }
-  }
-}
-
-void Module::dropTriviallyDeadConstantArrays() {
-  Context.pImpl->dropTriviallyDeadConstantArrays();
-}
-
 namespace llvm {
 
 /// Make MDOperand transparent for hashing.
diff --git a/llvm/lib/IR/LLVMContextImpl.h b/llvm/lib/IR/LLVMContextImpl.h
index 9b60b57e64130..71369772f4529 100644
--- a/llvm/lib/IR/LLVMContextImpl.h
+++ b/llvm/lib/IR/LLVMContextImpl.h
@@ -1808,9 +1808,6 @@ class LLVMContextImpl {
   LLVMContextImpl(LLVMContext &C);
   ~LLVMContextImpl();
 
-  /// Destroy the ConstantArrays if they are not used.
-  void dropTriviallyDeadConstantArrays();
-
   mutable OptPassGate *OPG = nullptr;
 
   /// Access the object which can disable optional passes and individual
diff --git a/llvm/lib/Linker/IRMover.cpp b/llvm/lib/Linker/IRMover.cpp
index 0f32e9e1a05a5..af2ac13e3665b 100644
--- a/llvm/lib/Linker/IRMover.cpp
+++ b/llvm/lib/Linker/IRMover.cpp
@@ -1766,7 +1766,5 @@ Error IRMover::move(std::unique_ptr<Module> Src,
   IRLinker TheIRLinker(Composite, SharedMDs, IdentifiedStructTypes,
                        std::move(Src), ValuesToLink, std::move(AddLazyFor),
                        IsPerformingImport);
-  Error E = TheIRLinker.run();
-  Composite.dropTriviallyDeadConstantArrays();
-  return E;
+  return TheIRLinker.run();
 }

@llvmbot
Copy link
Member

llvmbot commented Apr 23, 2025

@llvm/pr-subscribers-llvm-ir

Author: Peter Collingbourne (pcc)

Changes

By calling this function after every link, we introduce quadratic
behavior during full LTO links that primarily affects links involving
large numbers of constant arrays, such as the full LTO part of a
ThinLTO link which will involve large numbers of vtable constants.
Removing this call resulted in a 1.3x speedup in the indexing phase
of a large internal Google binary.

This call was originally introduced to reduce memory consumption during
full LTO (see commit dab999d), but it
doesn't seem to be the case that this helps any more. I ran 3 stage2
links of clang with full LTO and ThinLTO with and without this change
and measured the max RSS. The results were as follows (all in KB):

FullLTO before 22362512 22383524 22387456
         after 22383496 22365356 22364352
ThinLTO before  4391404  4478192  4383468
         after  4399220  4363100  4417688

As you can see, any max RSS differences are in the noise.


Full diff: https://github.com/llvm/llvm-project/pull/137081.diff

4 Files Affected:

  • (modified) llvm/include/llvm/IR/Module.h (-9)
  • (modified) llvm/lib/IR/LLVMContextImpl.cpp (-27)
  • (modified) llvm/lib/IR/LLVMContextImpl.h (-3)
  • (modified) llvm/lib/Linker/IRMover.cpp (+1-3)
diff --git a/llvm/include/llvm/IR/Module.h b/llvm/include/llvm/IR/Module.h
index 91ccd76c41e07..6d06c12a77759 100644
--- a/llvm/include/llvm/IR/Module.h
+++ b/llvm/include/llvm/IR/Module.h
@@ -877,15 +877,6 @@ class LLVM_ABI Module {
   }
 /// @}
 
-  /// Destroy ConstantArrays in LLVMContext if they are not used.
-  /// ConstantArrays constructed during linking can cause quadratic memory
-  /// explosion. Releasing all unused constants can cause a 20% LTO compile-time
-  /// slowdown for a large application.
-  ///
-  /// NOTE: Constants are currently owned by LLVMContext. This can then only
-  /// be called where all uses of the LLVMContext are understood.
-  void dropTriviallyDeadConstantArrays();
-
 /// @name Utility functions for printing and dumping Module objects
 /// @{
 
diff --git a/llvm/lib/IR/LLVMContextImpl.cpp b/llvm/lib/IR/LLVMContextImpl.cpp
index d8cdbf8370984..dfeabb4cbc7c1 100644
--- a/llvm/lib/IR/LLVMContextImpl.cpp
+++ b/llvm/lib/IR/LLVMContextImpl.cpp
@@ -147,33 +147,6 @@ LLVMContextImpl::~LLVMContextImpl() {
     delete Pair.second;
 }
 
-void LLVMContextImpl::dropTriviallyDeadConstantArrays() {
-  SmallSetVector<ConstantArray *, 4> WorkList;
-
-  // When ArrayConstants are of substantial size and only a few in them are
-  // dead, starting WorkList with all elements of ArrayConstants can be
-  // wasteful. Instead, starting WorkList with only elements that have empty
-  // uses.
-  for (ConstantArray *C : ArrayConstants)
-    if (C->use_empty())
-      WorkList.insert(C);
-
-  while (!WorkList.empty()) {
-    ConstantArray *C = WorkList.pop_back_val();
-    if (C->use_empty()) {
-      for (const Use &Op : C->operands()) {
-        if (auto *COp = dyn_cast<ConstantArray>(Op))
-          WorkList.insert(COp);
-      }
-      C->destroyConstant();
-    }
-  }
-}
-
-void Module::dropTriviallyDeadConstantArrays() {
-  Context.pImpl->dropTriviallyDeadConstantArrays();
-}
-
 namespace llvm {
 
 /// Make MDOperand transparent for hashing.
diff --git a/llvm/lib/IR/LLVMContextImpl.h b/llvm/lib/IR/LLVMContextImpl.h
index 9b60b57e64130..71369772f4529 100644
--- a/llvm/lib/IR/LLVMContextImpl.h
+++ b/llvm/lib/IR/LLVMContextImpl.h
@@ -1808,9 +1808,6 @@ class LLVMContextImpl {
   LLVMContextImpl(LLVMContext &C);
   ~LLVMContextImpl();
 
-  /// Destroy the ConstantArrays if they are not used.
-  void dropTriviallyDeadConstantArrays();
-
   mutable OptPassGate *OPG = nullptr;
 
   /// Access the object which can disable optional passes and individual
diff --git a/llvm/lib/Linker/IRMover.cpp b/llvm/lib/Linker/IRMover.cpp
index 0f32e9e1a05a5..af2ac13e3665b 100644
--- a/llvm/lib/Linker/IRMover.cpp
+++ b/llvm/lib/Linker/IRMover.cpp
@@ -1766,7 +1766,5 @@ Error IRMover::move(std::unique_ptr<Module> Src,
   IRLinker TheIRLinker(Composite, SharedMDs, IdentifiedStructTypes,
                        std::move(Src), ValuesToLink, std::move(AddLazyFor),
                        IsPerformingImport);
-  Error E = TheIRLinker.run();
-  Composite.dropTriviallyDeadConstantArrays();
-  return E;
+  return TheIRLinker.run();
 }

@nikic
Copy link
Contributor

nikic commented Apr 25, 2025

I'd expect this optimization is mainly targeting linking many modules that have globals with appending linkage -- not sure if clang exercises that. It should be possible to address memory usage for that case more directly though, by tracking the old ConstantArrays for the appending linkage link and only trying to clean them up.

@pcc
Copy link
Contributor Author

pcc commented Apr 26, 2025

Each of our translation units with a cl::opt will have an llvm.global_ctors entry so I would expect clang to exercise this case fairly realistically given the number of translation units we have that declare flags. I would also expect that clang overrepresents this case; most large C++ programs that I'm aware of try to avoid global constructors (see e.g. Google's style guide).

I probably wouldn't try to optimize this preemptively in any case; even if the issue somehow still exists it's possible that anyone who previously cared about it has moved to ThinLTO (the patch that introduced dropTriviallyDeadConstantArrays predates ThinLTO) which doesn't import appending arrays so I reckon we should make the change and see if anyone complains.

Copy link
Contributor

@teresajohnson teresajohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the wrong place to do this optimization anyway - is there no optimization pass that removes dead constant arrays, e.g. GlobalDCE?

@pcc
Copy link
Contributor Author

pcc commented Apr 28, 2025

This seems like the wrong place to do this optimization anyway - is there no optimization pass that removes dead constant arrays, e.g. GlobalDCE?

GlobalDCE only deletes dead globals as far as I can tell. Although I suppose we could have passes (and other code like the IR linker) try to delete constants that have likely become unused (e.g. GlobalDCE can try to delete initializers of globals it deletes), that should only be done if the benefits outweigh the costs, which is not the case here.

@pcc pcc merged commit 95795ab into main Apr 28, 2025
14 checks passed
@pcc pcc deleted the users/pcc/spr/linker-remove-droptriviallydeadconstantarrays branch April 28, 2025 19:23
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
By calling this function after every link, we introduce quadratic
behavior during full LTO links that primarily affects links involving
large numbers of constant arrays, such as the full LTO part of a
ThinLTO link which will involve large numbers of vtable constants.
Removing this call resulted in a 1.3x speedup in the indexing phase
of a large internal Google binary.

This call was originally introduced to reduce memory consumption during
full LTO (see commit dab999d), but it
doesn't seem to be the case that this helps any more. I ran 3 stage2
links of clang with full LTO and ThinLTO with and without this change
and measured the max RSS. The results were as follows (all in KB):

```
FullLTO before 22362512 22383524 22387456
         after 22383496 22365356 22364352
ThinLTO before  4391404  4478192  4383468
         after  4399220  4363100  4417688
```

As you can see, any max RSS differences are in the noise.

Reviewers: nikic, teresajohnson

Reviewed By: teresajohnson, nikic

Pull Request: llvm#137081
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 6, 2025
By calling this function after every link, we introduce quadratic
behavior during full LTO links that primarily affects links involving
large numbers of constant arrays, such as the full LTO part of a
ThinLTO link which will involve large numbers of vtable constants.
Removing this call resulted in a 1.3x speedup in the indexing phase
of a large internal Google binary.

This call was originally introduced to reduce memory consumption during
full LTO (see commit dab999d), but it
doesn't seem to be the case that this helps any more. I ran 3 stage2
links of clang with full LTO and ThinLTO with and without this change
and measured the max RSS. The results were as follows (all in KB):

```
FullLTO before 22362512 22383524 22387456
         after 22383496 22365356 22364352
ThinLTO before  4391404  4478192  4383468
         after  4399220  4363100  4417688
```

As you can see, any max RSS differences are in the noise.

Reviewers: nikic, teresajohnson

Reviewed By: teresajohnson, nikic

Pull Request: llvm/llvm-project#137081
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
By calling this function after every link, we introduce quadratic
behavior during full LTO links that primarily affects links involving
large numbers of constant arrays, such as the full LTO part of a
ThinLTO link which will involve large numbers of vtable constants.
Removing this call resulted in a 1.3x speedup in the indexing phase
of a large internal Google binary.

This call was originally introduced to reduce memory consumption during
full LTO (see commit dab999d), but it
doesn't seem to be the case that this helps any more. I ran 3 stage2
links of clang with full LTO and ThinLTO with and without this change
and measured the max RSS. The results were as follows (all in KB):

```
FullLTO before 22362512 22383524 22387456
         after 22383496 22365356 22364352
ThinLTO before  4391404  4478192  4383468
         after  4399220  4363100  4417688
```

As you can see, any max RSS differences are in the noise.

Reviewers: nikic, teresajohnson

Reviewed By: teresajohnson, nikic

Pull Request: llvm#137081
pcc added a commit that referenced this pull request Nov 12, 2025
A full LTO link time performance and memory regression was introduced
by #137081 in cases where the modules contain large quantities of llvm.used
globals. This was unnoticed because it was not expected that this would
be a typical case, but this is exactly what coverage collection does,
and when this feature is enabled together with full LTO we end up with
quadratic memory consumption (from the unused constants) and quadratic
complexity in the function Verifier::visitGlobalValue (which visits all
the unused constants in the use list of each global value). This is a
targeted fix that avoids reintroducing the quadratic complexity from
before #137081, by having ValueMapper delete the old initializer of an
appending global if it is unused, instead of visiting every global in
the context after every link.

The repro-cfi-64 reproducer from #167037 before and after this change:

```
        Elapsed time   Max RSS (KB)
Before   12:05.11        52537184
After     3:27.68         7520696
```

Fixes #167037.

Reviewers: nikic, teresajohnson

Reviewed By: teresajohnson

Pull Request: #167629
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

llvm:ir LTO Link time optimization (regular/full LTO or ThinLTO)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants