Skip to content

Conversation

@ellishg
Copy link
Contributor

@ellishg ellishg commented Feb 8, 2025

ICF runs before BPSectionOrderer. When a section is ICF'ed, it seems that the original sections are marked as not live, but are still kept around. Prior to this patch, those ICF'ed sections would be passed to BP and ordered before being skipped when writing the output. Now, these sections are no longer passed to BP, saving runtime and possibly improving BP's output.

In a large binary, I found that the number of sections ordered using BP decreased, while the number of duplicate sections drastically decreased as expected.

Functions for startup: 50755 -> 50520
Functions for compression: 165734 -> 105328
Duplicate functions: 1827231 -> 55230

@ellishg
Copy link
Contributor Author

ellishg commented Feb 8, 2025

CC @Colibrow

@llvmbot
Copy link
Member

llvmbot commented Feb 8, 2025

@llvm/pr-subscribers-lld
@llvm/pr-subscribers-lld-elf

@llvm/pr-subscribers-lld-macho

Author: Ellis Hoag (ellishg)

Changes

ICF runs before BPSectionOrderer. When a section is ICF'ed, it seems that the original sections are marked as not live, but are still kept around. Prior to this patch, those ICF'ed sections would be passed to BP and ordered before being skipped when writing the output. Now, these sections are no longer passed to BP, saving runtime and possibly improving BP's output.

In a large binary, I found that the number of sections ordered using BP decreased, while the number of duplicate sections drastically decreased as expected.

Functions for startup: 50755 -> 50520
Functions for compression: 165734 -> 105328
Duplicate functions: 1827231 -> 55230

Full diff: https://github.com/llvm/llvm-project/pull/126327.diff

3 Files Affected:

  • (modified) lld/MachO/BPSectionOrderer.cpp (+2)
  • (modified) lld/test/ELF/bp-section-orderer.s (+56-10)
  • (modified) lld/test/MachO/bp-section-orderer.s (+10-2)
diff --git a/lld/MachO/BPSectionOrderer.cpp b/lld/MachO/BPSectionOrderer.cpp
index 689afd67712a417..f693e5e59f8faea 100644
--- a/lld/MachO/BPSectionOrderer.cpp
+++ b/lld/MachO/BPSectionOrderer.cpp
@@ -117,6 +117,8 @@ DenseMap<const InputSection *, int> lld::macho::runBalancedPartitioning(
         auto *isec = subsec.isec;
         if (!isec || isec->data.empty())
           continue;
+        if (isa<ConcatInputSection>(isec) && !isec->isLive(0))
+          continue;
         size_t idx = sections.size();
         sections.emplace_back(isec);
         for (auto *sym : BPOrdererMachO::getSymbols(*isec)) {
diff --git a/lld/test/ELF/bp-section-orderer.s b/lld/test/ELF/bp-section-orderer.s
index 2e18107c02ca306..6a61e7e8cef2b8d 100644
--- a/lld/test/ELF/bp-section-orderer.s
+++ b/lld/test/ELF/bp-section-orderer.s
@@ -1,3 +1,4 @@
+# NOTE: Code has been autogenerated by utils/update_test_body.py
 # REQUIRES: aarch64
 # RUN: rm -rf %t && split-file %s %t && cd %t
 
@@ -25,30 +26,33 @@
 
 # RUN: ld.lld -o out.s a.o --irpgo-profile=a.profdata --bp-startup-sort=function
 # RUN: llvm-nm -jn out.s | tr '\n' , | FileCheck %s --check-prefix=STARTUP
-# STARTUP: s5,s4,s3,s2,s1,A,B,C,F,E,D,_start,d4,d3,d2,d1,{{$}}
+# STARTUP: s5,s4,s3,s2,s1,A,B,C,F,E,D,merged1,merged2,_start,d4,d3,d2,d1,{{$}}
 
 # RUN: ld.lld -o out.os a.o --irpgo-profile=a.profdata --bp-startup-sort=function --symbol-ordering-file a.txt
 # RUN: llvm-nm -jn out.os | tr '\n' , | FileCheck %s --check-prefix=ORDER-STARTUP
-# ORDER-STARTUP: s2,s1,s5,s4,s3,A,F,E,D,B,C,_start,d3,d2,d4,d1,{{$}}
+# ORDER-STARTUP: s2,s1,s5,s4,s3,A,F,E,D,B,C,merged1,merged2,_start,d3,d2,d4,d1,{{$}}
 
 # RUN: ld.lld -o out.cf a.o --verbose-bp-section-orderer --bp-compression-sort=function 2>&1 | FileCheck %s --check-prefix=BP-COMPRESSION-FUNC
+# RUN: ld.lld -o out.cf.icf a.o --verbose-bp-section-orderer --bp-compression-sort=function --icf=all 2>&1 | FileCheck %s --check-prefix=BP-COMPRESSION-ICF-FUNC
 # RUN: llvm-nm -jn out.cf | tr '\n' , | FileCheck %s --check-prefix=CFUNC
-# CFUNC: s5,s4,s3,s2,s1,F,C,E,D,B,A,_start,d4,d3,d2,d1,{{$}}
+# CFUNC: s5,s4,s3,s2,s1,A,F,merged1,merged2,C,E,D,B,_start,d4,d3,d2,d1,{{$}}
 
 # RUN: ld.lld -o out.cd a.o --verbose-bp-section-orderer --bp-compression-sort=data 2>&1 | FileCheck %s --check-prefix=BP-COMPRESSION-DATA
 # RUN: llvm-nm -jn out.cd | tr '\n' , | FileCheck %s --check-prefix=CDATA
-# CDATA: s5,s3,s4,s2,s1,F,C,E,D,B,A,_start,d4,d1,d3,d2,{{$}}
+# CDATA: s5,s3,s4,s2,s1,F,C,E,D,B,A,merged1,merged2,_start,d4,d1,d3,d2,{{$}}
 
 # RUN: ld.lld -o out.cb a.o --verbose-bp-section-orderer --bp-compression-sort=both 2>&1 | FileCheck %s --check-prefix=BP-COMPRESSION-BOTH
-# RUN: llvm-nm -jn out.cb | tr '\n' , | FileCheck %s --check-prefix=CDATA
+# RUN: llvm-nm -jn out.cb | tr '\n' , | FileCheck %s --check-prefix=CBOTH
+# CBOTH: s5,s3,s4,s2,s1,A,F,merged1,merged2,C,E,D,B,_start,d4,d1,d3,d2,{{$}}
 
 # RUN: ld.lld -o out.cbs a.o --verbose-bp-section-orderer --bp-compression-sort=both --irpgo-profile=a.profdata --bp-startup-sort=function 2>&1 | FileCheck %s --check-prefix=BP-COMPRESSION-BOTH
 # RUN: llvm-nm -jn out.cbs | tr '\n' , | FileCheck %s --check-prefix=CBOTH-STARTUP
-# CBOTH-STARTUP: s5,s3,s4,s2,s1,A,B,C,F,E,D,_start,d4,d1,d3,d2,{{$}}
+# CBOTH-STARTUP: s5,s3,s4,s2,s1,A,B,C,F,E,D,merged1,merged2,_start,d4,d1,d3,d2,{{$}}
 
-# BP-COMPRESSION-FUNC: Ordered 7 sections using balanced partitioning
+# BP-COMPRESSION-FUNC: Ordered 9 sections using balanced partitioning
+# BP-COMPRESSION-ICF-FUNC: Ordered 8 sections using balanced partitioning
 # BP-COMPRESSION-DATA: Ordered 9 sections using balanced partitioning
-# BP-COMPRESSION-BOTH: Ordered 16 sections using balanced partitioning
+# BP-COMPRESSION-BOTH: Ordered 18 sections using balanced partitioning
 
 #--- a.proftext
 :ir
@@ -125,6 +129,9 @@ int C(int a) { A(); return a + 2; }
 int B(int a) { A(); return a + 1; }
 void A() {}
 
+int merged1(int a) { return F(a + 101); }
+int merged2(int a) { return F(a + 101); }
+
 int _start() { return 0; }
 
 #--- gen
@@ -236,6 +243,44 @@ A:                                      // @A
 .Lfunc_end5:
 	.size	A, .Lfunc_end5-A
                                         // -- End function
+	.section	.text.merged1,"ax",@progbits
+	.globl	merged1                         // -- Begin function merged1
+	.p2align	2
+	.type	merged1,@function
+merged1:                                // @merged1
+// %bb.0:                               // %entry
+	sub	sp, sp, #32
+	stp	x29, x30, [sp, #16]             // 16-byte Folded Spill
+	add	x29, sp, #16
+	stur	w0, [x29, #-4]
+	ldur	w8, [x29, #-4]
+	add	w0, w8, #101
+	bl	F
+	ldp	x29, x30, [sp, #16]             // 16-byte Folded Reload
+	add	sp, sp, #32
+	ret
+.Lfunc_end6:
+	.size	merged1, .Lfunc_end6-merged1
+                                        // -- End function
+	.section	.text.merged2,"ax",@progbits
+	.globl	merged2                         // -- Begin function merged2
+	.p2align	2
+	.type	merged2,@function
+merged2:                                // @merged2
+// %bb.0:                               // %entry
+	sub	sp, sp, #32
+	stp	x29, x30, [sp, #16]             // 16-byte Folded Spill
+	add	x29, sp, #16
+	stur	w0, [x29, #-4]
+	ldur	w8, [x29, #-4]
+	add	w0, w8, #101
+	bl	F
+	ldp	x29, x30, [sp, #16]             // 16-byte Folded Reload
+	add	sp, sp, #32
+	ret
+.Lfunc_end7:
+	.size	merged2, .Lfunc_end7-merged2
+                                        // -- End function
 	.section	.text._start,"ax",@progbits
 	.globl	_start                          // -- Begin function _start
 	.p2align	2
@@ -244,8 +289,8 @@ _start:                                 // @_start
 // %bb.0:                               // %entry
 	mov	w0, wzr
 	ret
-.Lfunc_end6:
-	.size	_start, .Lfunc_end6-_start
+.Lfunc_end8:
+	.size	_start, .Lfunc_end8-_start
                                         // -- End function
 	.type	s5,@object                      // @s5
 	.section	.rodata.s5,"a",@progbits
@@ -330,6 +375,7 @@ d1:
 
 	.section	".note.GNU-stack","",@progbits
 	.addrsig
+	.addrsig_sym F
 	.addrsig_sym C
 	.addrsig_sym B
 	.addrsig_sym A
diff --git a/lld/test/MachO/bp-section-orderer.s b/lld/test/MachO/bp-section-orderer.s
index 2eaff04bdc04700..f0ac1bd99f7227a 100644
--- a/lld/test/MachO/bp-section-orderer.s
+++ b/lld/test/MachO/bp-section-orderer.s
@@ -42,13 +42,15 @@
 # RUN: %no-fatal-warnings-lld -arch arm64 -lSystem -e _main -o %t/a.out %t/a.o --verbose-bp-section-orderer --compression-sort=both --irpgo-profile-sort=%t/a.profdata 2>&1 | FileCheck %s --check-prefix=COMPRESSION-BOTH
 
 # RUN: %lld -arch arm64 -lSystem -e _main -o %t/a.out %t/a.o --verbose-bp-section-orderer --bp-compression-sort=function 2>&1 | FileCheck %s --check-prefix=COMPRESSION-FUNC
+# RUN: %lld -arch arm64 -lSystem -e _main -o %t/a.out %t/a.o --verbose-bp-section-orderer --bp-compression-sort=function --icf=all 2>&1 | FileCheck %s --check-prefix=COMPRESSION-ICF-FUNC
 # RUN: %lld -arch arm64 -lSystem -e _main -o %t/a.out %t/a.o --verbose-bp-section-orderer --bp-compression-sort=data 2>&1 | FileCheck %s --check-prefix=COMPRESSION-DATA
 # RUN: %lld -arch arm64 -lSystem -e _main -o %t/a.out %t/a.o --verbose-bp-section-orderer --bp-compression-sort=both 2>&1 | FileCheck %s --check-prefix=COMPRESSION-BOTH
 # RUN: %lld -arch arm64 -lSystem -e _main -o %t/a.out %t/a.o --verbose-bp-section-orderer --bp-compression-sort=both --irpgo-profile=%t/a.profdata --bp-startup-sort=function 2>&1 | FileCheck %s --check-prefix=COMPRESSION-BOTH
 
-# COMPRESSION-FUNC: Ordered 7 sections using balanced partitioning
+# COMPRESSION-FUNC: Ordered 9 sections using balanced partitioning
+# COMPRESSION-ICF-FUNC: Ordered 7 sections using balanced partitioning
 # COMPRESSION-DATA: Ordered 7 sections using balanced partitioning
-# COMPRESSION-BOTH: Ordered 14 sections using balanced partitioning
+# COMPRESSION-BOTH: Ordered 16 sections using balanced partitioning
 
 #--- a.s
 .text
@@ -78,6 +80,12 @@ F:
   add w0, w0, #3
   bl l_C.__uniq.111111111111111111111111111111111111111.llvm.2222222222222222222
   ret
+merged1:
+  add w0, w0, #101
+  ret
+merged2:
+  add w0, w0, #101
+  ret
 
 .data
 s1:

@MaskRay
Copy link
Member

MaskRay commented Feb 8, 2025

Thanks for updating the ELF test! If you want to update the ELF code, you could do the following.

--- i/lld/ELF/BPSectionOrderer.cpp
+++ w/lld/ELF/BPSectionOrderer.cpp
@@ -75,8 +75,11 @@ DenseMap<const InputSectionBase *, int> elf::runBalancedPartitioning(
     auto *d = dyn_cast<Defined>(&sym);
     if (!d)
       return;
-    auto *sec = dyn_cast_or_null<InputSectionBase>(d->section);
-    if (!sec || sec->size == 0 || !orderer.secToSym.try_emplace(sec, d).second)
+    auto *sec = dyn_cast_or_null<InputSection>(d->section);
+    // Skip empty, discarded, ICF folded sections. Skipping ICF folded sections
+    // reduces duplicate detection work in BPSectionOrderer.
+    if (!sec || sec->size == 0 || !sec->isLive() || sec->repl != sec ||
+        !orderer.secToSym.try_emplace(sec, d).second)
       return;
     rootSymbolToSectionIdxs[CachedHashStringRef(getRootSymbol(sym.getName()))]
         .insert(sections.size());

In ELF, InputSectionBase can be InputSection (regular sections from .o), EhInputSection (.eh_frame, already merged), MergeInputSection (mergeable string sections like .rodata.str1 .debug.str, similar to Mach-O __TEXT,__cstring; already merged into a MergeSyntheticSection). We only need to handle InputSection.

!sec->isLive() skips unreachable input sections due to --gc-sections (like -dead_strip).

While writing this comment, I realize that the tests probably should be enhanced to test section garbage collection (dead stripping).
Since many functions are actually unreachable from main, you will probably need to add this attribute

// used is to suppress compiler garbage collection in ELF; retain is to suppress linker garbage collection; used is not needed for non-internal linkage symbols

// used is for both compiler/linker GC in Mach-O; retain is ignored for Mach-O

#define RETAIN [[gnu::used,gnu::retain]]

RETAIN void A() {}
RETAIN void B() {}

auto *isec = subsec.isec;
if (!isec || isec->data.empty())
continue;
if (isa<ConcatInputSection>(isec) && !isec->isLive(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the first offset, 0, is dead, is it guaranteed that other offsets are also dead? This might be the case for code, but I'm not sure about data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add a comment here.

// ConcatInputSections are entirely live or dead, so the offset is irrelevant.
bool isLive(uint64_t off) const override { return live; }

I actually don't know if isec could be some other InputSection, but I didn't see any examples in my large test binary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off is for mergeable string and constant literals. It is unneeded for // ConcatInputSections are entirely live or dead, so the offset is irrelevant.

While I haven't verified, I suspect that we only need to pass ConcatInputSection to BP if Mach-O works like ELF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it probably should use ConcatInputSection instead of InputSection. I think I was following the orderfile implementation which uses InputSection. I think this change should be a separate PR, though.

@ellishg
Copy link
Contributor Author

ellishg commented Feb 10, 2025

Thanks for updating the ELF test! If you want to update the ELF code, you could do the following.

--- i/lld/ELF/BPSectionOrderer.cpp
+++ w/lld/ELF/BPSectionOrderer.cpp
@@ -75,8 +75,11 @@ DenseMap<const InputSectionBase *, int> elf::runBalancedPartitioning(
     auto *d = dyn_cast<Defined>(&sym);
     if (!d)
       return;
-    auto *sec = dyn_cast_or_null<InputSectionBase>(d->section);
-    if (!sec || sec->size == 0 || !orderer.secToSym.try_emplace(sec, d).second)
+    auto *sec = dyn_cast_or_null<InputSection>(d->section);
+    // Skip empty, discarded, ICF folded sections. Skipping ICF folded sections
+    // reduces duplicate detection work in BPSectionOrderer.
+    if (!sec || sec->size == 0 || !sec->isLive() || sec->repl != sec ||
+        !orderer.secToSym.try_emplace(sec, d).second)
       return;
     rootSymbolToSectionIdxs[CachedHashStringRef(getRootSymbol(sym.getName()))]
         .insert(sections.size());

In ELF, InputSectionBase can be InputSection (regular sections from .o), EhInputSection (.eh_frame, already merged), MergeInputSection (mergeable string sections like .rodata.str1 .debug.str, similar to Mach-O __TEXT,__cstring; already merged into a MergeSyntheticSection). We only need to handle InputSection.

!sec->isLive() skips unreachable input sections due to --gc-sections (like -dead_strip).

While writing this comment, I realize that the tests probably should be enhanced to test section garbage collection (dead stripping). Since many functions are actually unreachable from main, you will probably need to add this attribute

// used is to suppress compiler garbage collection in ELF; retain is to suppress linker garbage collection; used is not needed for non-internal linkage symbols

// used is for both compiler/linker GC in Mach-O; retain is ignored for Mach-O

#define RETAIN [[gnu::used,gnu::retain]]

RETAIN void A() {}
RETAIN void B() {}

Thanks for the useful info! I had assumed that ELF was skipping dead sections already because the test I added already passes.

@ellishg
Copy link
Contributor Author

ellishg commented Feb 12, 2025

Friendly ping

Copy link
Contributor

@kyulee-com kyulee-com left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ellishg ellishg merged commit 79fff6a into llvm:main Feb 13, 2025
8 checks passed
@ellishg ellishg deleted the bp-icf branch February 13, 2025 16:57
joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
ICF runs before BPSectionOrderer. When a section is ICF'ed, it seems
that the original sections are marked as not live, but are still kept
around. Prior to this patch, those ICF'ed sections would be passed to BP
and ordered before being skipped when writing the output. Now, these
sections are no longer passed to BP, saving runtime and possibly
improving BP's output.

In a large binary, I found that the number of sections ordered using BP
decreased, while the number of duplicate sections drastically decreased
as expected.
```
Functions for startup: 50755 -> 50520
Functions for compression: 165734 -> 105328
Duplicate functions: 1827231 -> 55230
```
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025
ICF runs before BPSectionOrderer. When a section is ICF'ed, it seems
that the original sections are marked as not live, but are still kept
around. Prior to this patch, those ICF'ed sections would be passed to BP
and ordered before being skipped when writing the output. Now, these
sections are no longer passed to BP, saving runtime and possibly
improving BP's output.

In a large binary, I found that the number of sections ordered using BP
decreased, while the number of duplicate sections drastically decreased
as expected.
```
Functions for startup: 50755 -> 50520
Functions for compression: 165734 -> 105328
Duplicate functions: 1827231 -> 55230
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants