Commit 36b7d3b

haampie, oscardssmith, and LilithHafner authored
Add PGO+LTO Makefile (JuliaLang#45641)
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

<details>
<summary>* Output looks roughly as follows</summary>

```c++
$ make -C contrib/pgo-lto top
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR  entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts:
  llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
  LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
  llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
  llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
  llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
  llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
  std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
  llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
  llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
  ijl_method_instance_add_backedge, max count = 349608221
  llvm::SUnit::ComputeHeight(), max count = 336604330
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
  llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
  llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
  LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
  /dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
  ijl_get_pgcstack, max count = 216953592
  LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
  /dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
  /dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603
```

</details>

This quite often yields spectacular speedups in time to first X, since it cuts the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

```julia
using LoopVectorization
function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end
f!(rand(1), rand(1))
```

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   5.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)  Combine redundant instructions #2
```

With PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions #2
```

That is 28% less time spent in LLVM. `perf` reports show this comes mostly from fewer instructions and a reduction in icache misses.

---

Finally, there's a significant reduction in binary sizes. For libLLVM.so:

```
79M	usr/lib/libLLVM-13jl.so (before)
67M	usr/lib/libLLVM-13jl.so (after)
```

It can be reduced by another 2MB with `--icf=safe`, given that LLD is used as the linker anyway.

- [x] Two out-of-source builds would be better than a single in-source build, so that it's easier to find good profile data

---------

Co-authored-by: Oscar Smith <[email protected]>
Co-authored-by: Lilith Orion Hafner <[email protected]>
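
The percentages quoted in the commit message follow directly from the before/after timings. The sketch below recomputes them with `awk`. Note that `pct_change` is a hypothetical helper introduced only for illustration and is not part of the commit.

```shell
#!/bin/sh
# pct_change BEFORE AFTER: print the relative change in percent,
# rounded to the nearest integer. Hypothetical helper, not from the commit.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f%%\n", (after - before) / before * 100 }'
}

pct_change 14.801 11.978     # Example 1: wall time in seconds
pct_change 107.688 95.704    # Example 2: 1m47.688s vs 1m35.704s, in seconds
pct_change 101.0130 72.6507  # Example 3: LLVM total execution time
```

Run as written, this prints `-19%`, `-11%` and `-28%`, matching the figures above.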
1 parent 27b31d1 commit 36b7d3b

2 files changed: +84 -0 lines changed

contrib/pgo-lto/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+profiles
+stage0*
+stage1*
+stage2*
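
The trailing `*` in these patterns matters because each stage leaves two artifacts next to the Makefile: a stamp file (`stage1`) and a build directory (`stage1.build`). As a rough illustration, the shell `case` statement below applies the same globs to a single file name. This is only an approximation of gitignore matching, and `is_ignored` is a hypothetical helper.

```shell
#!/bin/sh
# is_ignored NAME: succeed if NAME matches one of the four .gitignore patterns.
# Approximation only: real gitignore matching has additional rules.
is_ignored() {
    case "$1" in
        profiles|stage0*|stage1*|stage2*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in stage1 stage1.build profiles Makefile; do
    if is_ignored "$name"; then
        echo "$name: ignored"
    else
        echo "$name: tracked"
    fi
done
```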

contrib/pgo-lto/Makefile

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+.PHONY: top clean clean-profiles
+
+STAGE0_BUILD:=$(CURDIR)/stage0.build
+STAGE1_BUILD:=$(CURDIR)/stage1.build
+STAGE2_BUILD:=$(CURDIR)/stage2.build
+
+STAGE0_TOOLS:=$(STAGE0_BUILD)/usr/tools/
+
+PROFILE_DIR:=$(CURDIR)/profiles
+PROFILE_FILE:=$(PROFILE_DIR)/merged.prof
+PROFRAW_FILES:=$(wildcard $(PROFILE_DIR)/*.profraw)
+JULIA_ROOT:=$(CURDIR)/../..
+
+LLVM_CXXFILT:=$(STAGE0_TOOLS)llvm-cxxfilt
+LLVM_PROFDATA:=$(STAGE0_TOOLS)llvm-profdata
+LLVM_OBJCOPY:=$(STAGE0_TOOLS)llvm-objcopy
+
+# When building a single libLLVM.so we need to increase -vp-counters-per-site
+# significantly
+COUNTERS_PER_SITE:=6
+
+AFTER_STAGE1_MESSAGE:='Run `make clean-profiles` to start with a clean slate. $\
+Then run Julia to collect realistic profile data, for example: `$(STAGE1_BUILD)/julia -O3 -e $\
+'\''using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'\''`. This $\
+should produce about 15MB of data in $(PROFILE_DIR). Note that running extensive $\
+scripts may result in counter overflows, which can be detected by running $\
+`make top`. Afterwards run `make stage2`.'
+
+TOOLCHAIN_FLAGS = $\
+  "CC=$(STAGE0_TOOLS)clang" $\
+  "CXX=$(STAGE0_TOOLS)clang++" $\
+  "LD=$(STAGE0_TOOLS)ld.lld" $\
+  "AR=$(STAGE0_TOOLS)llvm-ar" $\
+  "RANLIB=$(STAGE0_TOOLS)llvm-ranlib" $\
+  "CFLAGS+=$(PGO_CFLAGS)" $\
+  "CXXFLAGS+=$(PGO_CXXFLAGS)" $\
+  "LDFLAGS+=$(PGO_LDFLAGS)"
+
+$(STAGE0_BUILD) $(STAGE1_BUILD) $(STAGE2_BUILD):
+	$(MAKE) -C $(JULIA_ROOT) O=$@ configure
+
+stage0: export USE_BINARYBUILDER_LLVM=1
+stage0: | $(STAGE0_BUILD)
+	# Turn [cd]tors into init/fini_array sections in libclang_rt, since lld
+	# doesn't do that, and otherwise the profile constructor is not executed
+	$(MAKE) -C $(STAGE0_BUILD)/deps install-clang install-llvm install-lld install-llvm-tools && \
+	find $< -name 'libclang_rt.profile-*.a' -exec $(LLVM_OBJCOPY) --rename-section .ctors=.init_array --rename-section .dtors=.fini_array {} + && \
+	touch $@
+
+$(STAGE1_BUILD): stage0
+stage1: PGO_CFLAGS:=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
+stage1: PGO_CXXFLAGS:=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
+stage1: PGO_LDFLAGS:=-fuse-ld=lld -flto=thin -fprofile-generate=$(PROFILE_DIR)
+stage1: export USE_BINARYBUILDER_LLVM=0
+stage1: | $(STAGE1_BUILD)
+	$(MAKE) -C $(STAGE1_BUILD) $(TOOLCHAIN_FLAGS) && touch $@
+	@echo $(AFTER_STAGE1_MESSAGE)
+
+stage2: PGO_CFLAGS:=-fprofile-use=$(PROFILE_FILE)
+stage2: PGO_CXXFLAGS:=-fprofile-use=$(PROFILE_FILE)
+stage2: PGO_LDFLAGS:=-fuse-ld=lld -flto=thin -fprofile-use=$(PROFILE_FILE) -Wl,--icf=safe
+stage2: export USE_BINARYBUILDER_LLVM=0
+stage2: $(PROFILE_FILE) | $(STAGE2_BUILD)
+	$(MAKE) -C $(STAGE2_BUILD) $(TOOLCHAIN_FLAGS) && touch $@
+
+install: stage2
+	$(MAKE) -C $(STAGE2_BUILD) USE_BINARYBUILDER_LLVM=0 install
+
+$(PROFILE_FILE): stage1 $(PROFRAW_FILES)
+	$(LLVM_PROFDATA) merge -output=$@ $(PROFRAW_FILES)
+
+# show top 50 functions
+top: $(PROFILE_FILE)
+	$(LLVM_PROFDATA) show --topn=50 $< | $(LLVM_CXXFILT)
+
+clean-profiles:
+	rm -rf $(PROFILE_DIR)
+
+clean:
+	rm -f stage0 stage1 stage2 $(PROFILE_FILE)
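
`PROFRAW_FILES` is assigned with `:=` and `$(wildcard ...)`, so the list of `.profraw` files is fixed when make reads the Makefile. If the stage1 Julia has not been run yet, the list is empty and the merge rule would invoke `llvm-profdata merge` without any inputs. A small pre-flight check before `make stage2` can catch this early. The sketch below assumes a POSIX shell, and `check_profiles` is a hypothetical helper that is not part of the commit.

```shell
#!/bin/sh
# check_profiles DIR: succeed if DIR contains at least one *.profraw file.
# Hypothetical helper; DIR plays the role of PROFILE_DIR in the Makefile.
check_profiles() {
    dir=$1
    count=0
    for f in "$dir"/*.profraw; do
        # When nothing matches, the glob stays unexpanded; -e filters it out.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "no .profraw files in $dir; run the stage1 julia first" >&2
        return 1
    fi
    echo "$count raw profile(s) in $dir"
}
```

From within `contrib/pgo-lto`, this could be used as `check_profiles profiles && make stage2`.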
