TL;DR
We applied BOLT with an instrumentation profile to a bootstrapped AArch64 Clang build and observed a consistent performance regression (~260 s → ~280 s, ≈7.7%) despite very positive dyno stats (e.g., taken branches −66.4%). Root cause: the instrumentation profile covered only 1.7% of the functions in the binary, so BOLT produced strong local improvements but made globally harmful layout decisions.
Environment
- OS: Ubuntu 22.04
- Arch: AArch64
- LLVM/BOLT: 22
- Binary: bootstrapped Release Clang (single monolithic clang binary used to bootstrap)
- Workload: full `ninja clang` build (the instrumented compiler was used to drive profile collection)
Repro summary
- Build baseline clang (Release).
- Instrument the baseline with `llvm-bolt -instrument` → produce an instrumented clang.
- Use the instrumented clang to run a full `ninja clang` build that generates `.fdata`.
- Run `llvm-bolt` on the baseline clang with the generated `.fdata`.
- Measure end-to-end `ninja clang` runtime (or the same benchmark used for the baseline); the full command sequence is sketched below.
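For concreteness, the sequence looks roughly like this. Paths and directory names are illustrative, and the optimization flags in step 4 are the standard set from the BOLT README; they may not match our exact invocation.

```bash
# 1. Build baseline clang (Release).
ninja -C build clang

# 2. Instrument the baseline clang with BOLT.
llvm-bolt build/bin/clang -instrument -o build/bin/clang.inst \
    --instrumentation-file=/tmp/clang.fdata

# 3. Run a full `ninja clang` build with clang.inst as the host compiler
#    (e.g. by pointing CMAKE_C_COMPILER/CMAKE_CXX_COMPILER at it) so that
#    the instrumented compiler writes the .fdata profile.

# 4. Optimize the baseline clang with the collected profile.
llvm-bolt build/bin/clang -o build/bin/clang.bolt -data=/tmp/clang.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -split-eh -dyno-stats

# 5. Measure end-to-end `ninja clang` time with clang.bolt vs the baseline clang.
```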
Observed result
- Baseline Clang: ~260 s
- BOLT-optimized clang.bolt: ~280 s (regression ≈ 7.7%)
BOLT reported excellent dyno stats (e.g., taken branches −66.4%), but the profile coverage was tiny:
```
BOLT-INFO: 2376 out of 142805 functions in the binary (1.7%) have non-empty execution profile
```
Questions
- Has anyone seen similar regressions caused by sparse instrumentation profiles (especially on AArch64)?
- Would it be useful for BOLT to warn when function coverage is below a threshold (e.g., <5%) before large global reorders?
- What are the recommended best practices for generating robust instrumentation profiles from multi-process builds (ninja)? Any scripts/recipes you can share? (One possible recipe is sketched below.)
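To make the last question concrete, here is a minimal sketch of the kind of recipe we have in mind, assuming BOLT's `--instrumentation-file-append-pid` option and the `merge-fdata` tool; all paths are illustrative.

```bash
# Write one .fdata per compiler process so parallel ninja jobs don't race
# on a single profile file.
mkdir -p /tmp/bolt-prof
llvm-bolt bin/clang -instrument -o bin/clang.inst \
    --instrumentation-file=/tmp/bolt-prof/clang.fdata \
    --instrumentation-file-append-pid

# ... run the full `ninja clang` build with bin/clang.inst as the compiler ...

# Merge the per-PID profiles into a single file for llvm-bolt -data=.
merge-fdata /tmp/bolt-prof/clang.fdata.* > /tmp/bolt-prof/merged.fdata

# Sanity-check coverage before trusting the dyno stats.
llvm-bolt bin/clang -o bin/clang.bolt -data=/tmp/bolt-prof/merged.fdata -dyno-stats 2>&1 \
    | grep 'have non-empty execution profile'
```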