Description:
BOLT crashes while optimizing the libvulkan_radeon.so library (part of the Mesa Radeon Vulkan driver) when the --lite=false flag is used. The crash occurs during the basic block reordering phase (ExtTSPReorderAlgorithm), where the x86 code emitter reports a fatal error: "Cannot encode high byte register in REX-prefixed instruction".
Environment:
- Operating System: CachyOS
- LLVM/BOLT Version: Recent snapshot of LLVM-git (3def49c)
- Host Compiler: GCC 14.2.1
- Target Architecture: x86-64
- Binary: libvulkan_radeon.so (from Mesa Radeon Vulkan driver)
- BOLT command (part of a larger build script):
llvm-bolt "$file" --data "${srcdir}/bolt_profile/vkcube.perf.data" --dyno-stats --lite=false --cu-processing-batch-size=64 --eliminate-unreachable --frame-opt=all --icf=all --jump-tables=aggressive --min-branch-clusters --stoke --sctc-mode=always --plt=all --hot-data --hot-text --frame-opt-rm-stores --peepholes=all --infer-stale-profile=1 --x86-strip-redundant-address-size --indirect-call-promotion=all --reg-reassign --use-aggr-reg-reassign --reorder-blocks=ext-tsp --reorder-functions=cdsort --split-all-cold --split-eh --split-functions --split-strategy=cdsplit --skip-funcs=.text/1 -o "_build64/bolt/$(basename "$file")"
Steps to Reproduce:
- Build Mesa with the Radeon Vulkan driver enabled.
- Collect profiling data using perf while running a Vulkan application (e.g., vkcube).
- Attempt to optimize libvulkan_radeon.so using llvm-bolt with the command-line arguments specified above, particularly including --lite=false.
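Because the command above combines many optimization flags, it may help to reduce it to the smallest set that still reproduces the failure. The following is a minimal sketch of a greedy one-flag-at-a-time reduction. The helper names (minimize_flags, bolt_fails) and the use of /dev/null as the output path are assumptions made for illustration; they are not part of BOLT or of the build script.

```python
import subprocess
from typing import Callable, List, Sequence

# Error text copied from the crash reported in this issue.
ERROR_TEXT = "Cannot encode high byte register in REX-prefixed instruction"


def minimize_flags(flags: Sequence[str],
                   still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop flags one at a time while the failure still reproduces.

    `still_fails` is any predicate that returns True when the given flag
    list triggers the failure. The result keeps the original flag order.
    """
    kept = list(flags)
    for flag in list(flags):
        trial = [f for f in kept if f != flag]
        if still_fails(trial):
            kept = trial  # the flag was not needed to reproduce the failure
    return kept


def bolt_fails(binary: str, profile: str, flags: List[str],
               output: str = "/dev/null") -> bool:
    """Hypothetical predicate: run llvm-bolt once and look for the error text.

    Assumes llvm-bolt is on PATH and that discarding the output is acceptable.
    """
    cmd = ["llvm-bolt", binary, "--data", profile, "-o", output, *flags]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return ERROR_TEXT in result.stderr


if __name__ == "__main__":
    # Example wiring; the paths here are placeholders, not the reporter's paths.
    all_flags = ["--lite=false", "--reg-reassign", "--reorder-blocks=ext-tsp"]
    reduced = minimize_flags(
        all_flags,
        lambda fl: bolt_fails("libvulkan_radeon.so", "vkcube.perf.data", fl),
    )
    print("Reduced flag set:", " ".join(reduced))
```

A single greedy pass is not guaranteed to find the globally smallest set, but it usually shortens the command enough to make the report easier to triage.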
Expected Behavior:
llvm-bolt should successfully optimize libvulkan_radeon.so without crashing. This does work in lite mode.
Actual Behavior:
llvm-bolt crashes with the following error message:
LLVM ERROR: Cannot encode high byte register in REX-prefixed instruction
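For context on the message itself: in x86-64, the legacy high-byte registers (AH, BH, CH, DH) can only be encoded when the instruction carries no REX prefix, while R8 to R15, the byte registers SPL/BPL/SIL/DIL, and 64-bit operand size all require one. An instruction that needs both cannot be encoded. The sketch below models only that rule; it is an illustration with hypothetical helper names, not the logic of X86MCCodeEmitter, and it does not identify which BOLT pass produced the offending instruction.

```python
# Legacy high-byte registers: encodable only when no REX prefix is present.
HIGH_BYTE = {"ah", "bh", "ch", "dh"}
# Byte registers that are only reachable with a REX prefix.
REX_ONLY_BYTE = {"spl", "bpl", "sil", "dil"}


def needs_rex(reg: str) -> bool:
    """Return True if using this register forces a REX prefix."""
    reg = reg.lower()
    if reg in REX_ONLY_BYTE:
        return True
    # r8..r15 and their sub-registers (r8d, r8w, r8b, ...).
    if reg.startswith("r"):
        number = reg[1:].rstrip("bdw")
        if number.isdigit():
            return 8 <= int(number) <= 15
    return False


def can_encode(regs, rex_w: bool = False) -> bool:
    """Check the high-byte/REX conflict for one instruction's registers.

    `rex_w` marks an instruction that needs REX.W (64-bit operand size).
    """
    regs = [r.lower() for r in regs]
    has_high_byte = any(r in HIGH_BYTE for r in regs)
    has_rex = rex_w or any(needs_rex(r) for r in regs)
    return not (has_high_byte and has_rex)


if __name__ == "__main__":
    print(can_encode(["ah", "bl"]))   # no REX needed, so AH is fine
    print(can_encode(["ah", "r8b"]))  # R8B forces REX, which conflicts with AH
```

Under this rule, any transformation that introduces an extended register into an instruction that also uses a high-byte register yields an unencodable instruction.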
Stack Trace:
LLVM ERROR: Cannot encode high byte register in REX-prefixed instruction
#0 0x0000640fccb7a9c5 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) Signals.cpp:0:0
#1 0x0000640fccb7adbc SignalHandler(int) Signals.cpp:0:0
#2 0x0000766b28647530 (/usr/lib/libc.so.6+0x47530)
#3 0x0000766b286b6ccd pthread_kill (/usr/lib/libc.so.6+0xb6ccd)
#4 0x0000766b28647472 raise (/usr/lib/libc.so.6+0x47472)
#5 0x0000766b286244a3 abort (/usr/lib/libc.so.6+0x244a3)
#6 0x0000640fccb56622 llvm::report_fatal_error(llvm::Twine const&, bool) (/home/marcus/llvm20/bin/llvm-bolt+0x3756622)
#7 0x0000640fcb64ae45 (/home/marcus/llvm20/bin/llvm-bolt+0x224ae45)
#8 0x0000640fcbb100d8 (anonymous namespace)::X86MCCodeEmitter::emitPrefixImpl(unsigned int&, llvm::MCInst const&, llvm::MCSubtargetInfo const&, llvm::SmallVectorImpl<char>&) const X86MCCodeEmitter.cpp:0:0
#9 0x0000640fcbb0c323 (anonymous namespace)::X86MCCodeEmitter::encodeInstruction(llvm::MCInst const&, llvm::SmallVectorImpl<char>&, llvm::SmallVectorImpl<llvm::MCFixup>&, llvm::MCSubtargetInfo const&) const (.4c1216ec09f139b7b42615b3f5e2d4e3) X86MCCodeEmitter.cpp:0:0
#10 0x0000640fcd0ef6ed llvm::bolt::BinaryBasicBlock::estimateSize(llvm::MCCodeEmitter const*) const (/home/marcus/llvm20/bin/llvm-bolt+0x3cef6ed)
#11 0x0000640fcd0695e6 llvm::bolt::ExtTSPReorderAlgorithm::reorderBasicBlocks(llvm::bolt::BinaryFunction&, llvm::SmallVector<llvm::bolt::BinaryBasicBlock*, 0u>&) const (/home/marcus/llvm20/bin/llvm-bolt+0x3c695e6)
#12 0x0000640fccfe07fe std::_Function_handler<void (llvm::bolt::BinaryFunction&), llvm::bolt::ReorderBasicBlocks::runOnFunctions(llvm::bolt::BinaryContext&)::$_0>::_M_invoke(std::_Any_data const&, llvm::bolt::BinaryFunction&) BinaryPasses.cpp:0:0
#13 0x0000640fcd1905ca std::_Function_handler<void (), std::_Bind<llvm::bolt::ParallelUtilities::runOnEachFunction(llvm::bolt::BinaryContext&, llvm::bolt::ParallelUtilities::SchedulingPolicy, std::function<void (llvm::bolt::BinaryFunction&)>, std::function<bool (llvm::bolt::BinaryFunction const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, bool, unsigned int)::$_0 (std::_Rb_tree_iterator<std::pair<unsigned long const, llvm::bolt::BinaryFunction>>, std::_Rb_tree_iterator<std::pair<unsigned long const, llvm::bolt::BinaryFunction>>)>>::_M_invoke(std::_Any_data const&) ParallelUtilities.cpp:0:0
#14 0x0000640fccca6578 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<std::function<void ()>>>, void>>::_M_invoke(std::_Any_data const&) DWARFRewriter.cpp:0:0
#15 0x0000640fccc03da1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) JITLinkLinker.cpp:0:0
#16 0x0000766b286ba3f7 (/usr/lib/libc.so.6+0xba3f7)
#17 0x0000766b286ba479 __pthread_once (/usr/lib/libc.so.6+0xba479)
#18 0x0000640fcbedba60 void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) JITLinkLinker.cpp:0:0
#19 0x0000640fcbedb9b1 std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) JITLinkLinker.cpp:0:0
#20 0x0000640fcbedb93a std::__future_base::_Deferred_state<std::thread::_Invoker<std::tuple<std::function<void ()>>>, void>::_M_complete_async() DWARFRewriter.cpp:0:0
#21 0x0000640fcbec8e6d std::__future_base::_State_baseV2::wait() DWARFRewriter.cpp:0:0
#22 0x0000640fcbedbd76 llvm::StdThreadPool::processTasks(llvm::ThreadPoolTaskGroup*) ThreadPool.cpp:0:0
#23 0x0000640fcbedbb03 void* llvm::thread::ThreadProxy<std::tuple<llvm::StdThreadPool::grow(int)::$_0>>(void*) ThreadPool.cpp:0:0
#24 0x0000766b286b4c7d (/usr/lib/libc.so.6+0xb4c7d)
#25 0x0000766b28760898 (/usr/lib/libc.so.6+0x160898)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Additional Information:
The following output from llvm-bolt provides further context about the optimization process and the characteristics of the profiled binary:
DRIVER | LAYER: Using "AMD Radeon RX Vega (RADV VEGA10)" with driver: "/tmp/makepkg/mesa-tkg-git/src/_build64/src/amd/vulkan/libvulkan_radeon.so"
Profiling completed for vkcube.
[ perf record: Woken up 324 times to write data ]
[ perf record: Captured and wrote 83,448 MB perf.data (121700 samples) ]
Perf data saved to /tmp/makepkg/mesa-tkg-git/src/bolt_profile/vkcube.perf.data
Optimizing /tmp/makepkg/mesa-tkg-git/src/_build64/src/amd/vulkan/libvulkan_radeon.so with BOLT...
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for /tmp/makepkg/mesa-tkg-git/src/bolt_profile/vkcube.perf.data
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3def49cb64ec1298290724081bd37dbdeb2ea5f8
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x1600000, offset 0x1600000
BOLT-INFO: enabling relocation mode
BOLT-WARNING: ignoring symbol __ehdr_start at 0x0, which lies outside .note.gnu.property
BOLT-WARNING: ignoring symbol __executable_start at 0x0, which lies outside .note.gnu.property
BOLT-WARNING: Failed to analyze 32 relocations
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-INFO: binary build-id is: 8e5ea2cf1ae37e0205ba1ecfbe7e250686d83aa8
PERF2BOLT: spawning perf job to read buildid list
PERF2BOLT: matched build-id and file name
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 121700 samples and 3360724 LBR entries
PERF2BOLT: 0 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 2055 (0.1%)
PERF2BOLT: out of range traces involving unknown regions: 1665720 (51.4%)
PERF2BOLT: waiting for perf mem events collection to finish...
PERF2BOLT: processing branch events...
BOLT-INFO: 1062 out of 6973 functions in the binary (15.2%) have non-empty execution profile
BOLT-INFO: 31 functions with profile could not be optimized
BOLT-INFO: among the hottest 1000 functions top 5% function CFG discontinuity is 100.00%
BOLT-INFO: validate-mem-refs updated 1 object references
BOLT-INFO: 145840 instructions were shortened
BOLT-INFO: removed 20771 empty blocks
BOLT-INFO: merged 4 duplicate CFG edges
BOLT-INFO: ICF folded 314 out of 7631 functions in 5 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 46.51 KB of code space. Folded functions were called 76 times based on profile.
BOLT-INFO: ICP Total indirect calls = 70948, 33 callsites cover 99% of all indirect calls
BOLT-INFO: ICP total indirect callsites with profile = 37
BOLT-INFO: ICP total jump table callsites = 5
BOLT-INFO: ICP total number of calls = 144434
BOLT-INFO: ICP percentage of calls that are indirect = 48.7%
BOLT-INFO: ICP percentage of indirect calls that can be optimized = 100.0%
BOLT-INFO: ICP percentage of indirect callsites that are optimized = 89.2%
BOLT-INFO: ICP number of method load elimination candidates = 0
BOLT-INFO: ICP percentage of method calls candidates that have loads eliminated = 0.0%
BOLT-INFO: ICP percentage of indirect branches that are optimized = 0.0%
BOLT-INFO: ICP percentage of jump table callsites that are optimized = 0.0%
BOLT-INFO: ICP number of jump table callsites that can use hot indices = 0
BOLT-INFO: ICP percentage of jump table callsites that use hot indices = 0.0%
BOLT-INFO: Reg Reassignment Pass Stats:
162 functions affected.
3631 static bytes saved.
380470 dynamic bytes saved.
BOLT-INFO: 13835 PLT calls in the binary were optimized.