Description
The LPF mpimsg engine currently fails the post-install checks on Ubuntu 22.04 for several reasons:
- the initialization routine breaks
- the post-install debug checks hang
- the detection of MPI with Clang fails
This issue tracks these problems. I pushed workarounds for them to the branch associated with this issue, but some of them deserve more thought than I gave them.
The following paragraphs describe each problem together with its current workaround.
1. the initialization routine breaks
The mpimsg engine is initialized by the routine mpi_initializer in src/MPI/init.cpp, which expects int argc, char ** argv as parameters and forwards them to MPI_Init_thread(). mpi_initializer is invoked at load time, when the library is pulled in via LD_PRELOAD. However, initializing the stack with argc/argv for such a routine is a non-standard, undocumented feature of the Linux dynamic linker, probably removed in recent versions: the values received are garbage, so the related assertions may fail or any access to argv results in a segfault.
Current solution: do not use argc/argv, the initialization routine now takes no inputs.
Pros: problem solved in a robust way, no need to re-think the solution.
Cons: cannot pass implementation-specific parameters to MPI initialization (not used in practice)
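As a rough illustration of the no-argument approach, the sketch below shows a load-time initializer that never touches argc/argv. It assumes a GCC/Clang constructor attribute as the load-time hook, which may differ from how src/MPI/init.cpp actually registers mpi_initializer. The names mpi_initializer_sketch, sketch_initialized and the macro LPF_SKETCH_WITH_MPI are hypothetical, and the thread level MPI_THREAD_SERIALIZED is only a placeholder. The MPI standard allows null pointers for the argc and argv arguments of MPI_Init_thread, which is what makes dropping the parameters possible.

```cpp
#include <cassert>
#ifdef LPF_SKETCH_WITH_MPI   // hypothetical macro: define it when building with an MPI wrapper
#include <mpi.h>
#endif

// Records that the load-time hook ran (illustration only).
static bool g_sketch_initialized = false;

// Assumption: a GCC/Clang constructor runs this when the shared object is loaded.
// It takes no parameters, so it cannot read uninitialized argc/argv values.
__attribute__((constructor))
static void mpi_initializer_sketch()
{
#ifdef LPF_SKETCH_WITH_MPI
    int provided = 0;
    // Null argc/argv are permitted by the MPI standard.
    // MPI_THREAD_SERIALIZED is a placeholder, not necessarily what LPF requests.
    MPI_Init_thread(nullptr, nullptr, MPI_THREAD_SERIALIZED, &provided);
#endif
    g_sketch_initialized = true;
}

// Lets callers check that the initializer has already run.
bool sketch_initialized()
{
    return g_sketch_initialized;
}
```

Built without LPF_SKETCH_WITH_MPI, the initializer only records that it ran, which is convenient for checking the hook in isolation; with the macro and an MPI compiler wrapper it also initializes MPI.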
2. the post-install debug checks hang
The post-install check at post-install/post-install-test.cmake.in, line 96, hangs with engine = mpimsg and any nprocs greater than 1 (I manually tried 1, which works, but no larger value does): the MPI-spawned processes never terminate. The cause is the call to std::abort() at src/debug/core.cpp, line 939. Some process or library on Ubuntu 22.04 (probably MPI itself, version 4.0 on Ubuntu 22.04) installs a signal handler for SIGABRT (I verified this in the test), which causes the application to hang when the debug library calls std::abort().
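One way to confirm that a SIGABRT handler is already installed is to query the current signal disposition with POSIX sigaction without modifying it. The helper below is a minimal sketch of that idea; its name, sigabrt_has_custom_handler, is hypothetical, and it is not necessarily the check that was run in the test.

```cpp
#include <cassert>
#include <csignal>
#include <signal.h>   // POSIX sigaction

// Returns true if something other than the default or ignore disposition
// is currently installed for SIGABRT. Hypothetical helper for illustration.
bool sigabrt_has_custom_handler()
{
    struct sigaction current {};
    // Passing a null new action only reads the current disposition.
    if (sigaction(SIGABRT, nullptr, &current) != 0) {
        return false;   // query failed; treat as "unknown / not detected"
    }
    if (current.sa_flags & SA_SIGINFO) {
        return true;    // a three-argument handler was registered
    }
    return current.sa_handler != SIG_DFL && current.sa_handler != SIG_IGN;
}
```

Calling this right after MPI initialization, and again before it, would show whether the MPI library is the component that installs the handler.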
Current solution: skip post-install debug checks. It is clearly just a hack.
A more refined solution would be to add an actual lpf_abort() routine that calls MPI_Abort(), but I don't know whether that is in the spirit of LPF. Another option is to remove the calls to std::abort() and change the test to handle failures properly. I am not an LPF expert, so I have no preference, and there may be better solutions.
Finally, one can intercept the SIGABRT in each backend to handle failures and call MPI_Abort(), although this may conflict with the underlying MPI implementation.
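To make the lpf_abort() idea more concrete, here is a minimal sketch in which the debug layer calls a replaceable abort hook instead of std::abort() directly, so that an MPI backend could register a hook that calls MPI_Abort(). All names (abort_hook_t, set_abort_hook_sketch, lpf_abort_sketch, mpi_abort_hook, LPF_SKETCH_WITH_MPI) are hypothetical and not part of LPF; only std::abort and MPI_Abort(MPI_COMM_WORLD, code) are real calls.

```cpp
#include <cassert>
#include <cstdlib>
#ifdef LPF_SKETCH_WITH_MPI   // hypothetical macro: define it when building with an MPI wrapper
#include <mpi.h>
#endif

// Signature of a backend-specific termination routine (hypothetical).
using abort_hook_t = void (*)(int exit_code);

// Default behaviour: what the debug layer does today.
static void default_abort_hook(int /*exit_code*/)
{
    std::abort();
}

#ifdef LPF_SKETCH_WITH_MPI
// What an MPI-based engine could register: terminate all ranks via MPI.
void mpi_abort_hook(int exit_code)
{
    MPI_Abort(MPI_COMM_WORLD, exit_code);
}
#endif

static abort_hook_t g_abort_hook = default_abort_hook;

// Installs a new hook and returns the previous one.
abort_hook_t set_abort_hook_sketch(abort_hook_t hook)
{
    abort_hook_t previous = g_abort_hook;
    g_abort_hook = (hook != nullptr) ? hook : default_abort_hook;
    return previous;
}

// Replacement for direct std::abort() calls in the debug layer.
void lpf_abort_sketch(int exit_code)
{
    g_abort_hook(exit_code);
    std::abort();   // fallback in case the hook returns
}
```

With this design the debug library never raises SIGABRT when an MPI hook is installed, so it does not depend on whatever SIGABRT handler the MPI implementation has registered. Whether such a hook fits LPF's design is left open, as noted above.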
3. detection of MPI with Clang fails
During MPI detection (find_package(MPI) in cmake/mpi.cmake), CMake cannot find MPI if the selected compiler is Clang. Most likely, the compilation of some internal test programs fails because of compiler-specific options that CMake parses. For example, MPICH 4.0 on Ubuntu 22.04 puts -flto=auto -ffat-lto-objects in the variable MPI_C_COMPILE_OPTIONS to enable Link-Time Optimization (LTO). These options cause Clang to fail, since the LTO information in the MPI binaries was generated by gcc.
Current solution: if the compiler is Clang, disable LTO during detection via MPI_COMPILER_FLAGS="-fno-lto", which is appended at the end of internal compiler definitions.
Pros: binaries are now built also with Clang.
Cons: may cause performance degradation (probably small); implicitly assumes that MPI was built with gcc.
A robust solution may be very complex and may depend on CMake detection logic.