Passing post-install for mpimsg engine checks on Ubuntu 22.04 #6

@alberto-scolari

The LPF mpimsg engine currently fails the post-install checks on Ubuntu 22.04 for three reasons:

  1. the initialization routine breaks
  2. the post-install debug checks hang
  3. the detection of MPI with Clang fails

This issue tracks these problems. I pushed several workarounds to the branch associated with this issue, but some of them deserve a more careful solution than what I implemented.

The following sections detail each problem together with its current workaround.

1. the initialization routine breaks

The mpimsg engine is initialized in the routine mpi_initializer in src/MPI/init.cpp, which expects int argc, char ** argv as parameters to pass on to MPI_Init_thread(). mpi_initializer is invoked at load time, when the library is loaded via LD_PRELOAD. However, receiving argc/argv in such a routine is a non-standard, undocumented feature of the Linux dynamic linker, probably removed in recent versions: the variables now hold garbage values, so the related assertions may fail, or any access to argv results in a segmentation fault.

Current solution: do not use argc/argv; the initialization routine now takes no inputs.
Pros: the problem is solved robustly, with no need to revisit the solution.
Cons: implementation-specific parameters cannot be passed to MPI initialization (they are not used in practice).
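
For illustration, the argument-free initialization can be sketched as below. This is not the actual code of mpi_initializer: the names init_thread_fn, InitResult and init_without_args are hypothetical, and the MPI entry point is injected as a function pointer so that the sketch compiles without mpi.h. It relies on two facts from the MPI standard: C applications may pass NULL for both argc and argv, and the thread-level constants are ordered (MPI_THREAD_SINGLE < ... < MPI_THREAD_MULTIPLE).

```cpp
// Same parameter list as MPI_Init_thread(int*, char***, int, int*).
using init_thread_fn = int (*)(int *argc, char ***argv, int required, int *provided);

// Hypothetical result type: whether initialization succeeded and which
// thread level the MPI library actually granted.
struct InitResult {
    bool ok;
    int provided;
};

// Hypothetical helper: initialize MPI without touching argc/argv.
// Passing null pointers is allowed by the MPI standard, so nothing here
// depends on what the dynamic linker hands to a load-time routine.
InitResult init_without_args(init_thread_fn init_thread, int required) {
    int provided = 0;
    const int rc = init_thread(nullptr, nullptr, required, &provided);
    // 0 stands in for MPI_SUCCESS; thread levels are ordered, so ">=" is
    // a valid check that the requested level was granted.
    return InitResult{rc == 0 && provided >= required, provided};
}
```

In a real build one would presumably call init_without_args(MPI_Init_thread, MPI_THREAD_MULTIPLE), or whichever thread level the engine needs, from the load-time routine.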

2. the post-install debug checks hang

The post-install check at post-install/post-install-test.cmake.in, line 96, hangs with engine = mpimsg whenever nprocs is greater than 1 (I manually tried 1, which works, but any larger value does not). The MPI-spawned processes hang. This is due to the call to std::abort() at src/debug/core.cpp, line 939. Some process or library on Ubuntu 22.04 (probably MPI itself, version 4.0 on Ubuntu 22.04) installs a signal handler for SIGABRT (I verified this in the test), which causes the application to hang when the debug library calls std::abort().

Current solution: skip the post-install debug checks. This is clearly just a hack.
A more refined solution would be an actual lpf_abort() routine that calls MPI_Abort(), but I don't know whether that is in the spirit of LPF. Another possible solution is to remove the calls to std::abort() and change the test to handle failures properly. I am not an LPF expert, so I have no preference, and there may be better solutions.
Finally, each backend could intercept SIGABRT to handle failures and call MPI_Abort(), although this may conflict with the underlying MPI implementation.
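
To make the last option concrete, here is a minimal sketch using POSIX sigaction. It shows how to check whether something already replaced the default SIGABRT disposition and how a backend might install its own hook. All names (abort_hook, has_custom_sigabrt_handler, install_sigabrt_hook) are hypothetical and not part of LPF; in an MPI backend the hook would presumably wrap MPI_Abort(), which is injected here as a plain function pointer so the sketch has no MPI dependency.

```cpp
#include <csignal>
#include <signal.h>  // POSIX sigaction

// Hypothetical hook type; an MPI backend could pass a function that calls
// MPI_Abort(MPI_COMM_WORLD, exit_code).
using abort_hook = void (*)(int exit_code);

static abort_hook g_abort_hook = nullptr;

// Signal handler: forwards SIGABRT to the registered hook, if any.
static void on_sigabrt(int signo) {
    if (g_abort_hook != nullptr) {
        g_abort_hook(128 + signo);  // conventional shell exit status for a signal
    }
}

// Returns true if a handler other than SIG_DFL / SIG_IGN is installed
// for SIGABRT, e.g. by a library loaded earlier.
bool has_custom_sigabrt_handler() {
    struct sigaction current{};
    if (sigaction(SIGABRT, nullptr, &current) != 0) {
        return false;  // query failed; treat as "unknown / none"
    }
    if (current.sa_flags & SA_SIGINFO) {
        return true;  // a three-argument sa_sigaction handler is in use
    }
    return current.sa_handler != SIG_DFL && current.sa_handler != SIG_IGN;
}

// Installs on_sigabrt as the SIGABRT handler and registers the hook.
// This replaces whatever handler was installed before.
bool install_sigabrt_hook(abort_hook hook) {
    g_abort_hook = hook;
    struct sigaction action{};
    action.sa_handler = on_sigabrt;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    return sigaction(SIGABRT, &action, nullptr) == 0;
}
```

Two caveats apply. The hook replaces any handler the MPI library installed, which is exactly the potential conflict mentioned above. Also, as far as I know MPI does not guarantee that MPI_Abort() is async-signal-safe, so calling it from a handler is an assumption that would need checking against the MPI implementation in use.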

3. the detection of MPI with Clang fails

During MPI detection (find_package(MPI) in cmake/mpi.cmake), CMake cannot find MPI if the compiler passed is Clang. Probably, the compilation of some internal tests fails because of compiler-specific options that CMake parses. For example, MPICH 4.0 on Ubuntu 22.04 has -flto=auto -ffat-lto-objects in the variable MPI_C_COMPILE_OPTIONS to enable Link-Time Optimization (LTO). These options cause Clang to fail, since the LTO information in the MPI binary is generated by gcc.

Current solution: if the compiler is Clang, disable LTO during detection via MPI_COMPILER_FLAGS="-fno-lto", which is appended at the end of the internal compiler definitions.
Pros: binaries now build with Clang as well.
Cons: may cause performance degradation (probably small); implicitly assumes that MPI is built with gcc.
A robust solution may be very complex and may depend on CMake's detection logic.
