libmpi.so is linked with the wrong libopen-pal.so #12567

@SeyedMir

Background information

What version of Open MPI are you using? v5.0.3 tag of the git repo.

Describe how Open MPI was installed

Installed from git clone. Configured as below (after ./autogen.pl):

--enable-mpirun-prefix-by-default --with-cuda=$CUDA_HOME --with-cuda-libdir=$CUDA_HOME/lib64/stubs --with-ucx=$UCX_HOME --with-ucx-libdir=$UCX_HOME/lib --enable-mca-no-build=btl-uct --with-pmix=internal --with-hwloc=internal --with-libevent=internal --with-slurm

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+6f81bfd163f3275d2b0630974968c82759dd4439 3rd-party/openpmix (v1.1.3-3983-g6f81bfd1)
+4f27008906d96845e22df6502d6a9a29d98dec83 3rd-party/prrte (psrvr-v2.0.0rc1-4746-g4f27008906)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.3
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

After building Open MPI, the resulting libmpi.so is linked against a libopen-pal.so.40 that already exists on the system and does not provide the needed symbols. As a result, using mpicc fails with errors like these:

./bin/mpicc test.c
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `mca_common_sm_fini'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_common_ucx_support_level'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_finalize_set_domain'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_built_with_rocm_support'

Using mpirun fails with the following error:
libmpi.so.40: undefined symbol: opal_smsc_base_framework

Some more details:

readelf -d libmpi.so | grep NEEDED | grep open-pal
 0x0000000000000001 (NEEDED)             Shared library: [libopen-pal.so.40]
ldd libmpi.so | grep open-pal
        libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40

This happens despite the fact that the correct libopen-pal files are built and exist in the lib directory of the prefix:

libopen-pal.so       
libopen-pal.so.80    
libopen-pal.so.80.0.3
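The check above can be scripted. The following is a minimal sketch, not part of Open MPI: `needed_libs` and `missing_from_dir` are hypothetical helper names, and `$PREFIX` stands for the install prefix. It extracts the NEEDED entries from `readelf -d` output and reports which of them have no file of that name in the prefix lib directory. With the listing above, `libopen-pal.so.40` would be reported, which is consistent with the loader resolving it from /lib/x86_64-linux-gnu instead.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and print
# only the library names that appear in (NEEDED) entries.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# Hypothetical helper: read library names on stdin and print those
# that do not exist in the directory given as the first argument.
missing_from_dir() {
    dir=$1
    while IFS= read -r lib; do
        [ -e "$dir/$lib" ] || printf '%s\n' "$lib"
    done
}

# Example usage (PREFIX is an assumed variable for the install prefix):
#   readelf -d "$PREFIX/lib/libmpi.so" | needed_libs | missing_from_dir "$PREFIX/lib"
```

Names printed by the pipeline also include system libraries that are never expected in the prefix, so only the Open MPI entries are of interest here.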

As a dirty workaround, I have to create a libopen-pal.so.40 symlink pointing to the correct libopen-pal.so in the installation lib directory (LD_LIBRARY_PATH is already set to the prefix lib directory).
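For reference, the workaround can be sketched as a small shell function. This is only an illustration of the symlink described above: `link_compat_soname` is a hypothetical name, and the function assumes the lib directory contains the freshly built `libopen-pal.so`. It hides the symptom and does not explain why the link step recorded `libopen-pal.so.40` in the first place.

```shell
# Hypothetical helper: create libopen-pal.so.40 as a symlink to the
# libopen-pal.so found in the given lib directory.
link_compat_soname() {
    libdir=$1
    target=libopen-pal.so
    link=libopen-pal.so.40
    # Refuse to create a dangling link if the real library is absent.
    if [ ! -e "$libdir/$target" ]; then
        echo "missing $libdir/$target" >&2
        return 1
    fi
    # Relative link, so it stays valid if the prefix is relocated.
    ln -sfn "$target" "$libdir/$link"
}

# Example usage (PREFIX is an assumed variable for the install prefix):
#   link_compat_soname "$PREFIX/lib"
```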

So my questions are: why is libmpi.so linked against a libopen-pal.so.40 that does not provide the symbols it needs, and how can I avoid that?
