Description
Background information
What version of Open MPI are you using? v5.0.3 tag of the git repo.
Describe how Open MPI was installed
Installed from git clone. Configured as below (after ./autogen.pl):
--enable-mpirun-prefix-by-default --with-cuda=$CUDA_HOME --with-cuda-libdir=$CUDA_HOME/lib64/stubs --with-ucx=$UCX_HOME --with-ucx-libdir=$UCX_HOME/lib --enable-mca-no-build=btl-uct --with-pmix=internal --with-hwloc=internal --with-libevent=internal --with-slurm
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
+6f81bfd163f3275d2b0630974968c82759dd4439 3rd-party/openpmix (v1.1.3-3983-g6f81bfd1)
+4f27008906d96845e22df6502d6a9a29d98dec83 3rd-party/prrte (psrvr-v2.0.0rc1-4746-g4f27008906)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04.3
- Computer hardware: x86_64
- Network type: IB
Details of the problem
After building Open MPI, the resulting libmpi.so is linked against a pre-existing libopen-pal.so.40 on the system, which does not provide the symbols it needs. As a result, compiling with mpicc fails with errors like the following:
./bin/mpicc test.c
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `mca_common_sm_fini'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_common_ucx_support_level'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_finalize_set_domain'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_built_with_rocm_support'
Running with mpirun fails with the following error:
libmpi.so.40: undefined symbol: opal_smsc_base_framework
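One way to confirm which library actually exports the missing symbols is to inspect the dynamic symbol table of each candidate libopen-pal with nm -D --defined-only. The sketch below is only an illustration: the helper name defines_symbol and the LIB variable are hypothetical, and it assumes GNU binutils nm output in which the symbol name is the last field on each line.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or binutils).
# Reads `nm -D --defined-only <library>` output on stdin and succeeds
# (exit status 0) only if the symbol named in $1 is listed.
defines_symbol() {
    awk -v sym="$1" '$NF == sym { found = 1 } END { exit !found }'
}

# Example usage; LIB is an assumed variable holding the path of the
# libopen-pal file to inspect:
#   nm -D --defined-only "$LIB" | defines_symbol opal_smsc_base_framework \
#       && echo "defined in $LIB" || echo "missing from $LIB"
```

Running this against both the system copy and the copy in the install prefix should show which of the two defines the symbols that libmpi.so references.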
Some more details:
readelf -d libmpi.so | grep NEEDED | grep open-pal
0x0000000000000001 (NEEDED) Shared library: [libopen-pal.so.40]
ldd libmpi.so | grep open-pal
libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40
This happens despite the fact that the correct libopen-pal files are built and exist in the lib directory of the prefix:
libopen-pal.so
libopen-pal.so.80
libopen-pal.so.80.0.3
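The mismatch can also be made explicit by comparing the NEEDED entries of libmpi.so with the SONAME recorded in the freshly built libopen-pal.so (presumably libopen-pal.so.80, judging from the file names above). This is a minimal sketch, not a definitive diagnostic: the helper names needed_libs and soname_of and the PREFIX variable are hypothetical, and the parsing assumes the bracketed format that readelf -d prints.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Open MPI or binutils).
# Both read `readelf -d <library>` output on stdin.

# Print the library name of every (NEEDED) entry, one per line.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\([^]]*\)].*/\1/p'
}

# Print the library's own SONAME, if it has one.
soname_of() {
    sed -n 's/.*(SONAME).*\[\([^]]*\)].*/\1/p'
}

# Example usage; PREFIX is an assumed variable holding the install prefix:
#   readelf -d "$PREFIX/lib/libmpi.so"      | needed_libs | grep open-pal
#   readelf -d "$PREFIX/lib/libopen-pal.so" | soname_of
```

If the two names differ, libmpi.so was linked against a different libopen-pal than the one installed next to it.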
As a dirty workaround, I have to create a libopen-pal.so.40 symlink pointing to the correct libopen-pal.so in the installation lib directory (LD_LIBRARY_PATH is already set to the prefix lib directory).
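For reference, the workaround amounts to something like the sketch below. The function name link_pal_40 is made up for illustration, and its argument is assumed to be the lib directory of the install prefix.

```shell
#!/bin/sh
# Hypothetical helper reproducing the symlink workaround described above.
# $1: lib directory of the install prefix.
link_pal_40() {
    libdir="$1"
    # Refuse to act if the freshly built library is not there.
    if [ ! -e "$libdir/libopen-pal.so" ]; then
        echo "no libopen-pal.so in $libdir" >&2
        return 1
    fi
    # Relative link, so it stays valid if the whole prefix is moved.
    ln -sf libopen-pal.so "$libdir/libopen-pal.so.40"
}

# Example usage; PREFIX is an assumed variable holding the install prefix:
#   link_pal_40 "$PREFIX/lib"
```

This only makes the loader resolve the .40 name to the new library; libmpi.so still records libopen-pal.so.40 as its dependency.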
So my question is: why is libmpi.so linked against a libopen-pal.so.40 that does not provide the symbols it needs, and how can I avoid that?