Description
I'll try my best to describe the issue. We tried to debug this internally with people working on PMIx and didn't really get to a solution other than downgrading to Open MPI 4.1.7 (which works as expected).
Background information
I am working on two x86 nodes running Rocky 9.1, each with an NVIDIA BlueField-2 DPU (one per node) running a recent NVIDIA-provided BFB image (Linux 5.4.0-1023-bluefield #26-Ubuntu SMP PREEMPT Wed Dec 1 23:59:51 UTC 2021 aarch64, to be precise).
The NIC/DPU is configured in InfiniBand mode, and SSH connections between all four hosts work. Launching a simple MPI hello world works with Open MPI 4.1.7 (tried with separate installations built --without-ucx and --with-ucx against UCX 1.17.0).
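For reference, the test program is just a minimal hello world. The actual MPI_Helloworld source is not included in this report, so the following is an assumed stand-in that prints rank, size, and host name after MPI_Init, compiled with the mpicc from the install under test:

cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* minimal reproducer: init, report rank/size/host, finalize */
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
EOF
/opt/openmpi-5.0.6/bin/mpicc hello.c -o MPI_Helloworld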
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I tried 5.0.3, 5.0.5, and 5.0.6 from the official tarballs on the Open MPI download page. Each was compiled with --with-pmix=internal --with-hwloc=internal, once with UCX 1.17.0 and once without UCX (I'm working with UCX, so I wanted separate MPI installs to compare).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Root installation:
cd /usr/local/src
sudo tar -xvf openmpi
cd /usr/local/src/openmpi/
sudo ./configure --prefix=/opt/openmpi-5.0.6 --with-pmix=internal --with-hwloc=internal --without-ucx # or with --with-ucx=/opt/ucx-1.17.0
sudo make -j8 all
sudo make install
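To double-check which options each build actually picked up on every node, ompi_info from the respective prefix can be queried (a verification sketch; the grep patterns are just examples):

# confirm version and configure flags of the install used on the host
/opt/openmpi-5.0.6/bin/ompi_info | grep -E 'Open MPI:|Configure command line'
# same check on the DPU over ssh
ssh wolpy10-dpu-ib "/opt/openmpi-5.0.6/bin/ompi_info | grep -E 'Open MPI:|Configure command line'"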
Please describe the system on which you are running
- Operating system/version: Rocky 9.1 (x86 hosts) and Ubuntu 22.04 (DPUs)
- Computer hardware: x86 and ARM (BlueField-2) processors
- Network type: InfiniBand
 
Details of the problem
I'm trying to launch MPI processes between the DPU and the host. The process starts on both ranks: the remote (DPU) rank finishes initialization and prints a debug line containing its rank, and its stdout is captured and arrives back at the mpirun host, but the rank on the x86 host never finishes initialization. Whether or not UCX is available makes no difference.
Host to host works. DPU to DPU works (when mpirun is started from the host, NOT when started from the DPU), and host to DPU hangs. All of the following commands are run from the host:
# works host to host
/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy09-ib,wolpy10-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 /var/mpi/dfherr/5.0.6/MPI_Helloworld
# works dpu to dpu (started from host)
/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy09-dpu-ib,wolpy10-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 /var/mpi/dfherr/5.0.6/MPI_Helloworld
# hangs
/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy10-ib,wolpy10-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 /var/mpi/dfherr/5.0.6/MPI_Helloworld
All of the above commands, as well as starting mpirun on the DPUs, work with Open MPI 4.1.7, both with and without UCX compiled in.
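If it helps for scripted reproduction, the hanging case can be wrapped in timeout so the run is killed automatically instead of blocking the terminal (just a convenience sketch; the 120 s limit is arbitrary):

# kill the job if it has not finished within 120 seconds; exit code 124 means it hung
timeout --signal=TERM 120 \
    /opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 \
    --host wolpy10-ib,wolpy10-dpu-ib -np 2 \
    --mca btl_tcp_if_include 10.12.0.0/16 \
    /var/mpi/dfherr/5.0.6/MPI_Helloworld
echo "exit code: $?"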
With additional debug output, the hang always seems to occur right after a dmdx key exchange has completed (see the comment below for up-to-date debug output). These verbosity options were added to the hanging mpirun command:
--debug-daemons --leave-session-attached --mca odls_base_verbose 10 --mca state_base_verbose 10 
--prtemca pmix_server_verbose 10 --mca prte_data_server_verbose 100 --mca pmix_base_verbose 10 
--mca pmix_server_base_verbose 100 --mca ras_base_verbose 100 --mca plm_base_verbose 100
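Put together, the verbose run of the hanging host-to-DPU case looks roughly like this (a sketch; the log file name is just an example):

/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 \
    --host wolpy10-ib,wolpy10-dpu-ib -np 2 \
    --mca btl_tcp_if_include 10.12.0.0/16 \
    --debug-daemons --leave-session-attached \
    --mca odls_base_verbose 10 --mca state_base_verbose 10 \
    --prtemca pmix_server_verbose 10 --mca prte_data_server_verbose 100 \
    --mca pmix_base_verbose 10 --mca pmix_server_base_verbose 100 \
    --mca ras_base_verbose 100 --mca plm_base_verbose 100 \
    /var/mpi/dfherr/5.0.6/MPI_Helloworld 2>&1 | tee mpirun-5.0.6-debug.log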
Happy to provide further debug output. For now I'm fine running Open MPI 4.1.7, but I felt I should report this issue against Open MPI 5.0.x regardless.
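If backtraces from the hung processes would help, I can grab them along these lines (a sketch; gdb has to be available on both the host and the DPU, and the PID lookups assume a single instance of each process):

# on the x86 host: backtraces of the hung mpirun and the application rank
sudo gdb -batch -p "$(pidof mpirun)" -ex "thread apply all bt"
sudo gdb -batch -p "$(pidof MPI_Helloworld)" -ex "thread apply all bt"
# on the DPU: backtrace of the prted daemon
ssh wolpy10-dpu-ib 'sudo gdb -batch -p "$(pidof prted)" -ex "thread apply all bt"'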