openmpi 3.1.3: vader hang at mca_btl_vader_component_progress() #6088

@LiweiPeng

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI 3.1.3 has this issue. Open MPI 3.1.1 does not.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was built from source with: ./configure --prefix=/opt/rdma/mpi/openmpi --enable-mpirun-prefix-by-default --with-cuda --disable-io-romio --enable-picky

Please describe the system on which you are running

  • Operating system/version: CentOS 7.4, x86_64.
  • Computer hardware: PC
  • Network type: same node, shared memory

Details of the problem

Running the Intel MPI Benchmarks (IMB) alltoall test on a single node over shared memory hangs with Open MPI 3.1.3. Open MPI 3.1.1 does not have this issue. With Open MPI 3.1.3, replacing the vader BTL with smcuda makes the hang go away and MPI works normally.

shell$ mpirun -n 2 --mca btl vader,self IMB-MPI1 alltoall 

(gdb) bt
#0  0x00007f4c86bbd16e in mca_btl_vader_component_progress () from /opt/rdma/mpi/openmpi/lib/openmpi/mca_btl_vader.so
#1  0x00007f4c96f3a4ec in opal_progress () from /opt/rdma/mpi/openmpi/lib/libopen-pal.so.40
#2  0x00007f4c97afb885 in ompi_request_default_wait () from /opt/rdma/mpi/openmpi/lib/libmpi.so.40
#3  0x00007f4c97b4e5aa in ompi_coll_base_barrier_intra_two_procs () from /opt/rdma/mpi/openmpi/lib/libmpi.so.40
#4  0x00007f4c97b109d7 in PMPI_Barrier () from /opt/rdma/mpi/openmpi/lib/libmpi.so.40
#5  0x000000000040b4e3 in IMB_alltoall ()
#6  0x0000000000405bad in IMB_init_buffers_iter ()
#7  0x0000000000402105 in main ()
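
To compare the two BTLs side by side, something like the following sketch may help. It runs the same benchmark once with vader and once with smcuda, and uses the coreutils `timeout` command to flag a run that never returns. The `run_btl` helper and the 120-second limit are illustrative assumptions, not part of the original report; the smcuda command line is inferred from the description above.

```shell
#!/bin/sh
# Sketch: run the same IMB test with each BTL and flag a hang via a timeout.
# LIMIT is an arbitrary assumption, not a value taken from the report.
LIMIT=120

# run_btl is a hypothetical helper; only the BTL name changes between runs.
run_btl() {
    btl="$1"
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "$btl: mpirun not found, skipping"
        return 0
    fi
    timeout "$LIMIT" mpirun -n 2 --mca btl "$btl,self" IMB-MPI1 alltoall
    status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit is reached.
        echo "$btl: timed out after ${LIMIT}s (likely hang)"
    else
        echo "$btl: exited with status $status"
    fi
}

run_btl vader
run_btl smcuda
```

On the setup described here, the vader run would be expected to hit the timeout under 3.1.3 while the smcuda run completes. A healthy run on slower hardware may need a larger limit.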
