openMPI dropped inbound connection #12918

@KansaiTraining

I have found a couple of issues that seem similar to this one, but I can't tell whether they have been resolved or how they apply to my situation.

I am running Slurm jobs with srun using Open MPI. When I run a job on a single node it completes (with some warnings), but when I run it on two nodes I get:

[5A301-0407-G5500-12:89116] btl: tcp: attempting to connect() to [[62864,0],0] address 10.3.29.82 on port 1031
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          5A301-0407-G5500-11
  Local PID:           49564
  Peer hostname:       5A301-0407-G5500-12 ([[62864,0],8])
  Source IP of socket: 10.3.29.53
  Known IPs of peer:   a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d::
--------------------------------------------------------------------------

I investigated, and it seems that node 11 cannot communicate with node 12.
One thing that puzzles me is the list of known peer IPs: a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d::. They look like IPv6 addresses, but I don't know what they are, because:

  1. similar errors reported online usually list IPv4 addresses here
  2. these IPv6 addresses do not appear anywhere in the output of ip addr
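
One possible reading of these strings, offered as an assumption rather than anything confirmed by the Open MPI documentation, is that each one is an IPv4 address whose four bytes were printed as two 16-bit hex groups. The Python sketch below tests that reading; the helper name `groups_to_ipv4` is hypothetical and only used for illustration.

```python
import ipaddress

def groups_to_ipv4(text):
    """Read the first two 16-bit hex groups of `text` as one IPv4 address.

    Assumption: the "Known IPs" entries are IPv4 addresses that were
    formatted as if they were IPv6. This is a guess, not documented behavior.
    """
    high, low = text.split("::")[0].split(":")
    value = (int(high, 16) << 16) | int(low, 16)
    return str(ipaddress.IPv4Address(value))

# The four entries from the "Known IPs of peer" line in the error message.
known_ips = [
    "a03:190d::a03:190d::",
    "a03:1871::a03:1871::",
    "a03:1d53::a03:1d53::",
    "a03:1d0d::a03:1d0d::",
]

for entry in known_ips:
    print(entry, "->", groups_to_ipv4(entry))
```

Under this reading the entries decode to 10.3.25.13, 10.3.24.113, 10.3.29.83 and 10.3.29.13. All four fall in the same 10.3.x.x range as the addresses in the log, and none of them is 10.3.29.53, which would be consistent with the message calling that source IP unexpected.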

I investigated further: 10.3.29.53 is node 12's 25G RoCEv2 network interface.
Also, 10.3.29.82 (the address in the verbose log above) is node 11's 25G RoCEv2 control network interface.

Another thing that confuses me: the log says node 12 is attempting to connect to node 11 on the RoCEv2 control network, but the error message seems to say the opposite, that node 11 is trying to connect to node 12 on an unexpected IP.

I have tried restricting OMPI_MCA_btl_tcp_if_include to several values. The error disappeared only once, and in that case the process hung afterwards.
I am at a loss as to how to proceed.
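
When choosing a value for OMPI_MCA_btl_tcp_if_include, it can help to first check whether the addresses involved sit on one subnet, since the parameter accepts CIDR subnets as well as interface names. The Python sketch below does that check for the two addresses quoted above. The /24 prefix length is an assumption (the real mask should be read from ip addr on each node), and `common_subnet` is a hypothetical helper.

```python
import ipaddress

def common_subnet(addresses, prefix_len=24):
    """Return the one network containing every address, or None.

    prefix_len=24 is an assumed netmask, not a value taken from the cluster.
    """
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_len}").network
        for addr in addresses
    }
    return networks.pop() if len(networks) == 1 else None

# Source IP from the error message and target IP from the verbose log.
subnet = common_subnet(["10.3.29.53", "10.3.29.82"])

if subnet is not None:
    # Candidate setting to export on every node before calling srun.
    print(f"export OMPI_MCA_btl_tcp_if_include={subnet}")
else:
    print("The addresses are on different subnets under the assumed prefix.")
```

If the addresses do share a subnet, exporting the printed value before launching the job restricts the TCP BTL to that network on every node. Whether that resolves the dropped connection here is not established.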
