
mpirun 5.0.5 - TCP connection failure between hosts with multiple network interfaces #13155

@amjal

Description


Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

$ ompi_info --version
Open MPI v5.0.5
$ prte_info --all | head
                    PRTE: 3.0.6rc1
      PRTE repo revision: 2025-03-17
       PRTE release date: @PMIX_RELEASE_DATE@
                    PMIx: OpenPMIx 5.0.3rc1 (PMIx Standard: 4.2, Stable ABI:
                          0.0, Provisional ABI: 0.0)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded the source code from https://www.open-mpi.org/. Then:

$ ./configure --with-cuda=/usr/local/cuda-12.6 --with-gdrcopy
$ sudo make -j install 
$ sudo ldconfig 

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.4 LTS

  • Computer hardware: x86_64

  • Network type:
    I have two VM nodes on separate physical machines:

    • node 1 is named ucc-h2:

      • interface ens3, 192.168.122.15/24, is a virtual interface connected to a libvirt bridge
      • interface ens7, 192.168.1.12/24, is a host interface assigned to the node using PCI passthrough.
    • node 2 is named ucc-h5:

      • interface enp5s1, 192.168.122.195/24, is a virtual interface connected to a libvirt bridge
      • interface ens7, 192.168.3.11/24, is a host interface assigned to the node using PCI passthrough.

ens3 and enp5s1 don't ping each other. They are used for management (mostly ssh). The two ens7 interfaces are connected through a router so they can ping each other.

  • Other relevant networking stuff:
    • nc works in both directions: `nc -l <port num>` on ucc-h2 and `nc -N ucc-h2 <port num>` on ucc-h5 succeed, and vice versa.
    • `ssh <hostname>` works without requiring a password on both hosts.
    • amir@ucc-h2:~$ ip route get 192.168.3.11
      192.168.3.11 via 192.168.1.1 dev ens7 src 192.168.1.12 uid 1000
          cache
    • amir@ucc-h5:~$ ip route get 192.168.1.12
      192.168.1.12 via 192.168.3.1 dev ens7 src 192.168.3.11 uid 1000
          cache

I think that covers it, but please let me know if I'm missing something; I'll be happy to provide more info.

Details of the problem

I am trying to get `mpirun -n 1 --host <hostname> hostname` to work on both hosts.

amir@ucc-h2:~$ mpirun -n 1 --host ucc-h5 hostname
ucc-h5

So it's working fine on ucc-h2. But on ucc-h5:

amir@ucc-h5:~$ mpirun -n 1 --host ucc-h2 hostname
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-ucc-h5-196668@0,0] on node ucc-h5
  Remote daemon: [prterun-ucc-h5-196668@0,1] on node ucc-h2

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

It's timing out.

I ran the commands on both ucc-h2 and ucc-h5 with high verbosity to compare them:

 mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 -n 1 --host <hostname> hostname

The full outputs are really long so I've included them in separate files in this gist.

But in summary...

In both cases (the successful one and the failing one) the remote daemon tries to establish a connection to the master node using both interfaces. The connection via the wrong interface (ens3@ucc-h2 or enp5s1@ucc-h5) times out after a couple of retries:

prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.122.15:36849 - 1 retries
prte_tcp_peer_try_connect: 192.168.122.15:36849 is down

Then it tries the other interface (the right one):

prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries

Here, as far as I can tell, is where things differ depending on which node is the master node, causing the asymmetric behaviour.
If the master node is ucc-h2:

[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] oob:tcp:peer creating socket to [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] waiting for connect completion to [prterun-ucc-h2-281015@0,0] - activating send event
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler called to send to peer [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler CONNECTING
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1]:tcp:complete_connect called for peer [prterun-ucc-h2-281015@0,0] on socket 36
[ucc-h2:281015] [prterun-ucc-h2-281015@0,0] prte_oob_tcp_listen_thread: incoming connection: (40, 0) 192.168.3.11:59759
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp_peer_complete_connect: sending ack to [prterun-ucc-h2-281015@0,0]

The connection is established.
If the master node is ucc-h5:

[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 0 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] oob:tcp:peer creating socket to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] waiting for connect completion to [prterun-ucc-h5-195762@0,0] - activating send event
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    ucc-h2
  Remote host:   192.168.3.11
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler called to send to peer [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:tcp:complete_connect called for peer [prterun-ucc-h5-195762@0,0] on socket 36
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]-[prterun-ucc-h5-195762@0,0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp_peer_close for [prterun-ucc-h5-195762@0,0] sd 36 state CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:[oob_tcp_connection.c:1066] connect to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on socket -1
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.122.195:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.122.195:47555 is down
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.3.11:47555 is down

It finds both addresses down.

I also dug into the network traffic a little and found this on ens7@ucc-h2 when ucc-h5 is the master node (the failing case):

[Image: packet capture on ens7@ucc-h2]

It looks like ucc-h2 is trying to talk to ucc-h5 through its ens7 interface, but with the source IP of its ens3 interface! I don't have enough networking experience to know how this could happen; it genuinely surprised me. I don't know whether this is the root cause or a symptom of another issue.

I know there are existing issues about Open MPI behaving oddly when hosts have multiple interfaces, like #5818 and #12232, but I couldn't find my answer there.

I have tried all sorts of if_include/if_exclude flags on multiple MCA frameworks (opal, oob, prte, etc.), using both interface names and CIDR notation as parameters. It's possible I made a mistake somewhere, so please let me know how it's properly done; I'm open to suggestions. For example, I tried the following, which made the most sense to me:

mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 --mca oob_tcp_if_exclude 192.168.122.0/24 --prtemca oob_tcp_if_exclude 192.168.122.0/24 -n 1 --host ucc-h2 hostname 

But it didn't change the outcome; the daemon still tries both interfaces.

This is as far as I've been able to get. I'd appreciate any hints or directions for investigating further. I haven't been able to reproduce or isolate the problem on the network side, because all the networking tools I know of behave normally; the issue only appears when using mpirun.
