
mpirun 5.0.5 - TCP connection failure between hosts with multiple network interfaces #13155

@amjal

Description


Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

$ ompi_info --version
Open MPI v5.0.5
$ prte_info --all | head
                    PRTE: 3.0.6rc1
      PRTE repo revision: 2025-03-17
       PRTE release date: @PMIX_RELEASE_DATE@
                    PMIx: OpenPMIx 5.0.3rc1 (PMIx Standard: 4.2, Stable ABI:
                          0.0, Provisional ABI: 0.0)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded the source code from https://www.open-mpi.org/. Then:

$ ./configure --with-cuda=/usr/local/cuda-12.6 --with-gdrcopy
$ sudo make -j install 
$ sudo ldconfig 

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.4 LTS

  • Computer hardware: x86_64

  • Network type:
    I have two VM nodes on separate physical machines:

    • node 1 is named ucc-h2:

      • interface ens3, 192.168.122.15/24, is a virtual interface connected to a libvirt bridge
      • interface ens7, 192.168.1.12/24, is a host interface assigned to the node using PCI passthrough.
    • node 2 is named ucc-h5:

      • interface enp5s1, 192.168.122.195/24, is a virtual interface connected to a libvirt bridge
      • interface ens7, 192.168.3.11/24, is a host interface assigned to the node using PCI passthrough.

ens3 and enp5s1 don't ping each other. They are used for management (mostly ssh). The two ens7 interfaces are connected through a router so they can ping each other.

  • Other relevant networking stuff:
    • nc works in both directions: `nc -l <port num>` on ucc-h2 and `nc -N ucc-h2 <port num>` on ucc-h5 succeed, and vice versa.
    • `ssh <hostname>` works without requiring a password on both hosts.
    • amir@ucc-h2:~$ ip route get 192.168.3.11
      192.168.3.11 via 192.168.1.1 dev ens7 src 192.168.1.12 uid 1000
          cache
    • amir@ucc-h5:~$ ip route get 192.168.1.12
      192.168.1.12 via 192.168.3.1 dev ens7 src 192.168.3.11 uid 1000
          cache

I think that covers it, but please let me know if I'm missing something; I'll be happy to provide more info.

Details of the problem

I am trying to get `mpirun -n 1 --host <hostname> hostname` to work on both hosts.

amir@ucc-h2:~$ mpirun -n 1 --host ucc-h5 hostname
ucc-h5

So it's working fine on ucc-h2. But on ucc-h5:

amir@ucc-h5:~$ mpirun -n 1 --host ucc-h2 hostname
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-ucc-h5-196668@0,0] on node ucc-h5
  Remote daemon: [prterun-ucc-h5-196668@0,1] on node ucc-h2

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

It's timing out.

I ran the commands on both ucc-h2 and ucc-h5 with high verbosity to compare them:

 mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 -n 1 --host <hostname> hostname

The full outputs are really long so I've included them in separate files in this gist.

But in summary...

In both cases (the successful one and the failing one) the remote daemon tries to establish a connection to the master node using both interfaces. The connection via the wrong interface (ens3@ucc-h2 or enp5s1@ucc-h5) times out after a couple of retries:

prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.122.15:36849 - 1 retries
prte_tcp_peer_try_connect: 192.168.122.15:36849 is down

Then it tries the other interface (the right one):

prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries

Here, as far as I can tell, is where things differ depending on which node is the master node, causing the asymmetric behaviour.
If the master node is ucc-h2:

[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] oob:tcp:peer creating socket to [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] waiting for connect completion to [prterun-ucc-h2-281015@0,0] - activating send event
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler called to send to peer [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler CONNECTING
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1]:tcp:complete_connect called for peer [prterun-ucc-h2-281015@0,0] on socket 36
[ucc-h2:281015] [prterun-ucc-h2-281015@0,0] prte_oob_tcp_listen_thread: incoming connection: (40, 0) 192.168.3.11:59759
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp_peer_complete_connect: sending ack to [prterun-ucc-h2-281015@0,0]

The connection is established.
If the master node is ucc-h5:

[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 0 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] oob:tcp:peer creating socket to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] waiting for connect completion to [prterun-ucc-h5-195762@0,0] - activating send event
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    ucc-h2
  Remote host:   192.168.3.11
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler called to send to peer [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:tcp:complete_connect called for peer [prterun-ucc-h5-195762@0,0] on socket 36
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]-[prterun-ucc-h5-195762@0,0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp_peer_close for [prterun-ucc-h5-195762@0,0] sd 36 state CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:[oob_tcp_connection.c:1066] connect to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on socket -1
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.122.195:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.122.195:47555 is down
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.3.11:47555 is down

It finds both addresses down.

I also dug into the network traffic a little and found this on ens7@ucc-h2 when ucc-h5 is the master node (the failing case):

[Image: packet capture on ens7@ucc-h2]

It looks like ucc-h2 is trying to talk to ucc-h5 through its ens7 interface, but with the source IP of its ens3 interface! I don't have enough networking experience to know how this could happen; it genuinely surprised me. I don't know whether this is the root cause or a symptom of another issue.

I know there are existing issues about Open MPI behaving oddly when hosts have multiple interfaces, like #5818 and #12232, but I couldn't find my answer there.

I have tried all sorts of if_include/if_exclude flags on multiple MCA frameworks (opal, oob, prte, etc.), using both interface names and CIDR notation as parameters. It's possible I made a mistake somewhere, so please let me know how it's properly done; I'm open to suggestions. For example, I tried the following, which made the most sense to me:

mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 --mca oob_tcp_if_exclude 192.168.122.0/24 --prtemca oob_tcp_if_exclude 192.168.122.0/24 -n 1 --host ucc-h2 hostname 

But it didn't change the outcome; the daemon still tries both interfaces.

This is as far as I've been able to get. I'd appreciate any hints or directions for investigating further. I haven't been able to reproduce or isolate the problem on the network side, because all the networking tools I know of behave normally; the issue only appears when using mpirun.
