Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
$ ompi_info --version
Open MPI v5.0.5
$ prte_info --all | head
PRTE: 3.0.6rc12025-03-17
PRTE repo revision: 2025-03-17
PRTE release date: @PMIX_RELEASE_DATE@
PMIx: OpenPMIx 5.0.3rc1 (PMIx Standard: 4.2, Stable ABI: 0.0, Provisional ABI: 0.0)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded the source code from https://www.open-mpi.org/. Then:
$ ./configure --with-cuda=/usr/local/cuda-12.6 --with-gdrcopy
$ sudo make -j install
$ sudo ldconfig
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04.4 LTS
- Computer hardware: x86_64
- Network type:
  I have two VM nodes on separate physical machines:
  - node 1 is named ucc-h2:
    - interface ens3, 192.168.122.15/24, is a virtual interface connected to a libvirt bridge
    - interface ens7, 192.168.1.12/24, is a host interface assigned to the node using PCI passthrough.
  - node 2 is named ucc-h5:
    - interface enp5s1, 192.168.122.195/24, is a virtual interface connected to a libvirt bridge
    - interface ens7, 192.168.3.11/24, is a host interface assigned to the node using PCI passthrough.
  ens3 and enp5s1 don't ping each other. They are used for management (mostly ssh). The two ens7 interfaces are connected through a router so they can ping each other.
- Other relevant networking stuff:
  - nc works both ways. Meaning nc -l <port num> on ucc-h2 and nc -N ucc-h2 <port num> on ucc-h5 works fine and vice versa.
  - ssh <hostname> works without requiring password on both hosts.
  - amir@ucc-h2:~$ ip route get 192.168.3.11
    192.168.3.11 via 192.168.1.1 dev ens7 src 192.168.1.12 uid 1000
        cache
  - amir@ucc-h5:~$ ip route get 192.168.1.12
    192.168.1.12 via 192.168.3.1 dev ens7 src 192.168.3.11 uid 1000
        cache
Yeah, I think that's it. But please let me know if I'm missing something I'll be happy to provide more info.
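To make the addressing above easier to reason about, here is a small Python sketch that computes the subnet of each interface and lists interfaces on different hosts that land in the same subnet. The addresses are copied from the list above; the function names are my own. Whether Open MPI/PRRTE uses subnet membership when choosing which address to try is an assumption on my part, not something I have confirmed.

```python
import ipaddress

# Addresses copied from the interface list in this report.
INTERFACES = {
    ("ucc-h2", "ens3"): "192.168.122.15/24",
    ("ucc-h2", "ens7"): "192.168.1.12/24",
    ("ucc-h5", "enp5s1"): "192.168.122.195/24",
    ("ucc-h5", "ens7"): "192.168.3.11/24",
}


def networks(interfaces):
    """Map each (host, interface) pair to the IPv4 network it belongs to."""
    return {
        key: ipaddress.ip_interface(cidr).network
        for key, cidr in interfaces.items()
    }


def shared_subnets(interfaces):
    """Return pairs of interfaces on different hosts that share a subnet."""
    nets = networks(interfaces)
    keys = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
        if a[0] != b[0] and nets[a] == nets[b]
    ]


if __name__ == "__main__":
    nets = networks(INTERFACES)
    for a, b in shared_subnets(INTERFACES):
        print(f"{a[1]}@{a[0]} and {b[1]}@{b[0]} share {nets[a]}")
```

The only cross-host pair in a common subnet is the two management interfaces (192.168.122.0/24), which cannot reach each other. The two ens7 interfaces, which can, are in different subnets (192.168.1.0/24 and 192.168.3.0/24) and only connect through the router.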
Details of the problem
I am trying to get mpirun -n 1 --host <hostname> hostname to work on both hosts.
amir@ucc-h2:~$ mpirun -n 1 --host ucc-h5 hostname
ucc-h5
So it's working fine on ucc-h2. But on ucc-h5:
amir@ucc-h5:~$ mpirun -n 1 --host ucc-h2 hostname
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.
HNP daemon : [prterun-ucc-h5-196668@0,0] on node ucc-h5
Remote daemon: [prterun-ucc-h5-196668@0,1] on node ucc-h2
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
It's timing out.
I ran the commands on both ucc-h2 and ucc-h5 with high verbosity to compare them:
mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 -n 1 --host <hostname> hostname
The full outputs are really long so I've included them in separate files in this gist.
But in summary...
In both cases (the successful one and the failing one), the remote daemon tries to connect to the master node over both interfaces. The connection over the wrong interface (ens3@ucc-h2 or enp5s1@ucc-h5) times out after a couple of retries:
prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.122.15:36849 - 1 retries
prte_tcp_peer_try_connect: 192.168.122.15:36849 is down
It then tries the other interface (the right one):
prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries
Here is, as far as I could tell, where things differ depending on which node is the master node, causing the asymmetric behaviour.
If the master node is ucc-h2:
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] oob:tcp:peer creating socket to [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] waiting for connect completion to [prterun-ucc-h2-281015@0,0] - activating send event
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler called to send to peer [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler CONNECTING
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1]:tcp:complete_connect called for peer [prterun-ucc-h2-281015@0,0] on socket 36
[ucc-h2:281015] [prterun-ucc-h2-281015@0,0] prte_oob_tcp_listen_thread: incoming connection: (40, 0) 192.168.3.11:59759
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp_peer_complete_connect: sending ack to [prterun-ucc-h2-281015@0,0]
The connection is established.
If the master node is ucc-h5:
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 0 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] oob:tcp:peer creating socket to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] waiting for connect completion to [prterun-ucc-h5-195762@0,0] - activating send event
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: ucc-h2
Remote host: 192.168.3.11
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler called to send to peer [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:tcp:complete_connect called for peer [prterun-ucc-h5-195762@0,0] on socket 36
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]-[prterun-ucc-h5-195762@0,0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp_peer_close for [prterun-ucc-h5-195762@0,0] sd 36 state CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:[oob_tcp_connection.c:1066] connect to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on socket -1
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.122.195:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.122.195:47555 is down
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.3.11:47555 is down
It finds both addresses down.
I also dug a little in network traffic and found this on ens7@ucc-h2 when ucc-h5 is the master node (the failing case):
Looks like ucc-h2 is trying to talk to ucc-h5 through its ens7 interface but with the source IP of its ens3 interface! I don't have enough experience with networking to know how this could happen. This was actually a surprise to me. I don't know if this is the root cause or is a symptom of another issue.
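To narrow this down outside of mpirun, one option is to ask the kernel which source address it would pick for a given destination, and to try a TCP connection from an explicitly chosen source address. The sketch below uses only the Python standard library; the function names are my own, and it is a diagnostic aid under the assumption that the problem can be reproduced with plain sockets, not a description of what PRRTE does internally.

```python
import socket


def source_address_for(dest_ip, dest_port=9):
    """Return the local IPv4 address the kernel would use to reach dest_ip.

    Connecting a UDP socket sends no packets; it only makes the kernel
    select a route and a source address, which getsockname() then reports.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((dest_ip, dest_port))
        return sock.getsockname()[0]


def try_connect(dest_ip, dest_port, source_ip=None, timeout=5.0):
    """Attempt a TCP connection, optionally bound to a fixed source address.

    Returns (True, local_address) on success or (False, error_text) on failure.
    """
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection(
            (dest_ip, dest_port), timeout=timeout, source_address=source
        ) as conn:
            return True, conn.getsockname()[0]
    except OSError as exc:
        return False, str(exc)
```

On ucc-h2, source_address_for("192.168.3.11") should agree with the src field of ip route get shown earlier (192.168.1.12). With nc -l <port num> listening on ucc-h5, calling try_connect("192.168.3.11", <port num>, source_ip="192.168.122.15") would then show whether a socket bound to the ens3 address fails in the same way as the daemon's connection. If it does, that would suggest (but not prove) that the daemon's socket is being bound to the ens3 address before connecting.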
I know there are existing issues about Open MPI misbehaving on hosts with multiple interfaces, such as #5818 and #12232, but I couldn't find my answer there.
I have tried all sorts of if_include/if_exclude flags on multiple MCA frameworks (opal, oob, prte, etc.), using both interface names and CIDR notation as parameters. It's likely I have made a mistake, so please let me know how this is properly done; I'm open to suggestions. For example, I tried this, which made the most sense to me:
mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 --mca oob_tcp_if_exclude 192.168.122.0/24 --prtemca oob_tcp_if_exclude 192.168.122.0/24 -n 1 --host ucc-h2 hostname
But it didn't change the outcome; it's still trying both interfaces.
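To check this without eyeballing the verbose output, a small script can scan the log for connection attempts and report any that target the subnet I tried to exclude. This is a rough sketch: the regular expression assumes the prte_tcp_peer_try_connect line format exactly as it appears in the excerpts in this report, the function names are my own, and the sample lines are copied from the failing run above.

```python
import ipaddress
import re

# Matches lines such as:
#   ... prte_tcp_peer_try_connect: attempting to connect to proc [...] on 192.168.3.11:47555 - 1 retries
ATTEMPT_RE = re.compile(
    r"prte_tcp_peer_try_connect: attempting to connect to proc \S+ on "
    r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)

# Sample lines copied from the failing run in this report.
SAMPLE_LOG = """\
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on socket -1
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.122.195:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 1 retries
"""


def attempted_addresses(log_text):
    """Return the distinct (ip, port) connection targets, in first-seen order."""
    seen = []
    for match in ATTEMPT_RE.finditer(log_text):
        target = (match.group(1), int(match.group(2)))
        if target not in seen:
            seen.append(target)
    return seen


def excluded_but_attempted(log_text, excluded_cidr):
    """Return attempted targets that fall inside the CIDR that was excluded."""
    net = ipaddress.ip_network(excluded_cidr)
    return [
        target
        for target in attempted_addresses(log_text)
        if ipaddress.ip_address(target[0]) in net
    ]


if __name__ == "__main__":
    for ip, port in excluded_but_attempted(SAMPLE_LOG, "192.168.122.0/24"):
        print(f"still attempted despite exclude: {ip}:{port}")
```

Any address reported here means the daemon still tried an address in 192.168.122.0/24, which would be consistent with the exclude setting not reaching the component that picks the connection addresses.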
This is the furthest I've been able to go. I appreciate any hints or directions for investigating this issue further. I haven't been able to reproduce/isolate it on the network side because all the tools that I know work normally. The issue only appears when using mpirun.
