Description
I'm having an issue with a two-node job launch when using srun and am looking for some help. A normal mpirun, or invoking mpirun from within an salloc, works fine.
The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured as an active-backup bond, which I know is odd, but that is how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.
Software is Open MPI 5.0.3 with user-built hwloc (2.10.0), PMIx (5.0.2), and Slurm (23.11.6).
Build options are:
./configure \
--prefix=/home/brent/sys/openmpi/5.0.3 \
--with-hwloc=/home/brent/sys/hwloc/2.10.0 \
--with-slurm=/home/brent/sys/slurm/23.11.5 \
--with-pmix=/home/brent/sys/pmix/5.0.2 \
--disable-ipv6 \
--enable-orterun-prefix-by-default
Slurm was also built with the same versions of hwloc and pmix.
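Given the matched builds, one quick sanity check is to confirm that both stacks actually report the intended PMIx. This is a minimal sketch, assuming ompi_info and srun are on the PATH; the pmix_mentions helper is made up for illustration, and the commands are guarded so the snippet is safe to paste anywhere:

```shell
# pmix_mentions: filter stdin for PMIx-related lines (illustrative helper).
pmix_mentions() { grep -i 'pmix'; }

# Open MPI's configure line (including --with-pmix) appears in ompi_info output.
command -v ompi_info >/dev/null && ompi_info | pmix_mentions || true

# List the MPI plugins this Slurm build offers (pmix should be among them).
command -v srun >/dev/null && srun --mpi=list || true
```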
A simple test works fine with mpirun directly:
$ mpirun -n 2 -H cn03,cn04 ./hello_mpi.503
Hello from rank 0 on cn03
Hello from rank 1 on cn04
An mpirun from within an salloc works fine:
$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503
salloc: Granted job allocation 136
salloc: Nodes cn[03-04] are ready for job
Hello from rank 0 on cn03
Hello from rank 1 on cn04
salloc: Relinquishing job allocation 136
However, srun does not work:
[cn04:545610] [[55359,0],1] selected pml ob1, but peer [[55359,0],0] on cn03 selected pml
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 138.0 ON cn03 CANCELLED AT 2024-04-18T13:05:02 ***
srun: error: cn04: task 1: Exited with exit code 14
srun: Terminating StepId=138.0
srun: error: cn03: task 0: Killed
Note that on the error line, the peer's selected PML is blank, which seems odd. It does not matter which two nodes I pick: the first node (cn04 above) always complains that the second node (cn03 above) does not match in PML selection.
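For the srun path specifically, one hedged first step (an assumption about likely causes, not a confirmed fix) is to request Slurm's PMIx plugin explicitly and to pin the same PML on all ranks while raising PML selection verbosity. The node list mirrors the examples above; the MCA values are illustrative:

```shell
# Pin ob1 on every rank and turn up PML selection logging (illustrative values).
export OMPI_MCA_pml=ob1
export OMPI_MCA_pml_base_verbose=10

# Ask srun for the PMIx plugin explicitly; guarded so the line is safe off-cluster.
command -v srun >/dev/null && \
    srun --mpi=pmix -w cn03,cn04 --ntasks-per-node=1 ./hello_mpi.503 || true
```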
I added OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100 to my launch, cleaned up the output a little, and then diffed the two runs. No differences appear up to the point where the error message comes out.
$ diff good bad
26c26
< [cn04] btl:tcp: 0x904660: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
---
> [cn04] btl:tcp: 0x24e5830: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
57c57
< [cn03] btl:tcp: 0xc33660: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
---
> [cn03] btl:tcp: 0x1a69830: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
63,84c63,70
< [cn03] mca: bml: Using self btl for send to [[56142,1],0] on node cn03
< [cn04] mca: bml: Using self btl for send to [[56142,1],1] on node cn04
< Hello from rank 0 on cn03
< [cn04] mca: bml: Using tcp btl for send to [[56142,1],0] on node cn03
< [cn04] btl: tcp: attempting to connect() to [[56142,1],0] address 10.23.0.13 on port 1024
< [cn04] btl:tcp: would block, so allowing background progress
< [cn03] mca: bml: Using tcp btl for send to [[56142,1],1] on node cn04
< [cn03] btl: tcp: attempting to connect() to [[56142,1],1] address 10.23.0.14 on port 1024
< [cn03] btl:tcp: would block, so allowing background progress
< [cn03] btl:tcp: connect() to 10.23.0.14:1024 completed (complete_connect), sending connect ACK
< [cn03] btl:tcp: now connected to 10.23.0.14, process [[56142,1],1]
< [cn04] btl:tcp: connect() to 10.23.0.13:1024 completed (complete_connect), sending connect ACK
< [cn04] btl:tcp: now connected to 10.23.0.13, process [[56142,1],0]
< Hello from rank 1 on cn04
< [cn04] mca: base: close: component self closed
< [cn04] mca: base: close: unloading component self
< [cn04] mca: base: close: component tcp closed
< [cn04] mca: base: close: unloading component tcp
< [cn03] mca: base: close: component self closed
< [cn03] mca: base: close: unloading component self
< [cn03] mca: base: close: component tcp closed
< [cn03] mca: base: close: unloading component tcp
---
> [cn03] mca: bml: Using self btl for send to [[43367,0],0] on node cn03
> [cn04] [[43367,0],1] selected pml ob1, but peer [[43367,0],0] on cn03 selected pml
>
> slurmstepd: error: *** STEP 131.0 ON cn03 CANCELLED AT 2024-04-18T10:42:56 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: cn04: task 1: Exited with exit code 14
> srun: Terminating StepId=131.0
> srun: error: cn03: task 0: Killed
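Since the only pre-failure differences in the diff above are heap addresses (e.g. 0x904660 vs 0x24e5830), that noise can be masked before comparing so only real divergence remains. A small sketch; strip_addrs is a made-up helper name, and good/bad are the cleaned log files from the diff:

```shell
# strip_addrs: replace hex pointers like 0x904660 with a fixed token so
# otherwise-identical verbose lines compare equal.
strip_addrs() { sed -E 's/0x[0-9a-fA-F]+/0xADDR/g' "$1"; }

# Compare the normalized logs; guarded so the sketch runs standalone.
[ -f good ] && [ -f bad ] && diff <(strip_addrs good) <(strip_addrs bad) || true
```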
I tried Open MPI 4.1.6 as well but ran into similar issues. I figured it would be easier to debug with 5.x, which is also where I would like to end up. Any guidance on next steps would be appreciated.