job launch fails with srun: [cn05] [[31923,0],1] selected pml ob1, but peer [[31923,0],0] on cn04 selected pml #12475

@bhendersonPlano

Description

I'm having an issue with a two-node job launch when using srun and am looking for some help. A normal mpirun, as well as an mpirun invoked from within an salloc, works fine.

The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured in an active-backup bond, which I know is odd, but that is how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.
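For completeness, the bond state can be confirmed on each node through the kernel's bonding status file (a standard Linux path; bond0 is the same interface name that shows up in the verbose output below):

$ cat /proc/net/bonding/bond0   # reports the bonding mode (active-backup) and the currently active slave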

Software is OpenMPI (5.0.3) with user-built hwloc (2.10.0), pmix (5.0.2), and slurm (23.11.6).

Build options are:

./configure \
     --prefix=/home/brent/sys/openmpi/5.0.3 \
     --with-hwloc=/home/brent/sys/hwloc/2.10.0 \
     --with-slurm=/home/brent/sys/slurm/23.11.5 \
     --with-pmix=/home/brent/sys/pmix/5.0.2 \
     --disable-ipv6 \
     --enable-orterun-prefix-by-default

Slurm was also built with the same versions of hwloc and pmix.
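Since srun launches the tasks through Slurm's own PMIx plugin rather than through mpirun, one sanity check I can offer (standard Slurm commands; I'm assuming the plugin registered as pmix, and it may also appear under a versioned name such as pmix_v5) is:

$ srun --mpi=list                             # should list pmix among the available MPI plugin types
$ scontrol show config | grep -i MpiDefault   # which plugin srun uses when --mpi is not given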


A simple test works fine with mpirun directly:

$ mpirun -n 2 -H cn03,cn04 ./hello_mpi.503
Hello from rank 0 on cn03
Hello from rank 1 on cn04

An mpirun from within an salloc works fine:

$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503 
salloc: Granted job allocation 136
salloc: Nodes cn[03-04] are ready for job
Hello from rank 0 on cn03
Hello from rank 1 on cn04
salloc: Relinquishing job allocation 136

However, an srun launch does not work:

[cn04:545610] [[55359,0],1] selected pml ob1, but peer [[55359,0],0] on cn03 selected pml 

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 138.0 ON cn03 CANCELLED AT 2024-04-18T13:05:02 ***
srun: error: cn04: task 1: Exited with exit code 14
srun: Terminating StepId=138.0
srun: error: cn03: task 0: Killed

Note that on the error line, the second node is not showing what it did select, which seems odd. It does not matter which two nodes I select; the first node (cn04 above) always complains about the second node (cn03 above) not matching in the pml selection.
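One experiment worth noting here (a sketch rather than a logged run: OMPI_MCA_pml is the standard Open MPI selection variable, srun exports the caller's environment to the tasks by default, and --mpi=pmix plus the node flags mirroring the salloc run above are assumptions) is pinning the PML on both ranks to see whether the mismatch message survives:

$ OMPI_MCA_pml=ob1 srun --mpi=pmix -w cn03,cn04 --ntasks-per-node=1 ./hello_mpi.503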

I added OMPI_MCA_btl_base_verbose=100 and OMPI_MCA_pml_ob1_verbose=100 to my launch, cleaned up the output a little, and then diffed the two runs. No differences show up before the point where the error message appears.
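Roughly, the two logs were captured like this before diffing (a sketch: the exact cleanup step is omitted, the good/bad file names are just for illustration, and --mpi=pmix is again an assumption about the plugin name):

$ export OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100
$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503 > good 2>&1
$ srun --mpi=pmix -w cn03,cn04 --ntasks-per-node=1 ./hello_mpi.503 > bad 2>&1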

$ diff good bad
26c26
< [cn04] btl:tcp: 0x904660: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
---
> [cn04] btl:tcp: 0x24e5830: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
57c57
< [cn03] btl:tcp: 0xc33660: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
---
> [cn03] btl:tcp: 0x1a69830: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
63,84c63,70
< [cn03] mca: bml: Using self btl for send to [[56142,1],0] on node cn03
< [cn04] mca: bml: Using self btl for send to [[56142,1],1] on node cn04
< Hello from rank 0 on cn03
< [cn04] mca: bml: Using tcp btl for send to [[56142,1],0] on node cn03
< [cn04] btl: tcp: attempting to connect() to [[56142,1],0] address 10.23.0.13 on port 1024
< [cn04] btl:tcp: would block, so allowing background progress
< [cn03] mca: bml: Using tcp btl for send to [[56142,1],1] on node cn04
< [cn03] btl: tcp: attempting to connect() to [[56142,1],1] address 10.23.0.14 on port 1024
< [cn03] btl:tcp: would block, so allowing background progress
< [cn03] btl:tcp: connect() to 10.23.0.14:1024 completed (complete_connect), sending connect ACK
< [cn03] btl:tcp: now connected to 10.23.0.14, process [[56142,1],1]
< [cn04] btl:tcp: connect() to 10.23.0.13:1024 completed (complete_connect), sending connect ACK
< [cn04] btl:tcp: now connected to 10.23.0.13, process [[56142,1],0]
< Hello from rank 1 on cn04
< [cn04] mca: base: close: component self closed
< [cn04] mca: base: close: unloading component self
< [cn04] mca: base: close: component tcp closed
< [cn04] mca: base: close: unloading component tcp
< [cn03] mca: base: close: component self closed
< [cn03] mca: base: close: unloading component self
< [cn03] mca: base: close: component tcp closed
< [cn03] mca: base: close: unloading component tcp
---
> [cn03] mca: bml: Using self btl for send to [[43367,0],0] on node cn03
> [cn04] [[43367,0],1] selected pml ob1, but peer [[43367,0],0] on cn03 selected pml 
> 
> slurmstepd: error: *** STEP 131.0 ON cn03 CANCELLED AT 2024-04-18T10:42:56 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: cn04: task 1: Exited with exit code 14
> srun: Terminating StepId=131.0
> srun: error: cn03: task 0: Killed

I tried with OpenMPI 4.1.6 as well but ran into similar issues. I figured it would be easier to debug with 5.x, and that is where I would like to end up anyway. Any guidance on next steps would be appreciated.
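If it helps whoever picks this up, one next step I can run is raising the PML selection verbosity itself (pml_base_verbose is a standard Open MPI MCA parameter; the srun flags again mirror the runs above) so each rank reports which PML components it considered and which one it chose:

$ OMPI_MCA_pml_base_verbose=100 srun --mpi=pmix -w cn03,cn04 --ntasks-per-node=1 ./hello_mpi.503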
