Jobs started with srun using cm/PSM2 fail #12886

@mcarmesin

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was built from the released sources (https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.bz2) using gcc 14.2 and the following configure options:
./configure --prefix=%{_prefix} \
    --libdir=%{_libdir} \
    --enable-shared \
    --disable-heterogeneous \
    --enable-prte-prefix-by-default \
    --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda,io-romio341 \
    --with-show-load-errors=no \
    --with-slurm \
    --with-psm2 \
    --with-pmix=internal \
    --with-ucx=%{builddir}%{_prefix} \
    --with-ofi=%{builddir}%{_prefix}

In addition, we use manually built libfabric v1.21.0 and UCX v1.16.0.
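
For anyone reproducing this, one way to confirm that a build configured as above actually contains the cm PML and the psm2 MTL is to list the components with ompi_info. This is only a sketch: it assumes ompi_info from this installation is first in PATH, and the `components` variable name is purely illustrative.

```shell
# Sketch: list the PML and MTL components this Open MPI build provides.
# Assumption: ompi_info belongs to the installation under test.
if command -v ompi_info >/dev/null 2>&1; then
    # ompi_info prints one "MCA <framework>: <component> ..." line per component.
    components=$(ompi_info | grep -E 'MCA (pml|mtl):' || true)
else
    components="ompi_info not found in PATH"
fi
printf '%s\n' "$components"
```

If psm2 does not appear among the MTL lines, the component was not built or could not be loaded, which would be a different problem from the srun failure reported here.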

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 8 / kernel 4.18
  • Computer hardware: Intel Xeon Gold 6252
  • Network type: Omni-Path, 100 Gbit/s

Details of the problem

On our cluster with Slurm 23.11.5, Open MPI 5.0 jobs (here the Open MPI ring example) fail inside a Slurm allocation when started with srun. The same job runs successfully when started with mpirun. The PML is set to "cm" and the MTL to "PSM2" using environment variables. Jobs using the ucx or ob1 PML run successfully, but with higher latency.

$ salloc --tasks-per-node=5 -t 0:10:0
salloc: Granted job allocation YYYYY
salloc: Waiting for resource configuration
salloc: Nodes XXXX are ready for job
$ srun ./ring
slurmstepd: error: *** STEP YYYYY ON XXXXX CANCELLED AT 2024-10-28T15:49:29 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Connection reset by peer (104) [sd = 32]
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Connection reset by peer (104) [sd = 35]
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Broken pipe (32) [sd = 26]
srun: error: XXXXX: tasks 0-4: Killed
srun: Terminating StepId=YYYYY.0
$ mpirun ./ring
Process 0 sending 10 to 1, tag 201 (5 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
$
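
The environment-variable selection mentioned above is not shown in the transcript. A minimal sketch of what it presumably looks like, assuming the standard OMPI_MCA_<parameter> naming and the lowercase component names cm and psm2 (the exact values used on this cluster are an assumption):

```shell
# Sketch: select the cm PML and the psm2 MTL through the environment.
# Assumption: these are the values used in the report; they are not shown there.
export OMPI_MCA_pml=cm
export OMPI_MCA_mtl=psm2

# Show the MCA settings that launched processes will inherit.
env | grep '^OMPI_MCA_' | sort

# Inside the salloc allocation, the two launches compared in this report would be:
#   srun ./ring      # fails as shown above
#   mpirun ./ring    # succeeds
```

Exporting the variables in the shell that calls srun or mpirun means both launchers pass the same selection to the ranks, so the difference in behaviour should come from the launch path and not from the MCA settings.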
