Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was built from the released source tarball (https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.bz2) using GCC 14.2 with the following configure options (an expanded example follows the option list):
./configure --prefix=%{_prefix}
--libdir=%{_libdir}
--enable-shared
--disable-heterogeneous
--enable-prte-prefix-by-default
--enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda,io-romio341
--with-show-load-errors=no
--with-slurm
--with-psm2
--with-pmix=internal
--with-ucx=%{builddir}%{_prefix}
--with-ofi=%{builddir}%{_prefix}
In addition, we use a manually built libfabric v1.21.0 and UCX v1.16.0.
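For reference, a minimal sketch of the same configure invocation with the RPM macros expanded. The concrete prefix (/usr), libdir (/usr/lib64), and staging directory ($BUILDDIR) are illustrative assumptions, not the exact values from our spec file:
# $BUILDDIR stands in for %{builddir}; /usr and /usr/lib64 are assumed expansions of %{_prefix} and %{_libdir}
./configure --prefix=/usr \
            --libdir=/usr/lib64 \
            --enable-shared \
            --disable-heterogeneous \
            --enable-prte-prefix-by-default \
            --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda,io-romio341 \
            --with-show-load-errors=no \
            --with-slurm \
            --with-psm2 \
            --with-pmix=internal \
            --with-ucx=$BUILDDIR/usr \
            --with-ofi=$BUILDDIR/usr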
Please describe the system on which you are running
- Operating system/version: Rocky Linux 8 / kernel 4.18
- Computer hardware: Intel Xeon Gold 6252
- Network type: Intel Omni-Path, 100 Gbit/s
Details of the problem
On our cluster running Slurm 23.11.5, Open MPI 5.0 jobs (here the Open MPI ring example) fail inside a Slurm allocation when started with srun. The same job runs successfully when started with mpirun. The PML is set to "cm" and the MTL to "psm2" via environment variables. Jobs using the ucx or ob1 PML run successfully, but with higher latency.
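The component selection is done roughly as follows; this is a sketch using the standard OMPI_MCA_* environment variables, and the exact place where we export them (job script vs. module file) is omitted here:
$ export OMPI_MCA_pml=cm
$ export OMPI_MCA_mtl=psm2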
$ salloc --tasks-per-node=5 -t 0:10:0
salloc: Granted job allocation YYYYY
salloc: Waiting for resource configuration
salloc: Nodes XXXX are ready for job
$ srun ./ring
slurmstepd: error: *** STEP YYYYY ON XXXXX CANCELLED AT 2024-10-28T15:49:29 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Connection reset by peer (104) [sd = 32]
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Connection reset by peer (104) [sd = 35]
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Broken pipe (32) [sd = 26]
srun: error: XXXXX: tasks 0-4: Killed
srun: Terminating StepId=YYYYY.0
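The pmix_ptl_base errors above suggest the step fails during PMIx wire-up. For completeness, a sketch of the checks we would run to see which PMI plugins this srun offers and to request PMIx explicitly (assuming Slurm was built with the pmix plugin):
$ srun --mpi=list
$ srun --mpi=pmix ./ring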
$ mpirun ./ring
Process 0 sending 10 to 1, tag 201 (5 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
$
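For completeness, forcing the same PML/MTL selection on the mpirun command line (equivalent to the environment variables shown above) would look like this; a usage sketch, not a separate run:
$ mpirun --mca pml cm --mca mtl psm2 ./ring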