Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
- Open MPI v5.0.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
- from a source/distribution tarball
Please describe the system on which you are running
- Operating system/version: Red Hat Enterprise Linux 8.8
- Computer hardware: HPE Cray EX
- Network type: HPE Slingshot
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
I have noticed that, intermittently and without any clear pattern, launching a job with mpirun results in a prterun hang: the main application executable never starts. The likelihood of this happening increases when launching a larger number of MPI processes across multiple compute nodes.
Open MPI was compiled with PBS support, as outlined below:
./configure \
CC="gcc" \
CXX="g++" \
FC="gfortran" \
--prefix=${install_dir} \
--enable-shared \
--enable-static \
--with-pbs \
--with-libfabric="/opt/cray" \
--with-libfabric-libdir="/opt/cray/lib64" \
--with-tm="/opt/pbs" \
--with-tm-libdir="/opt/pbs/lib"
I run the application as follows:
mpirun -np 32 pw.x -i scf.input
I have observed the same behavior with earlier Open MPI 5.x releases as well. The issue is not specific to the application I am running (Quantum ESPRESSO) and does not appear to be tied to any particular compute node; simply resubmitting the job usually resolves the problem.
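For reference, since the hang happens before the application itself starts, I believe any MPI program would reproduce it in place of pw.x. A minimal stand-in (a hypothetical example, not the actual application) would be something like:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Trivial MPI program: if the launch succeeds, every rank prints one line.
       When the hang occurs, prterun never starts the executable, so no output appears. */
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d started\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc hello.c -o hello and launched with mpirun -np 32 ./hello across several nodes, this should show the same symptom: when the hang occurs, none of the output is produced because the executable is never started.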