Thank you for taking the time to submit an issue!
Background information
Running installation tests on a cluster: v5 (release or built from GitHub) works up to <= 94 nodes, then fails instantly beyond that. v4 works fine. N.B. the job is launched from a login node via Slurm (salloc + mpiexec), as sketched below.
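For reference, a minimal sketch of that launch flow; the node count and test binary name are illustrative, not the exact commands used:

# allocate nodes from the login node via Slurm (node count illustrative)
salloc -N 95
# launch the installation test under that allocation with Open MPI's launcher
mpiexec ./installation_test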
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
main @ 448c3ba
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
source / git
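For completeness, the usual sequence for building Open MPI from a git clone is roughly as follows (install prefix is illustrative; autogen.pl is only needed for git checkouts, not release tarballs):

# generate the configure script (git checkouts only)
./autogen.pl
# configure with an install prefix of your choice
./configure --prefix=$HOME/opt/openmpi-main
# build and install
make -j install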
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179)
0f0a90006cbc880d499b2356d6076e785e7868ba 3rd-party/prrte (psrvr-v2.0.0rc1-4819-g0f0a90006c)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main-1-gdfff675)
Please describe the system on which you are running
- Operating system/version: Rocky 8.4
- Computer hardware: aarch64
- Network type: IB
Details of the problem
Beyond 94 nodes, the job fails immediately with:
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.
HNP daemon : [prterun-login2-2232463@0,0] on node login2
Remote daemon: [prterun-login2-2232463@0,28] on node fj094
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
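If additional daemon-level output would help, the run can be repeated with the v5 launcher's debug options, for example (flags as documented for the v5 mpiexec; output not captured here):

mpiexec --debug-daemons --prtemca plm_base_verbose 5 ./installation_test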