v5.x failure beyond 94 nodes #12872

@tonycurtis

Description

@tonycurtis

Thank you for taking the time to submit an issue!

Background information

Running installation tests on a cluster: v5 (release tarball or built from GitHub) works on up to 94 nodes, then fails instantly beyond that. v4 works fine. N.B. this is running a job from a login node via SLURM (salloc + mpiexec).
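For reference, a launch along these lines reproduces the setup described above (a sketch only: the node count at the failure boundary follows from the report, but the allocation flags and the test binary are assumptions; only the salloc + mpiexec pattern from a login node is stated):

```shell
# Hypothetical reproduction sketch -- partition/account flags omitted,
# ./mpi_hello stands in for whichever test binary was actually run.

# 94 nodes or fewer: works with both v4 and v5
salloc -N 94 mpiexec ./mpi_hello

# 95 nodes or more: v5 fails instantly with the PRTE error below; v4 is fine
salloc -N 95 mpiexec ./mpi_hello
```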

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

main @ 448c3ba

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

source / git

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179)
 0f0a90006cbc880d499b2356d6076e785e7868ba 3rd-party/prrte (psrvr-v2.0.0rc1-4819-g0f0a90006c)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main-1-gdfff675)

Please describe the system on which you are running

  • Operating system/version: Rocky 8.4
  • Computer hardware: aarch64
  • Network type: IB

Details of the problem

Beyond 94 nodes, the job aborts immediately with:

--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-login2-2232463@0,0] on node login2
  Remote daemon: [prterun-login2-2232463@0,28] on node fj094

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
