PRTE has lost communication with a remote daemon #12840

Please submit all the information below so that we can understand the working environment that is the context for your question.

* If you have a problem building or installing Open MPI, [be sure to read this](https://docs.open-mpi.org/en/main/getting-help.html#for-problems-building-or-installing-open-mpi).
* If you have a problem launching MPI or OpenSHMEM applications, [be sure to read this](https://docs.open-mpi.org/en/main/getting-help.html#for-problems-launching-mpi-or-openshmem-applications).
* If you have a problem running MPI or OpenSHMEM applications (i.e., after launching them), [be sure to read this](https://docs.open-mpi.org/en/main/getting-help.html#for-problems-running-mpi-or-openshmem-applications).

## Background information

### What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI 5.0.3

### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From distribution tarball.

Built with PMIx 4.2.8 and PRRTE 3.0.5.


### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.



### Please describe the system on which you are running

* Operating system/version:
Ubuntu 22.04

* Computer hardware:
AWS EC2 P4D instance, with 8 Nvidia A100 GPUs

```
ubuntu@ip-172-31-8-217:~$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   96
  On-line CPU(s) list:    0-95
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   24
    Socket(s):            2
    Stepping:             7
    BogoMIPS:             5999.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    1.5 MiB (48 instances)
  L1i:                    1.5 MiB (48 instances)
  L2:                     48 MiB (48 instances)
  L3:                     71.5 MiB (2 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-23,48-71
  NUMA node1 CPU(s):      24-47,72-95
Vulnerabilities:
  Gather data sampling:   Unknown: Dependent on hypervisor status
  Itlb multihit:          KVM: Mitigation: VMX unsupported
  L1tf:                   Mitigation; PTE Inversion
  Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Reg file data sampling: Not affected
  Retbleed:               Vulnerable
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```
* Network type:
AWS Elastic Fabric Adapter (EFA); TCP

-----------------------------

## Details of the problem

When I launch an MPI program on a remote host, I get the following error:

```
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-ip-172-31-8-217-350139@0,0] on node ip-172-31-8-217
  Remote daemon: [prterun-ip-172-31-8-217-350139@0,1] on node 172.31.14.44

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
```

I followed the instructions at https://docs.open-mpi.org/en/main/getting-help.html#for-problems-launching-mpi-or-openshmem-applications to generate the `ompi-output.tar.bz2` tarball, which contains the `ompi_info` output, the `lstopo` output, and the verbose output of `mpirun --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100`.


The relevant error snippet from the verbose output is:

```
======================   ALLOCATED NODES   ======================
    172.31.14.44: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: NONE
	aliases: NONE
=================================================================
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] checking node 172.31.14.44
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:base:setup_vm add new daemon [prterun-ip-172-31-8-217-349942@0,1]
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:base:setup_vm assigning new daemon [prterun-ip-172-31-8-217-349942@0,1] to node 172.31.14.44
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: launching vm
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: local shell: 0 (bash)
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: assuming same remote shell as local shell
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: remote shell: 0 (bash)
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: final template argv:
	/usr/bin/ssh <template> PRTE_PREFIX=/opt/amazon/prrte;export PRTE_PREFIX;LD_LIBRARY_PATH=/opt/amazon/prrte/lib:/opt/amazon/pmix/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/opt/amazon/prrte/lib:/opt/amazon/pmix/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/opt/amazon/prrte/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-8-217-349942@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-ip-172-31-8-217-349942@0.0;tcp://172.31.8.217,172.31.2.193,172.31.10.23,172.31.6.99,172.17.0.1:44927:20,20,20,20,16" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-8-217-349942@0.0;tcp://172.31.8.217,172.31.2.193,172.31.10.23,172.31.6.99,172.17.0.1:44927:20,20,20,20,16"
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh:launch daemon 0 not a child of mine
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: adding node 172.31.14.44 to launch list
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: activating launch event
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: recording launch of daemon [prterun-ip-172-31-8-217-349942@0,1]
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 172.31.14.44 PRTE_PREFIX=/opt/amazon/prrte;export PRTE_PREFIX;LD_LIBRARY_PATH=/opt/amazon/prrte/lib:/opt/amazon/pmix/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/opt/amazon/prrte/lib:/opt/amazon/pmix/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/opt/amazon/prrte/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-8-217-349942@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-ip-172-31-8-217-349942@0.0;tcp://172.31.8.217,172.31.2.193,172.31.10.23,172.31.6.99,172.17.0.1:44927:20,20,20,20,16" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-8-217-349942@0.0;tcp://172.31.8.217,172.31.2.193,172.31.10.23,172.31.6.99,172.17.0.1:44927:20,20,20,20,16"]
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] daemon 1 failed with status 1
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-ip-172-31-8-217-349942@0,0] on node ip-172-31-8-217
  Remote daemon: [prterun-ip-172-31-8-217-349942@0,1] on node 172.31.14.44

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:base:receive stop comm
[ip-172-31-8-217:349942] mca: base: close: component ssh closed
[ip-172-31-8-217:349942] mca: base: close: unloading component ssh

```
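The `prte_hnp_uri` value in the `ssh` command line above lists several addresses for the head node, and the remote `prted` has to connect back to one of them. The following is a minimal sketch for splitting that value into its parts so each address can be checked from the remote node. The field layout (`name;tcp://addresses:port:netmask-bits`) is an assumption inferred from the log line, not taken from PRRTE documentation, and the variable names are illustrative.

```shell
# Split a PRRTE-style URI "name;tcp://addr1,addr2,...:port:mask1,mask2,..."
# into its address list and port. The field layout is an assumption inferred
# from the log line above.
hnp_uri='prterun-ip-172-31-8-217-349942@0.0;tcp://172.31.8.217,172.31.2.193,172.31.10.23,172.31.6.99,172.17.0.1:44927:20,20,20,20,16'

contact=${hnp_uri#*tcp://}   # drop everything up to and including "tcp://"
addrs=${contact%%:*}         # text before the first ":" is the address list
rest=${contact#*:}           # text after the first ":" is "port:masks"
port=${rest%%:*}             # text before the next ":" is the port

echo "port: $port"
printf '%s\n' "$addrs" | tr ',' '\n'   # one candidate address per line
```

Each printed address can then be tested for reachability from 172.31.14.44. Whether any particular interface is the cause of the failure is not established by this log.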

See the attachments for the full output.
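For anyone reading the attached logs, a small sketch for pulling out the daemon-failure lines. The function name `find_daemon_failures` is hypothetical, and the pattern is based only on the message format seen in this report.

```shell
# Print daemon-failure lines from a PRRTE verbose log read on stdin.
# The pattern matches messages of the form "daemon N failed with status M".
find_daemon_failures() {
    grep -E 'daemon [0-9]+ failed with status [0-9]+'
}

# Example with two lines taken from the verbose output in this report:
sample='[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] plm:ssh: activating launch event
[ip-172-31-8-217:349942] [prterun-ip-172-31-8-217-349942@0,0] daemon 1 failed with status 1'
failures=$(printf '%s\n' "$sample" | find_daemon_failures)
echo "$failures"
```

On a saved log the same function can be used as `find_daemon_failures < mpirun-verbose.log`, where the file name is a placeholder.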

I can confirm that I can ssh to the node 172.31.14.44 without any issue.
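An interactive login does not exercise the same path as the launcher, which runs `prted` through a non-interactive `ssh` command. The following is a minimal sketch of a closer check, assuming the `prted` path shown in the verbose output. `check_remote_prted` and the `SSH_CMD` override are hypothetical names used for illustration.

```shell
# Check, over non-interactive ssh, that prted exists and is executable on the
# remote node. SSH_CMD is a hypothetical override so the function can be
# exercised without a real remote host.
check_remote_prted() {
    host=$1
    prted_path=${2:-/opt/amazon/prrte/bin/prted}
    if ${SSH_CMD:-ssh} "$host" "test -x $prted_path"; then
        echo "prted found on $host"
    else
        echo "prted missing or not executable on $host"
    fi
}

# Usage (not run here): check_remote_prted 172.31.14.44
```

This only verifies that the binary is present. It does not show whether `prted` can load its libraries or connect back to the head node.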
