Description
Hello,
Background information
We are running OpenMPI on a small GPU cluster composed of two front-end nodes, hpc-vesta[1-2], and three compute nodes, hpc-mn[101-103].
When running everything locally, the execution completes quickly:
[xxx@hpc-vesta1 ~]$ time mpirun --host hpc-vesta1 hostname
hpc-vesta1.hpc.lan
real 0m0.041s
user 0m0.007s
sys 0m0.023s
However, when running the same command remotely, it takes more than two minutes to complete:
[xxx@hpc-vesta1 ~]$ time mpirun --host hpc-mn101 hostname
hpc-mn101.hpc.lan
real 2m9.159s
user 0m0.013s
sys 0m0.019s
The same behavior occurs when running from hpc-vesta1 with --host hpc-mn102, --host hpc-mn103, or --host hpc-vesta2, and also when running across multiple nodes.
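For reference, the SSH hop itself can be timed independently of mpirun (a minimal check; hostname here stands in for any trivial remote command), so the raw connection time can be compared against the full mpirun launch time:
[xxx@hpc-vesta1 ~]$ time ssh hpc-mn101 hostname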
Version of Open MPI
We have tested all v5.0.x versions and are currently using v5.0.7.
Describe how Open MPI was installed
Open MPI was installed from the official website's source tarball.
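For reference, a minimal sketch of that tarball build (the make invocation is an assumption; the prefix matches the ompi_info output below):
tar xf openmpi-5.0.7.tar.bz2 && cd openmpi-5.0.7
./configure --prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1 --enable-prte-prefix-by-default
make -j $(nproc) all && make install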
[xxx@hpc-vesta1 ~]$ ompi_info
Package: Open MPI [email protected] Distribution
Open MPI: 5.0.7
Open MPI repo revision: v5.0.7
Open MPI release date: Feb 14, 2025
MPI API: 3.1.0
Ident string: 5.0.7
Prefix: /nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1
Configured architecture: x86_64-pc-linux-gnu
Configured by: root
Configured on: Wed Feb 26 13:02:06 UTC 2025
Configure host: hpc-vesta1.hpc.lan
Configure command line: '--prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1'
'--enable-prte-prefix-by-default'
Built by: root
Built on: Wed Feb 26 13:04:59 UTC 2025
Built host: hpc-vesta1.hpc.lan
C bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the /opt/rocm-5.7.1/llvm/bin/flang
compiler and/or Open MPI, does not support the
following: array subsections, direct passthru
(where possible) to underlying Open MPI's C
functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: /opt/rocm-5.7.1/llvm/bin/clang
C compiler absolute: /opt/rocm-5.7.1/llvm/bin/clang
C compiler family name: CLANG
C compiler version: 17.0.0
(https://github.com/RadeonOpenCompute/llvm-project
roc-5.7.1 23382
f3e174a1d286158c06e4cc8276366b1d4bc0c914)
C++ compiler: /opt/rocm-5.7.1/llvm/bin/clang++
C++ compiler absolute: /opt/rocm-5.7.1/llvm/bin/clang++
Fort compiler: /opt/rocm-5.7.1/llvm/bin/flang
Fort compiler abs: /opt/rocm-5.7.1/llvm/bin/flang
Fort ignore TKR: yes (!DIR$ IGNORE_TKR)
Fort 08 assumed shape: no
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
Fault Tolerance support: yes
FT MPI support: yes
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA accelerator: rocm (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.7)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
v5.0.7)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
v5.0.7)
MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
v5.0.7)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.7)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
v5.0.7)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.7)
MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.7)
MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
v5.0.7)
MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.7)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
v5.0.7)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
v5.0.7)
Please describe the system on which you are running
- OS: RHEL 8.8 (kernel 4.18.0-477.10.1.el8_8.x86_64)
- Front-end node hardware (hpc-vesta[1-2]): 2 × AMD EPYC 7313 16-Core Processors
- Compute node hardware (hpc-mn[101-103]): 2 × AMD EPYC 7643 48-Core Processors + 10 × AMD MI210 GPUs per node
- Network type:
The connection between hpc-vesta[1-2] and hpc-mn[101-103] uses Ethernet with LACP.
Each hpc-vesta[1-2] node has two interfaces bonded (bond0).
[root@hpc-vesta1 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens1f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 6c:fe:54:47:8a:e4 brd ff:ff:ff:ff:ff:ff
altname enp3s0f0
3: ens1f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 6c:fe:54:47:8a:e4 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:8a:e5
altname enp3s0f1
4: ens1f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP group default qlen 1000
link/ether 6c:fe:54:47:8a:e6 brd ff:ff:ff:ff:ff:ff
altname enp3s0f2
5: ens1f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP group default qlen 1000
link/ether 6c:fe:54:47:8a:e6 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:8a:e7
altname enp3s0f3
6: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 6c:fe:54:47:8a:e6 brd ff:ff:ff:ff:ff:ff
inet **REDACTED**/28 brd **REDACTED** scope global noprefixroute bond1
valid_lft forever preferred_lft forever
inet6 fe80::5a62:a881:97b2:6d50/64 scope link noprefixroute
valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 6c:fe:54:47:8a:e4 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.4/21 brd 192.168.7.255 scope global noprefixroute bond0
valid_lft forever preferred_lft forever
inet6 fe80::4212:b9f9:8ab6:399f/64 scope link noprefixroute
valid_lft forever preferred_lft forever
Each hpc-mn[101-103] node has four interfaces bonded (bond0).
[root@hpc-mn101 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens21f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff
altname enp166s0f0
3: ens21f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:28:81
altname enp166s0f1
4: ens21f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:28:82
altname enp166s0f2
5: ens21f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:28:83
altname enp166s0f3
6: mlx5_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:9f:d9:56 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 12.12.12.1/24 brd 12.12.12.255 scope global noprefixroute mlx5_ib0
valid_lft forever preferred_lft forever
inet6 fe80::9812:40a6:ac02:2225/64 scope link noprefixroute
valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.1/21 brd 192.168.7.255 scope global noprefixroute bond0
valid_lft forever preferred_lft forever
inet6 fe80::35af:c281:f3e9:67bc/64 scope link noprefixroute
valid_lft forever preferred_lft forever
The Mellanox cards are only used among hpc-mn101, hpc-mn102, and hpc-mn103, so they are not involved here.
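To confirm which interface each side uses to reach the other over the 192.168.0.0/21 network, the kernel's routing decision can be checked directly (a simple sanity check, not specific to Open MPI); both should resolve to bond0:
[root@hpc-mn101 ~]# ip route get 192.168.0.4   # front-end bond0 address
[root@hpc-vesta1 ~]$ ip route get 192.168.0.1  # compute-node bond0 address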
Details of the problem
Steps I already tried that didn't change anything:
- I checked whether the issue was coming from my ./configure options, so I removed almost all the options I was initially using:
--with-tests-examples
--enable-mca-no-build=btl-uct
--enable-mpi-fortran=all
--enable-prte-prefix-by-default
--enable-shared
--enable-static
--with-rocm="$SITE_PATH_ROCM"
--with-ucc="$UCC_PATH"
--with-ucx="$UCX_PATH"
--without-verbs
but nothing changed. Even with the most minimal configuration (--prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1 --enable-prte-prefix-by-default) the issue persists.
- Changed the compiler from amdclang (shipped with ROCm 5.7.1) to gcc-13.0.2.
- Tried using an external PRRTE and an external PMIx.
- I can SSH from hpc-vesta1 to hpc-vesta2, hpc-mn101, hpc-mn102, and hpc-mn103, but I cannot SSH back in the other direction, except when adding -A.
- Checked nftables, which runs only on hpc-vesta[1-2]. All incoming traffic to the bond0 interface is allowed.
- Ran the following command (see also the note after the debug output below):
mpirun --mca btl_tcp_if_include 192.168.0.0/21 --host hpc-mn101 hostname
- Tried the default sshd_config.
- Ran the following command:
[xxx@hpc-vesta1 ~]$ mpirun --debug-daemons --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc --host hpc-mn101 hostname
[hpc-vesta1.hpc.lan:2224251] mca: base: component_find: searching NULL for plm components
[hpc-vesta1.hpc.lan:2224251] mca: base: find_dyn_components: checking NULL for plm components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: registering framework plm components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component slurm
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component slurm register function successful
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component ssh
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component ssh register function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: opening plm components
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component slurm
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component slurm open function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component ssh
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component ssh open function successful
[hpc-vesta1.hpc.lan:2224251] mca:base:select: Auto-selecting plm components
[hpc-vesta1.hpc.lan:2224251] mca:base:select:( plm) Querying component [slurm]
[hpc-vesta1.hpc.lan:2224251] mca:base:select:( plm) Querying component [ssh]
[hpc-vesta1.hpc.lan:2224251] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[hpc-vesta1.hpc.lan:2224251] mca:base:select:( plm) Query of component [ssh] set priority to 10
[hpc-vesta1.hpc.lan:2224251] mca:base:select:( plm) Selected component [ssh]
[hpc-vesta1.hpc.lan:2224251] mca: base: close: component slurm closed
[hpc-vesta1.hpc.lan:2224251] mca: base: close: unloading component slurm
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive start comm
[hpc-vesta1.hpc.lan:2224251] mca: base: component_find: searching NULL for rmaps components
[hpc-vesta1.hpc.lan:2224251] mca: base: find_dyn_components: checking NULL for rmaps components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: registering framework rmaps components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component ppr
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component ppr register function successful
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component rank_file
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component rank_file has no register or open function
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component round_robin
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component round_robin register function successful
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component seq
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component seq register function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: opening rmaps components
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component ppr
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component ppr open function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component rank_file
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component round_robin
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component round_robin open function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component seq
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component seq open function successful
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component ppr
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [ppr]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component rank_file
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [rank_file]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component round_robin
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [round_robin]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component seq
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [seq]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0]: Final mapper priorities
[hpc-vesta1.hpc.lan:2224251] Mapper: rank_file Priority: 100
[hpc-vesta1.hpc.lan:2224251] Mapper: ppr Priority: 90
[hpc-vesta1.hpc.lan:2224251] Mapper: seq Priority: 60
[hpc-vesta1.hpc.lan:2224251] Mapper: round_robin Priority: 10
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm creating map
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] setup:vm: working unmanaged allocation
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] using dash_host
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] checking node hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm add new daemon [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm assigning new daemon [prterun-hpc-vesta1-2224251@0,1] to node hpc-mn101
====================== ALLOCATED NODES ======================
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: launching vm
hpc-mn101: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: SLOTS_GIVEN
aliases: NONE
=================================================================
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: local shell: 0 (bash)
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: assuming same remote shell as local shell
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: remote shell: 0 (bash)
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: final template argv:
/usr/bin/ssh <template> PRTE_PREFIX=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1;export PRTE_PREFIX;LD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/bin/prted --debug-daemons --prtemca ess "env" --prtemca ess_base_nspace "prterun-hpc-vesta1-2224251@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21"
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh:launch daemon 0 not a child of mine
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: adding node hpc-mn101 to launch list
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: activating launch event
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: recording launch of daemon [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh hpc-mn101 PRTE_PREFIX=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1;export PRTE_PREFIX;LD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/bin/prted --debug-daemons --prtemca ess "env" --prtemca ess_base_nspace "prterun-hpc-vesta1-2224251@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21"]
Daemon was launched on hpc-mn101 - beginning to initialize
[hpc-mn101.hpc.lan:2020523] mca: base: component_find: searching NULL for plm components
[hpc-mn101.hpc.lan:2020523] mca: base: find_dyn_components: checking NULL for plm components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: registering framework plm components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component ssh
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component ssh register function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: opening plm components
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component ssh
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component ssh open function successful
[hpc-mn101.hpc.lan:2020523] mca:base:select: Auto-selecting plm components
[hpc-mn101.hpc.lan:2020523] mca:base:select:( plm) Querying component [ssh]
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh_lookup on agent ssh : rsh path NULL
[hpc-mn101.hpc.lan:2020523] mca:base:select:( plm) Query of component [ssh] set priority to 10
[hpc-mn101.hpc.lan:2020523] mca:base:select:( plm) Selected component [ssh]
[hpc-mn101.hpc.lan:2020523] mca: base: component_find: searching NULL for rmaps components
[hpc-mn101.hpc.lan:2020523] mca: base: find_dyn_components: checking NULL for rmaps components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: registering framework rmaps components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component ppr
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component ppr register function successful
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component rank_file
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component rank_file has no register or open function
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component round_robin
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component round_robin register function successful
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component seq
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component seq register function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: opening rmaps components
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component ppr
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component ppr open function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component rank_file
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component round_robin
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component round_robin open function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component seq
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component seq open function successful
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component ppr
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [ppr]
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component rank_file
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [rank_file]
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component round_robin
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [round_robin]
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component seq
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [seq]
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1]: Final mapper priorities
[hpc-mn101.hpc.lan:2020523] Mapper: rank_file Priority: 100
[hpc-mn101.hpc.lan:2020523] Mapper: ppr Priority: 90
[hpc-mn101.hpc.lan:2020523] Mapper: seq Priority: 60
[hpc-mn101.hpc.lan:2020523] Mapper: round_robin Priority: 10
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh_setup on agent ssh : rsh path NULL
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:base:receive start comm
-- HANGS HERE !! --
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh: remote spawn called
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh: remote spawn - have no children!
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch from daemon [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch from daemon [prterun-hpc-vesta1-2224251@0,1] on node hpc-mn101
[hpc-vesta1.hpc.lan:2224251] ALIASES FOR NODE hpc-mn101 (hpc-mn101)
[hpc-vesta1.hpc.lan:2224251] ALIAS: hpc-mn101.hpc.lan
[hpc-vesta1.hpc.lan:2224251] ALIAS: 12.12.12.1
[hpc-vesta1.hpc.lan:2224251] ALIAS: 192.168.0.1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] RECEIVED TOPOLOGY SIG 8N:2S:16L3:96L2:96L1:96C:96H:0-95::x86_64:le FROM NODE hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch completed for daemon [prterun-hpc-vesta1-2224251@0,1] at contact [email protected];tcp://12.12.12.1,192.168.0.1:46399:24,21
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch job prterun-hpc-vesta1-2224251@0 recvd 2 of 2 reported daemons
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setting slots for node hpc-vesta1 by core
====================== ALLOCATED NODES ======================
hpc-mn101: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: hpc-mn101.hpc.lan,12.12.12.1,192.168.0.1
=================================================================
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive processing msg
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive job launch command from [prterun-hpc-vesta1-2224251@0,0]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive adding hosts
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive calling spawn
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive done processing commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_job
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm_base:setup_vm NODE hpc-mn101 WAS NOT ADDED
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm no new daemons required
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: mapping job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: setting mapping policies for job prterun-hpc-vesta1-2224251@1 inherit TRUE hwtcpus FALSE
[hpc-vesta1.hpc.lan:2224251] mca:rmaps mapping not given - using bycore
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] using dash_host hpc-mn101
[hpc-vesta1.hpc.lan:2224251] NODE hpc-vesta1 DOESNT MATCH NODE hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] node hpc-mn101 has 1 slots available
[hpc-vesta1.hpc.lan:2224251] AVAILABLE NODES FOR MAPPING:
[hpc-vesta1.hpc.lan:2224251] node: hpc-mn101 daemon: 1 slots_available: 1
====================== ALLOCATED NODES ======================
hpc-mn101: slots=1 max_slots=0 slots_inuse=0 state=UP
[hpc-vesta1.hpc.lan:2224251] setdefaultbinding[316] binding not given - using bycore
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rf: job prterun-hpc-vesta1-2224251@1 not using rankfile policy
aliases: hpc-mn101.hpc.lan,12.12.12.1,192.168.0.1
=================================================================
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:ppr: job prterun-hpc-vesta1-2224251@1 not using ppr mapper PPR NULL policy PPR NOTSET
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] rmaps:seq called on job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:seq: job prterun-hpc-vesta1-2224251@1 not using seq mapper
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr: mapping job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] using dash_host hpc-mn101
[hpc-vesta1.hpc.lan:2224251] NODE hpc-vesta1 DOESNT MATCH NODE hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] node hpc-mn101 has 1 slots available
[hpc-vesta1.hpc.lan:2224251] AVAILABLE NODES FOR MAPPING:
[hpc-vesta1.hpc.lan:2224251] node: hpc-mn101 daemon: 1 slots_available: 1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr:byobj mapping by Core for job prterun-hpc-vesta1-2224251@1 slots 1 num_procs 1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr: found 96 Core objects on node hpc-mn101
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr: assigning proc to object 0
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] get_avail_ncpus: node hpc-mn101 has 0 procs on it
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: compute bindings for job prterun-hpc-vesta1-2224251@1 with policy CORE:IF-SUPPORTED[1007]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: bind [prterun-hpc-vesta1-2224251@1,INVALID] with policy CORE:IF-SUPPORTED
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] BOUND PROC [prterun-hpc-vesta1-2224251@1,INVALID][hpc-mn101] TO package[0][core:0]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] complete_setup on job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:launch_apps for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:send launch msg for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted_cmd: received add_local_procs
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted_cmd: received add_local_procs
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive processing msg
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive local launch complete command from [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got local launch complete for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got local launch complete for vpid 0
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got local launch complete for vpid 0 state RUNNING
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive done processing commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:launch wiring up iof for job prterun-hpc-vesta1-2224251@1
hpc-mn101.hpc.lan
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive processing msg
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive update proc state command from [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got update_proc_state for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got update_proc_state for vpid 0 pid 2020550 state NORMALLY TERMINATED exit_code 0
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive done processing commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:prted_cmd sending prted_exit commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted_cmd: received exit cmd
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted_cmd: exit cmd, 1 routes still exist
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted_cmd: received exit cmd
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted_cmd: all routes and children gone - exiting
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:base:receive stop comm
[hpc-mn101.hpc.lan:2020523] mca: base: close: component ssh closed
[hpc-mn101.hpc.lan:2020523] mca: base: close: unloading component ssh
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive stop comm
[hpc-vesta1.hpc.lan:2224251] mca: base: close: component ssh closed
[hpc-vesta1.hpc.lan:2224251] mca: base: close: unloading component ssh
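Note on the btl_tcp_if_include attempt above: as far as I understand, that parameter only restricts the MPI-level TCP traffic, while the prterun/prted daemons wire up over their own out-of-band TCP connections, and the prte_hnp_uri in the template argv above lists two addresses (185.155.95.146 and 192.168.0.4). A variant that could also be tried is restricting the runtime side to the bond0 network as well (sketch only; the oob_tcp_if_include parameter name is assumed from PRRTE's oob/tcp component):
mpirun --prtemca oob_tcp_if_include 192.168.0.0/21 \
       --mca btl_tcp_if_include 192.168.0.0/21 \
       --host hpc-mn101 hostname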
Thanks in advance for your help.