
mpirun / prted very slow on multi-nodes #13113

@p-gerhard

Description


Hello,

Background information

We are running OpenMPI on a small GPU cluster composed of two front-end nodes, hpc-vesta[1-2], and three compute nodes, hpc-mn[101-103].

When running everything locally, the execution completes quickly:

[xxx@hpc-vesta1 ~]$ time mpirun --host hpc-vesta1 hostname
hpc-vesta1.hpc.lan

real    0m0.041s
user    0m0.007s
sys     0m0.023s

However, when running the same command against a remote node, it takes more than two minutes to complete:

[xxx@hpc-vesta1 ~]$ time mpirun --host hpc-mn101 hostname
hpc-mn101.hpc.lan

real    2m9.159s
user    0m0.013s
sys     0m0.019s

The same behavior occurs when running from hpc-vesta1 with --host hpc-mn102, --host hpc-mn103, or --host hpc-vesta2, and also when targeting multiple nodes at once.
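
For reference, the multi-node form looks roughly like this (the host list here is illustrative):

[xxx@hpc-vesta1 ~]$ time mpirun --host hpc-mn101,hpc-mn102,hpc-mn103 hostname

The timing output is omitted here; it shows the same multi-minute delay.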

Version of Open MPI

We have tested all v5.0.x versions and are currently using v5.0.7.

Describe how Open MPI was installed

Open MPI was installed from the official website's source tarball.
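
For completeness, the install followed the usual source-tarball workflow, roughly as below (the tarball name and make steps are from memory; the prefix matches the configure line in the ompi_info output):

tar xjf openmpi-5.0.7.tar.bz2
cd openmpi-5.0.7
./configure --prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1 --enable-prte-prefix-by-default
make -j && make install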

[xxx@hpc-vesta1 ~]$ ompi_info
                 Package: Open MPI [email protected] Distribution
                Open MPI: 5.0.7
  Open MPI repo revision: v5.0.7
   Open MPI release date: Feb 14, 2025
                 MPI API: 3.1.0
            Ident string: 5.0.7
                  Prefix: /nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: root
           Configured on: Wed Feb 26 13:02:06 UTC 2025
          Configure host: hpc-vesta1.hpc.lan
  Configure command line: '--prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1'
                          '--enable-prte-prefix-by-default'
                Built by: root
                Built on: Wed Feb 26 13:04:59 UTC 2025
              Built host: hpc-vesta1.hpc.lan
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the /opt/rocm-5.7.1/llvm/bin/flang
                          compiler and/or Open MPI, does not support the
                          following: array subsections, direct passthru
                          (where possible) to underlying Open MPI's C
                          functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: /opt/rocm-5.7.1/llvm/bin/clang
     C compiler absolute: /opt/rocm-5.7.1/llvm/bin/clang
  C compiler family name: CLANG
      C compiler version: 17.0.0
                          (https://github.com/RadeonOpenCompute/llvm-project
                          roc-5.7.1 23382
                          f3e174a1d286158c06e4cc8276366b1d4bc0c914)
            C++ compiler: /opt/rocm-5.7.1/llvm/bin/clang++
   C++ compiler absolute: /opt/rocm-5.7.1/llvm/bin/clang++
           Fort compiler: /opt/rocm-5.7.1/llvm/bin/flang
       Fort compiler abs: /opt/rocm-5.7.1/llvm/bin/flang
         Fort ignore TKR: yes (!DIR$ IGNORE_TKR)
   Fort 08 assumed shape: no
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.7)
         MCA accelerator: rocm (MCA v2.1.0, API v1.0.0, Component v5.0.7)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.7)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.7)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.7)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.7)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.7)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.7)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.7)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.7)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.7)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.7)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.0.7)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.7)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.7)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.7)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.7)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.7)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.7)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.7)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.0.7)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.7)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.7)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.7)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.0.7)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.7)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.7)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.0.7)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.7)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.7)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.7)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.0.7)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.7)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.7)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.7)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.7)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.0.7)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.0.7)

Please describe the system on which you are running

  • OS: RHEL 8.8 (kernel 4.18.0-477.10.1.el8_8.x86_64)

  • Front-end nodes hardware (hpc-vesta[1-2]): 2 × AMD EPYC 7313 16-Core Processors

  • Compute nodes hardware (hpc-mn[101-103]): 2 × AMD EPYC 7643 48-Core Processors + 10 × AMD MI210 GPUs per node

  • Network type:
    The connection between hpc-vesta[1-2] and hpc-mn[101-103] uses Ethernet with LACP.
    Each hpc-vesta[1-2] node has two interfaces bonded (bond0).
[root@hpc-vesta1 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens1f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:fe:54:47:8a:e4 brd ff:ff:ff:ff:ff:ff
    altname enp3s0f0
3: ens1f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:fe:54:47:8a:e4 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:8a:e5
    altname enp3s0f1
4: ens1f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 6c:fe:54:47:8a:e6 brd ff:ff:ff:ff:ff:ff
    altname enp3s0f2
5: ens1f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 6c:fe:54:47:8a:e6 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:8a:e7
    altname enp3s0f3
6: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:fe:54:47:8a:e6 brd ff:ff:ff:ff:ff:ff
    inet **REDACTED**/28 brd **REDACTED** scope global noprefixroute bond1
       valid_lft forever preferred_lft forever
    inet6 fe80::5a62:a881:97b2:6d50/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:fe:54:47:8a:e4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.4/21 brd 192.168.7.255 scope global noprefixroute bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::4212:b9f9:8ab6:399f/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

Each hpc-mn[101-103] node has four interfaces bonded (bond0).
[root@hpc-mn101 ~]# ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens21f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff
    altname enp166s0f0
3: ens21f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:28:81
    altname enp166s0f1
4: ens21f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:28:82
    altname enp166s0f2
5: ens21f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:47:28:83
    altname enp166s0f3
6: mlx5_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:9f:d9:56 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 12.12.12.1/24 brd 12.12.12.255 scope global noprefixroute mlx5_ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::9812:40a6:ac02:2225/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:fe:54:47:28:80 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/21 brd 192.168.7.255 scope global noprefixroute bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::35af:c281:f3e9:67bc/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

The Mellanox (InfiniBand) cards are only used among hpc-mn101, hpc-mn102, and hpc-mn103, so they are not involved here.
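
As a baseline sanity check, a bare SSH round-trip over bond0 (no mpirun involved) can be timed like this:

[xxx@hpc-vesta1 ~]$ time ssh hpc-mn101 hostname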

Details of the problem

Steps I already tried that didn't change anything:

  1. I checked whether the issue was coming from my ./configure options, so I ended up removing almost all the options I was initially using:
--with-tests-examples
--enable-mca-no-build=btl-uct
--enable-mpi-fortran=all
--enable-prte-prefix-by-default
--enable-shared
--enable-static
--with-rocm="$SITE_PATH_ROCM"
--with-ucc="$UCC_PATH"
--with-ucx="$UCX_PATH" 
 --without-verbs 

but nothing changed. Even with the most minimal configuration, --prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1 --enable-prte-prefix-by-default, the issue persists.

  2. Changed the compiler from amdclang (shipped with ROCm 5.7.1) to gcc-13.0.2.

  3. Tried using an external PRRTE and an external PMIx (a rough sketch of that configure line is included after the log below).

  4. I can SSH from hpc-vesta1 to hpc-vesta2, hpc-mn101, hpc-mn102, and hpc-mn103, but I cannot SSH back in the other direction, except when adding -A (agent forwarding).

  5. Checked nftables, which runs only on hpc-vesta[1-2]; all incoming traffic to the bond0 interface is allowed.

  6. Ran the following command: mpirun --mca btl_tcp_if_include 192.168.0.0/21 --host hpc-mn101 hostname

  7. Tried the default sshd_config.

  8. Ran the following command:

[xxx@hpc-vesta1 ~]$ mpirun --debug-daemons --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc --host hpc-mn101 hostname
[hpc-vesta1.hpc.lan:2224251] mca: base: component_find: searching NULL for plm components
[hpc-vesta1.hpc.lan:2224251] mca: base: find_dyn_components: checking NULL for plm components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: registering framework plm components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component slurm
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component slurm register function successful
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component ssh
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component ssh register function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: opening plm components
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component slurm
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component slurm open function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component ssh
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component ssh open function successful
[hpc-vesta1.hpc.lan:2224251] mca:base:select: Auto-selecting plm components
[hpc-vesta1.hpc.lan:2224251] mca:base:select:(  plm) Querying component [slurm]
[hpc-vesta1.hpc.lan:2224251] mca:base:select:(  plm) Querying component [ssh]
[hpc-vesta1.hpc.lan:2224251] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[hpc-vesta1.hpc.lan:2224251] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[hpc-vesta1.hpc.lan:2224251] mca:base:select:(  plm) Selected component [ssh]
[hpc-vesta1.hpc.lan:2224251] mca: base: close: component slurm closed
[hpc-vesta1.hpc.lan:2224251] mca: base: close: unloading component slurm
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive start comm
[hpc-vesta1.hpc.lan:2224251] mca: base: component_find: searching NULL for rmaps components
[hpc-vesta1.hpc.lan:2224251] mca: base: find_dyn_components: checking NULL for rmaps components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: registering framework rmaps components
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component ppr
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component ppr register function successful
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component rank_file
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component rank_file has no register or open function
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component round_robin
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component round_robin register function successful
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: found loaded component seq
[hpc-vesta1.hpc.lan:2224251] pmix:mca: base: components_register: component seq register function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: opening rmaps components
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component ppr
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component ppr open function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component rank_file
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component round_robin
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component round_robin open function successful
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: found loaded component seq
[hpc-vesta1.hpc.lan:2224251] mca: base: components_open: component seq open function successful
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component ppr
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [ppr]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component rank_file
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [rank_file]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component round_robin
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [round_robin]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: checking available component seq
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:select: Querying component [seq]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0]: Final mapper priorities
[hpc-vesta1.hpc.lan:2224251]    Mapper: rank_file Priority: 100
[hpc-vesta1.hpc.lan:2224251]    Mapper: ppr Priority: 90
[hpc-vesta1.hpc.lan:2224251]    Mapper: seq Priority: 60
[hpc-vesta1.hpc.lan:2224251]    Mapper: round_robin Priority: 10
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm creating map
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] setup:vm: working unmanaged allocation
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] using dash_host
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] checking node hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm add new daemon [prterun-hpc-vesta1-2224251@0,1]

[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm assigning new daemon [prterun-hpc-vesta1-2224251@0,1] to node hpc-mn101
======================   ALLOCATED NODES   ======================
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: launching vm
    hpc-mn101: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: SLOTS_GIVEN
        aliases: NONE
=================================================================
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: local shell: 0 (bash)
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: assuming same remote shell as local shell
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: remote shell: 0 (bash)
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: final template argv:
        /usr/bin/ssh <template> PRTE_PREFIX=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1;export PRTE_PREFIX;LD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/bin/prted --debug-daemons --prtemca ess "env" --prtemca ess_base_nspace "prterun-hpc-vesta1-2224251@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21"
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh:launch daemon 0 not a child of mine
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: adding node hpc-mn101 to launch list
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: activating launch event
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: recording launch of daemon [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh hpc-mn101 PRTE_PREFIX=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1;export PRTE_PREFIX;LD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1/bin/prted --debug-daemons --prtemca ess "env" --prtemca ess_base_nspace "prterun-hpc-vesta1-2224251@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://185.155.95.146,192.168.0.4:41353:28,21"]
Daemon was launched on hpc-mn101 - beginning to initialize
[hpc-mn101.hpc.lan:2020523] mca: base: component_find: searching NULL for plm components
[hpc-mn101.hpc.lan:2020523] mca: base: find_dyn_components: checking NULL for plm components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: registering framework plm components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component ssh
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component ssh register function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: opening plm components
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component ssh
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component ssh open function successful
[hpc-mn101.hpc.lan:2020523] mca:base:select: Auto-selecting plm components
[hpc-mn101.hpc.lan:2020523] mca:base:select:(  plm) Querying component [ssh]
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh_lookup on agent ssh : rsh path NULL
[hpc-mn101.hpc.lan:2020523] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[hpc-mn101.hpc.lan:2020523] mca:base:select:(  plm) Selected component [ssh]
[hpc-mn101.hpc.lan:2020523] mca: base: component_find: searching NULL for rmaps components
[hpc-mn101.hpc.lan:2020523] mca: base: find_dyn_components: checking NULL for rmaps components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: registering framework rmaps components
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component ppr
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component ppr register function successful
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component rank_file
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component rank_file has no register or open function
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component round_robin
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component round_robin register function successful
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: found loaded component seq
[hpc-mn101.hpc.lan:2020523] pmix:mca: base: components_register: component seq register function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: opening rmaps components
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component ppr
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component ppr open function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component rank_file
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component round_robin
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component round_robin open function successful
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: found loaded component seq
[hpc-mn101.hpc.lan:2020523] mca: base: components_open: component seq open function successful
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component ppr
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [ppr]
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component rank_file
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [rank_file]
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component round_robin
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [round_robin]
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: checking available component seq
[hpc-mn101.hpc.lan:2020523] mca:rmaps:select: Querying component [seq]
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1]: Final mapper priorities
[hpc-mn101.hpc.lan:2020523]     Mapper: rank_file Priority: 100
[hpc-mn101.hpc.lan:2020523]     Mapper: ppr Priority: 90
[hpc-mn101.hpc.lan:2020523]     Mapper: seq Priority: 60
[hpc-mn101.hpc.lan:2020523]     Mapper: round_robin Priority: 10
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh_setup on agent ssh : rsh path NULL
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:base:receive start comm

-- HANGS HERE !! --

[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh: remote spawn called
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:ssh: remote spawn - have no children!
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch from daemon [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch from daemon [prterun-hpc-vesta1-2224251@0,1] on node hpc-mn101
[hpc-vesta1.hpc.lan:2224251] ALIASES FOR NODE hpc-mn101 (hpc-mn101)
[hpc-vesta1.hpc.lan:2224251]    ALIAS: hpc-mn101.hpc.lan
[hpc-vesta1.hpc.lan:2224251]    ALIAS: 12.12.12.1
[hpc-vesta1.hpc.lan:2224251]    ALIAS: 192.168.0.1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] RECEIVED TOPOLOGY SIG 8N:2S:16L3:96L2:96L1:96C:96H:0-95::x86_64:le FROM NODE hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch completed for daemon [prterun-hpc-vesta1-2224251@0,1] at contact [email protected];tcp://12.12.12.1,192.168.0.1:46399:24,21
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:orted_report_launch job prterun-hpc-vesta1-2224251@0 recvd 2 of 2 reported daemons
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setting slots for node hpc-vesta1 by core

======================   ALLOCATED NODES   ======================
    hpc-mn101: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: hpc-mn101.hpc.lan,12.12.12.1,192.168.0.1
=================================================================
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive processing msg
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive job launch command from [prterun-hpc-vesta1-2224251@0,0]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive adding hosts
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive calling spawn
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive done processing commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_job
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm_base:setup_vm NODE hpc-mn101 WAS NOT ADDED
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:setup_vm no new daemons required
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: mapping job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: setting mapping policies for job prterun-hpc-vesta1-2224251@1 inherit TRUE hwtcpus FALSE
[hpc-vesta1.hpc.lan:2224251] mca:rmaps mapping not given - using bycore
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] using dash_host hpc-mn101
[hpc-vesta1.hpc.lan:2224251] NODE hpc-vesta1 DOESNT MATCH NODE hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] node hpc-mn101 has 1 slots available
[hpc-vesta1.hpc.lan:2224251] AVAILABLE NODES FOR MAPPING:

[hpc-vesta1.hpc.lan:2224251]     node: hpc-mn101 daemon: 1 slots_available: 1
======================   ALLOCATED NODES   ======================
    hpc-mn101: slots=1 max_slots=0 slots_inuse=0 state=UP
[hpc-vesta1.hpc.lan:2224251] setdefaultbinding[316] binding not given - using bycore
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rf: job prterun-hpc-vesta1-2224251@1 not using rankfile policy
        aliases: hpc-mn101.hpc.lan,12.12.12.1,192.168.0.1
=================================================================
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:ppr: job prterun-hpc-vesta1-2224251@1 not using ppr mapper PPR NULL policy PPR NOTSET
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] rmaps:seq called on job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:seq: job prterun-hpc-vesta1-2224251@1 not using seq mapper
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr: mapping job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] using dash_host hpc-mn101
[hpc-vesta1.hpc.lan:2224251] NODE hpc-vesta1 DOESNT MATCH NODE hpc-mn101
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] node hpc-mn101 has 1 slots available
[hpc-vesta1.hpc.lan:2224251] AVAILABLE NODES FOR MAPPING:
[hpc-vesta1.hpc.lan:2224251]     node: hpc-mn101 daemon: 1 slots_available: 1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr:byobj mapping by Core for job prterun-hpc-vesta1-2224251@1 slots 1 num_procs 1
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr: found 96 Core objects on node hpc-mn101
[hpc-vesta1.hpc.lan:2224251] mca:rmaps:rr: assigning proc to object 0
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] get_avail_ncpus: node hpc-mn101 has 0 procs on it
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: compute bindings for job prterun-hpc-vesta1-2224251@1 with policy CORE:IF-SUPPORTED[1007]
[hpc-vesta1.hpc.lan:2224251] mca:rmaps: bind [prterun-hpc-vesta1-2224251@1,INVALID] with policy CORE:IF-SUPPORTED
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] BOUND PROC [prterun-hpc-vesta1-2224251@1,INVALID][hpc-mn101] TO package[0][core:0]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] complete_setup on job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:launch_apps for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:send launch msg for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted_cmd: received add_local_procs
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted_cmd: received add_local_procs
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive processing msg
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive local launch complete command from [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got local launch complete for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got local launch complete for vpid 0
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got local launch complete for vpid 0 state RUNNING
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive done processing commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:launch wiring up iof for job prterun-hpc-vesta1-2224251@1
hpc-mn101.hpc.lan
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive processing msg
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive update proc state command from [prterun-hpc-vesta1-2224251@0,1]
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got update_proc_state for job prterun-hpc-vesta1-2224251@1
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive got update_proc_state for vpid 0 pid 2020550 state NORMALLY TERMINATED exit_code 0
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive done processing commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:prted_cmd sending prted_exit commands
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted_cmd: received exit cmd
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] prted_cmd: exit cmd, 1 routes still exist
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted_cmd: received exit cmd
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] prted_cmd: all routes and children gone - exiting
[hpc-mn101.hpc.lan:2020523] [prterun-hpc-vesta1-2224251@0,1] plm:base:receive stop comm
[hpc-mn101.hpc.lan:2020523] mca: base: close: component ssh closed
[hpc-mn101.hpc.lan:2020523] mca: base: close: unloading component ssh
[hpc-vesta1.hpc.lan:2224251] [prterun-hpc-vesta1-2224251@0,0] plm:base:receive stop comm
[hpc-vesta1.hpc.lan:2224251] mca: base: close: component ssh closed
[hpc-vesta1.hpc.lan:2224251] mca: base: close: unloading component ssh
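
For completeness, the external PRRTE/PMIx attempt mentioned in step 3 used a configure line roughly like the following (the external install paths are placeholders):

./configure --prefix=/nfs/mesonet/sw/openmpi/openmpi-5.0.7.rocm-5.7.1 \
            --enable-prte-prefix-by-default \
            --with-pmix=/path/to/external/pmix \
            --with-prrte=/path/to/external/prrte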

Thanks in advance for your help.
