Closed
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Branch main, 10 March 2025
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed via Spack, using OSS libfabric with CXI support. Configure arguments:
--enable-shared --disable-silent-rules --disable-sphinx --enable-builtin-atomics --disable-static --with-slingshot --enable-mpi1-compatibility --without-psm --without-psm2 --without-fca --without-cma --without-knem --with-xpmem=/usr --without-hcoll --without-mxm --with-ofi=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/libfabric-main-iegtmu74ojgb2pvywqfrlzvedlcz7cps --without-ucc --without-ucx --without-verbs --with-cray-xpmem --without-sge --without-alps --without-loadleveler --without-tm --with-slurm --without-lsf --disable-memchecker --with-libevent=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-7.5.0/libevent-2.1.12-p5qzh7hez5qdbtl7j3avjhf4mry5fm4n --without-lustre --with-pmix=internal --with-zlib=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-7.5.0/zlib-ng-2.2.3-o7xbhxhnqpk5ljcpjmgple6rv3i75z2h --with-hwloc=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/hwloc-2.11.1-3jxzkohocpqjyvd2irytycprfi2bom5q --disable-java --disable-mpi-java --disable-io-romio --with-gpfs=no --enable-dlopen --with-cuda=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/cuda-12.8.0-ne7ulo7g6hhe7dv5nhh4nxchdenls3r2 --with-cuda-libdir=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/cuda-12.8.0-ne7ulo7g6hhe7dv5nhh4nxchdenls3r2/lib64/stubs --enable-wrapper-rpath --disable-wrapper-runpath --with-wrapper-ldflags=-Wl,-rpath,/afs/psi.ch/sys/spack/develop/opt/spack/unstable/linux-sles15-aarch64/gcc-7.5.0/gcc-14.2.0-tln2ck4lolcipi2fj2klu5dei3oac4sv/lib/gcc/aarch64-unknown-linux-gnu/14.2.0 -Wl,-rpath,/afs/psi.ch/sys/spack/develop/opt/spack/unstable/linux-sles15-aarch64/gcc-7.5.0/gcc-14.2.0-tln2ck4lolcipi2fj2klu5dei3oac4sv/lib64 CFLAGS=-DYY_BUF_SIZE=1048576 --disable-debug
spack spec:
- scyqclc openmpi@main%[email protected]+atomics+cuda~debug~gpfs~internal-hwloc~internal-libevent+internal-pmix~java~lustre~memchecker~openshmem~romio+rsh~static~two_level_namespace+vt+wrapper-rpath build_system=autotools cuda_arch=90 fabrics=ofi,xpmem romio-filesystem=none schedulers=slurm arch=linux-sles15-aarch64
[+] mcdzcmr ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] btyzacb ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] ne7ulo7 ^[email protected]%[email protected]~allow-unsupported-compilers~dev build_system=generic arch=linux-sles15-aarch64
[+] gzc3f4t ^[email protected]%[email protected]~http+pic~python+shared build_system=autotools arch=linux-sles15-aarch64
[+] bu4jqoi ^[email protected]%[email protected] build_system=autotools libs=shared,static arch=linux-sles15-aarch64
[+] 5ss23k5 ^[email protected]%[email protected]+internal_glib build_system=autotools arch=linux-sles15-aarch64
[e] fkyyhdc ^[email protected]%[email protected]~pic build_system=autotools libs=shared,static arch=linux-sles15-aarch64
[+] nyb2vfy ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[e] 3egpojh ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] hpibhrn ^gnuconfig@2024-07-27%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] 3jxzkoh ^[email protected]%[email protected]~cairo+cuda~gl~level_zero~libudev+libxml2~nvml~opencl+pci~rocm build_system=autotools cuda_arch=90 libs=shared,static arch=linux-sles15-aarch64
[+] 4eajtzs ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] txi65ob ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] nwu26be ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] wfg3cd7 ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] zia4ebj ^[email protected]%[email protected]~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-sles15-aarch64
[+] p5qzh7h ^[email protected]%[email protected]+openssl build_system=autotools arch=linux-sles15-aarch64
[+] m3gwtgf ^[email protected]%[email protected]~docs+shared build_system=generic certs=mozilla arch=linux-sles15-aarch64
[+] 3aq2syu ^ca-certificates-mozilla@2023-05-30%[email protected] build_system=generic arch=linux-sles15-aarch64
[e] eyczfjv ^[email protected]%[email protected]+cpanm+opcode+open+shared+threads build_system=generic patches=0eac10e,8cf4302 arch=linux-sles15-aarch64
- zgkq6vw ^libfabric@main%[email protected]+cuda~debug~kdreg~level_zero+uring build_system=autotools cuda_arch=90 fabrics=cxi,sockets,tcp,udp,xpmem arch=linux-sles15-aarch64
[+] u5d4zw4 ^[email protected]%[email protected]~gssapi~ldap~libidn2~librtmp~libssh~libssh2+nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-aarch64
[+] 6mvnrnk ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] 5t2hvib ^[email protected]%[email protected]~ipo build_system=cmake build_type=Release generator=make arch=linux-sles15-aarch64
[+] u2nmjzn ^[email protected]%[email protected]~doc+ncurses+ownlibs~qtgui build_system=generic build_type=Release arch=linux-sles15-aarch64
- nhjhwto ^libcxi@main%[email protected]+cuda~level_zero~rocm build_system=autotools arch=linux-sles15-aarch64
- 5wgs3er ^cassini-headers@main%[email protected] build_system=generic arch=linux-sles15-aarch64
- u6zzijb ^cxi-driver@main%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] 7cnxi2c ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] rq33jed ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] dcjewuo ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] l6mjl5c ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] p7ozge3 ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] trzm7v5 ^[email protected]%[email protected] build_system=autotools patches=440b954 arch=linux-sles15-aarch64
[+] jppuqwv ^[email protected]%[email protected]+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-sles15-aarch64
[+] ivhh3c7 ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] oj3ovvr ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] laxgbis ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] fmk7hej ^[email protected]%[email protected]~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-sles15-aarch64
- pjzsvu4 ^[email protected]%[email protected]~strip~system_install~useroot+utils build_system=meson buildtype=release default_library=shared arch=linux-sles15-aarch64
[+] yczbssx ^[email protected]%[email protected] build_system=python_pip patches=0f0b1bd arch=linux-sles15-aarch64
[+] zvoecxo ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] qqyoi74 ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] wk6kswr ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] q7jjsue ^[email protected]%[email protected]+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic arch=linux-sles15-aarch64
[+] 3hlzxo5 ^[email protected]%[email protected]+libbsd build_system=autotools arch=linux-sles15-aarch64
[+] lducxxr ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] qerkf3p ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] ds2kwc3 ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] mfeth7l ^[email protected]%[email protected] build_system=autotools patches=1ea4349,24f587b,3d9885e,5911a5b,622ba38,6c8adf8,758e2ec,79572ee,a177edc,bbf97f1,c7b45ff,e0013d9,e065038 arch=linux-sles15-aarch64
[+] teiwdpd ^[email protected]%[email protected]+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-aarch64
[+] suaqfjx ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] hjcwama ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] saybo2v ^[email protected]%[email protected]~re2c build_system=generic arch=linux-sles15-aarch64
[+] zq5s6y2 ^[email protected]%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] 3bgmubm ^[email protected]%[email protected]~bz2~crypt+ctypes~dbm~debug+libxml2+lzma~nis~optimizations+pic~pyexpat+pythoncmd~readline+shared~sqlite3~ssl~tkinter~uuid+zlib build_system=generic patches=0d98e93,4c24573,ebdca64,f2fd060 arch=linux-sles15-aarch64
[e] gnju5co ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+pic+shared+tar+xz build_system=autotools arch=linux-sles15-aarch64
[+] s2jrvfo ^[email protected]%[email protected]+compat+new_strategies+opt+pic+shared build_system=autotools arch=linux-sles15-aarch64
[+] qlqjhch ^gnuconfig@2022-09-17%[email protected] build_system=generic arch=linux-sles15-aarch64
[+] xnbrchw ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] 3aey2ot ^[email protected]%[email protected]+lex~nls build_system=autotools arch=linux-sles15-aarch64
[+] 3e23eaj ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] 7sa6suu ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] pnnsys3 ^lm-sensors@3-6-0%[email protected] build_system=makefile arch=linux-sles15-aarch64
[+] sawly4e ^[email protected]%[email protected]~color build_system=autotools arch=linux-sles15-aarch64
[+] ghouivi ^[email protected]%[email protected]~sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-sles15-aarch64
[+] r2g4qhm ^[email protected]%[email protected]+lex~nls build_system=autotools arch=linux-sles15-aarch64
[+] 5cb63ad ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] 3gxsior ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] l2qugjv ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] tvosith ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] 2xmogbm ^[email protected]%[email protected]+gssapi build_system=autotools arch=linux-sles15-aarch64
[+] 7cgifzm ^[email protected]%[email protected]+shared build_system=autotools arch=linux-sles15-aarch64
[+] q7hhrig ^[email protected]%[email protected]~color build_system=autotools arch=linux-sles15-aarch64
[+] obetosr ^[email protected]%[email protected] build_system=autotools arch=linux-sles15-aarch64
[+] w6hnyfk ^[email protected]%[email protected]~obsolete_api build_system=autotools patches=4885da3 arch=linux-sles15-aarch64
[+] iacvnhj ^[email protected]%[email protected]+internal_glib build_system=autotools arch=linux-sles15-aarch64
[e] 2d7jkg5 ^[email protected]%[email protected]+cgroup~cray_shasta+gtk~hdf5+hwloc+mariadb+nvml+pam+pmix+readline+restd~rsmi build_system=autotools sysconfdir=PREFIX/etc arch=linux-sles15-aarch64
[e] znxqplr ^[email protected]%[email protected]+kernel-module build_system=autotools arch=linux-sles15-aarch64
Please describe the system on which you are running
- Operating system/version: SLES15 14.21-150500.55.65_13.0.73-cray_shasta_c_64k aarch64
- Computer hardware: Grace Hopper GPU, aarch64
- Network type: CXI, SHS11.1
Details of the problem
Multi-node jobs run without any problem. Multi-task jobs on a single node fail with the following error:
[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm
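The message above says that the two ranks on the same node ended up with different PMLs (ob1 on one, cm on the other). For anyone who wants to scan longer logs for the same symptom, here is a minimal sketch that pulls the relevant fields out of such a line. The regular expression and the function name `parse_pml_mismatch` are my own and only modelled on the single line shown here; other Open MPI versions may format the message differently.

```python
import re

# Pattern modelled on the one mismatch line in this report (an assumption,
# not an official log format).
PML_MISMATCH = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"(?P<proc>\[\[\d+,\d+\],\d+\]) selected pml (?P<local>\w+), "
    r"but peer (?P<peer>\[\[\d+,\d+\],\d+\]) on (?P<peer_host>\S+) "
    r"selected pml (?P<remote>\w+)"
)


def parse_pml_mismatch(line):
    """Return the fields of a PML mismatch message, or None if no match."""
    match = PML_MISMATCH.search(line)
    return match.groupdict() if match else None


line = (
    "[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, "
    "but peer [[46903,1],0] on gpu001 selected pml cm"
)
info = parse_pml_mismatch(line)
print(f"pid {info['pid']} chose {info['local']}, peer chose {info['remote']}")
```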
ompi_info reports that the btl ofi component is present, but the job still fails:
ompi_info
...
MCA btl: ofi (MCA v2.1.0, API v3.3.0, Component v5.1.0)
shell$ mpirun --mca btl_base_verbose 100 -np 2 osu_bw -d cuda D D
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: registering framework btl components
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component self
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component self register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component ofi
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component ofi register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component sm
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component sm register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component tcp
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component tcp register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component smcuda
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component smcuda register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: opening btl components
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component self
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component self open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component ofi
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component ofi open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component sm
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component sm open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component tcp
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component tcp open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component smcuda
[gpu001.merlin7.psi.ch:177843] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component smcuda open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: registering framework btl components
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component self
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component self register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component ofi
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component ofi register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component sm
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component sm register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component tcp
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component tcp register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component smcuda
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component smcuda register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: opening btl components
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component self
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component self open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component ofi
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component ofi open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component sm
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component sm open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component tcp
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component tcp open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component smcuda
[gpu001.merlin7.psi.ch:177842] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component smcuda open function successful
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: gpu001
Location: mtl_ofi_component.c:1007
Error: Function not implemented (70368744177702)
--------------------------------------------------------------------------
[gpu001.merlin7.psi.ch:177843] select: initializing btl component self
[gpu001.merlin7.psi.ch:177843] select: init of component self returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component ofi
[gpu001.merlin7.psi.ch:177842] select: initializing btl component self
[gpu001.merlin7.psi.ch:177842] select: init of component self returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component ofi
[gpu001.merlin7.psi.ch:177843] select: init of component ofi returned failure
[gpu001.merlin7.psi.ch:177842] select: init of component ofi returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component sm
[gpu001.merlin7.psi.ch:177842] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component tcp
[gpu001.merlin7.psi.ch:177842] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[gpu001.merlin7.psi.ch:177842] btl: tcp: Found match: 127.0.0.1 (lo)
[gpu001.merlin7.psi.ch:177842] btl: tcp: Using interface: sppp
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x323ce000: if nmn0 kidx 2 cnt 0 addr 10.100.36.33 IPv4 bw 1000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32860a90: if hsn0 kidx 3 cnt 0 addr 172.30.138.1 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32861280: if hsn2 kidx 4 cnt 0 addr 172.30.138.3 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32861b70: if hsn3 kidx 5 cnt 0 addr 172.30.138.4 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32862380: if hsn1 kidx 6 cnt 0 addr 172.30.138.2 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: Attempting to bind to AF_INET port 1024
[gpu001.merlin7.psi.ch:177842] btl:tcp: Successfully bound to AF_INET port 1024
[gpu001.merlin7.psi.ch:177842] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 0 2 IPv4 10.100.36.33
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 1 3 IPv4 172.30.138.1
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 2 4 IPv4 172.30.138.3
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 3 5 IPv4 172.30.138.4
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 4 6 IPv4 172.30.138.2
[gpu001.merlin7.psi.ch:177842] select: init of component tcp returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component smcuda
[gpu001.merlin7.psi.ch:177842] select: init of component smcuda returned success
[gpu001.merlin7.psi.ch:177843] mca: base: close: component ofi closed
[gpu001.merlin7.psi.ch:177843] mca: base: close: unloading component ofi
[gpu001.merlin7.psi.ch:177843] select: initializing btl component sm
[gpu001.merlin7.psi.ch:177843] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component tcp
[gpu001.merlin7.psi.ch:177843] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[gpu001.merlin7.psi.ch:177843] btl: tcp: Found match: 127.0.0.1 (lo)
[gpu001.merlin7.psi.ch:177843] btl: tcp: Using interface: sppp
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a72a80: if nmn0 kidx 2 cnt 0 addr 10.100.36.33 IPv4 bw 1000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a72f80: if hsn0 kidx 3 cnt 0 addr 172.30.138.1 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a73660: if hsn2 kidx 4 cnt 0 addr 172.30.138.3 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a6d130: if hsn3 kidx 5 cnt 0 addr 172.30.138.4 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a6da20: if hsn1 kidx 6 cnt 0 addr 172.30.138.2 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: Attempting to bind to AF_INET port 1024
[gpu001.merlin7.psi.ch:177843] btl:tcp: Attempting to bind to AF_INET port 1025
[gpu001.merlin7.psi.ch:177843] btl:tcp: Successfully bound to AF_INET port 1025
[gpu001.merlin7.psi.ch:177843] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 0 2 IPv4 10.100.36.33
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 1 3 IPv4 172.30.138.1
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 2 4 IPv4 172.30.138.3
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 3 5 IPv4 172.30.138.4
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 4 6 IPv4 172.30.138.2
[gpu001.merlin7.psi.ch:177843] select: init of component tcp returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component smcuda
[gpu001.merlin7.psi.ch:177843] select: init of component smcuda returned success
[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_mpi_instance_init failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[gpu001:00000] *** An error occurred in MPI_Init
[gpu001:00000] *** reported by process [3073835009,281470681743361]
[gpu001:00000] *** on a NULL communicator
[gpu001:00000] *** Unknown error
[gpu001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gpu001:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
It seems very similar to issue #12038, but since I am using the main branch, this should have been fixed in the meantime...
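One detail about the fi_domain failure in the log: the numeric code printed next to "Function not implemented" is 70368744177702, which equals 2^46 + 38. On Linux, 38 is the value of ENOSYS, whose message text is exactly "Function not implemented". The sketch below only does that arithmetic; why the upper bit is set is not something the log explains, so treat any interpretation of it as a guess.

```python
import errno
import os

# Value printed in the "Error:" line of the fi_domain help message.
reported = 70368744177702

low = reported & 0xFFFFFFFF   # low 32 bits of the reported value
high = reported >> 32         # whatever is left above bit 31

# errno numbers are platform-specific; 38 maps to ENOSYS on Linux.
print("low 32 bits :", low, errno.errorcode.get(low), os.strerror(low))
print("upper bits  :", hex(high))
print("2**46 + 38  :", reported == (1 << 46) + 38)
```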
Thanks a lot for any help in advance!
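In case it helps with triage, the verbose output can be condensed per process. The sketch below groups the "select: init of component ... returned ..." lines by PID; on the log above it would show that ofi init failed in process 177843 but succeeded in 177842. The regular expression and the helper name `summarize_btl_init` are assumptions based only on the lines in this report, and the sample is a three-line excerpt of that log.

```python
import re
from collections import defaultdict

# Pattern modelled on the btl_base_verbose lines in this report (assumption).
SELECT_RESULT = re.compile(
    r"\[[^:\]]+:(?P<pid>\d+)\] select: init of component (?P<comp>\w+) "
    r"returned (?P<result>success|failure)"
)


def summarize_btl_init(log_text):
    """Map each PID to {component: 'success' | 'failure'}."""
    summary = defaultdict(dict)
    for match in SELECT_RESULT.finditer(log_text):
        summary[match["pid"]][match["comp"]] = match["result"]
    return dict(summary)


# Excerpt copied from the log in this report.
sample = """\
[gpu001.merlin7.psi.ch:177843] select: init of component self returned success
[gpu001.merlin7.psi.ch:177843] select: init of component ofi returned failure
[gpu001.merlin7.psi.ch:177842] select: init of component ofi returned success
"""

result = summarize_btl_init(sample)
for pid, comps in sorted(result.items()):
    failed = [name for name, res in comps.items() if res == "failure"]
    print(pid, "failed components:", ", ".join(failed) if failed else "none")
```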