Closed
Description
Background information
I am trying to run Open MPI 5.0 on Crusher/Frontier but I get the following error during MPI_Init:
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: crusher051
Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:999
Error: Function not implemented (38)
--------------------------------------------------------------------------
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5.0 from the release tarball.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
On Crusher I ran configure as follows:
../configure --disable-debug --with-slurm --prefix=$HOME/opt-crusher/openmpi-5.0 --with-ofi=/opt/cray/libfabric/1.15.2.0/ --with-xpmem=/opt/cray/xpmem/default --with-rocm=/opt/rocm-5.3.0 --with-libevent=internal CC=gcc CXX=g++
Please describe the system on which you are running
libfabric version: 1.15.2.0 (default module)
Details of the problem
Running OSU benchmark built against this installation on Crusher:
> mpirun -np 2 $HOME/src/osu-micro-benchmarks-7.1-1/build_crusher_ompi5/c/mpi/collective/blocking/osu_reduce
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: crusher001
Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:999
Error: Function not implemented (38)
--------------------------------------------------------------------------
If I run with FI_LOG_LEVEL=Debug I get a couple of lines like this:
libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Supported: FI_ADDR_CXI_COMPAT
libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Requested: FI_ADDR_CXI
libfabric:15824:1698769409::cxi:core:ofi_check_info():1099<info> address format not supported
and
libfabric:15824:1698769409::cxi:fabric:cxip_gen_auth_key_ss_env_get_vni():1232<info> crusher056: SLINGSHOT_VNIS not found
libfabric:15824:1698769409::cxi:domain:cxip_domain():1238<warn> crusher056: cxip_gen_auth_key failed: -38:Function not implemented
libfabric:15824:1698769409::core:core:fi_fabric_():1374<info> Opened fabric: cxi
and
libfabric:15824:1698769409:ofi_rxm:core:core:fi_getinfo_():1176<info> fi_getinfo: provider cxi returned -61 (No data available)
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1084<info> Unsupported capabilities
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_COLLECTIVE, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1084<info> Unsupported capabilities
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():681<info> unsupported endpoint type
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Supported: FI_EP_DGRAM
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Requested: FI_EP_RDM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider udp returned -61 (No data available)
I am not sure whether these messages are relevant or whether they are the right thing to look at. I can post the full log if necessary.
Is there any way to get OMPI 5.0 working with this libfabric?
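In case it helps with triage, here is a rough sketch of how the FI_LOG_LEVEL=Debug output could be condensed per provider. It is only a helper for reading the log, not part of Open MPI or libfabric. The function name `summarize` and the two regular expressions are my own and assume the line layout seen in the excerpts above; other libfabric versions may format their messages differently.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpts in this report:
#   ... fi_getinfo: provider <name> returned <code> (<message>)
GETINFO_RE = re.compile(r"fi_getinfo: provider (\S+) returned (-?\d+) \(([^)]*)\)")
#   libfabric:<pid>:<time>::<provider>:...<level> Supported|Requested: <items>
DETAIL_RE = re.compile(
    r"libfabric:\d+:\d+:+(\w+):\S*<\w+> (Supported|Requested): (.*)"
)

SAMPLE = """\
libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Supported: FI_ADDR_CXI_COMPAT
libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Requested: FI_ADDR_CXI
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_COLLECTIVE, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Supported: FI_EP_DGRAM
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Requested: FI_EP_RDM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider udp returned -61 (No data available)
"""


def summarize(log_text):
    """Return (returned, missing) dictionaries keyed by provider name.

    returned[p] lists the (code, message) pairs reported by fi_getinfo.
    missing[p] holds requested items absent from the preceding
    "Supported:" line of the same provider.
    """
    pending = {}                 # provider -> items from the last "Supported:" line
    missing = defaultdict(set)
    returned = defaultdict(list)
    for line in log_text.splitlines():
        m = GETINFO_RE.search(line)
        if m:
            returned[m.group(1)].append((int(m.group(2)), m.group(3)))
            continue
        m = DETAIL_RE.search(line)
        if m:
            provider, kind, items = m.groups()
            values = {v.strip() for v in items.split(",") if v.strip()}
            if kind == "Supported":
                pending[provider] = values
            else:
                # Pair this "Requested:" line with the last "Supported:" line.
                missing[provider] |= values - pending.pop(provider, set())
    return dict(returned), dict(missing)


returned, missing = summarize(SAMPLE)
for provider in sorted(missing):
    print(provider, "is missing:", ", ".join(sorted(missing[provider])))
```

To run this against a complete log, the contents of the log file can be passed to `summarize` in place of `SAMPLE`. Whether the missing items listed this way are the actual cause of the fi_domain failure is a separate question; the sketch only shortens the log.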