mtl/ofi: call to fi_domain fails on Crusher/Frontier #12038

@devreal

Background information

I am trying to run Open MPI 5.0 on Crusher/Frontier but I get the following error during MPI_Init:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: crusher051
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:999
  Error: Function not implemented (38)
--------------------------------------------------------------------------

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 5.0 from the release tarball.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

On Crusher I ran configure with the following options:

../configure --disable-debug --with-slurm --prefix=$HOME/opt-crusher/openmpi-5.0 --with-ofi=/opt/cray/libfabric/1.15.2.0/ --with-xpmem=/opt/cray/xpmem/default --with-rocm=/opt/rocm-5.3.0 --with-libevent=internal CC=gcc CXX=g++

Please describe the system on which you are running

libfabric version: 1.15.2.0 (default module)


Details of the problem

Running an OSU benchmark built against this installation on Crusher fails with the same error:

> mpirun -np 2 $HOME/src/osu-micro-benchmarks-7.1-1/build_crusher_ompi5/c/mpi/collective/blocking/osu_reduce
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: crusher001
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:999
  Error: Function not implemented (38)
--------------------------------------------------------------------------

If I run with FI_LOG_LEVEL=Debug I get a couple of lines like this:

libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Supported: FI_ADDR_CXI_COMPAT
libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Requested: FI_ADDR_CXI
libfabric:15824:1698769409::cxi:core:ofi_check_info():1099<info> address format not supported

and

libfabric:15824:1698769409::cxi:fabric:cxip_gen_auth_key_ss_env_get_vni():1232<info> crusher056: SLINGSHOT_VNIS not found
libfabric:15824:1698769409::cxi:domain:cxip_domain():1238<warn> crusher056: cxip_gen_auth_key failed: -38:Function not implemented
libfabric:15824:1698769409::core:core:fi_fabric_():1374<info> Opened fabric: cxi

and

libfabric:15824:1698769409:ofi_rxm:core:core:fi_getinfo_():1176<info> fi_getinfo: provider cxi returned -61 (No data available)
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1084<info> Unsupported capabilities
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_COLLECTIVE, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1084<info> Unsupported capabilities
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():681<info> unsupported endpoint type
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Supported: FI_EP_DGRAM
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Requested: FI_EP_RDM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider udp returned -61 (No data available)
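
To pick the relevant lines out of the full debug output, a small filter along these lines might help. It is only a sketch: it assumes the line layout seen in the excerpts above (`libfabric:<pid>:<timestamp>:<prefix>:<provider>:<subsystem>:<function>():<line><level> message`, where the prefix field is usually empty), and the names `parse_libfabric_log` and `records_at_level` are hypothetical helpers, not part of libfabric or Open MPI. Because output from different writers can end up fused on one line, a record is treated as ending at the next `libfabric:<pid>:<timestamp>:` marker or at the end of the line.

```python
import re

# Layout assumed from the log excerpts in this report, not from libfabric docs.
RECORD_RE = re.compile(
    r"libfabric:(?P<pid>\d+):(?P<ts>\d+):(?P<prefix>[^:\n]*):"
    r"(?P<provider>[^:\n]+):(?P<subsys>[^:\n]+):(?P<func>\w+)\(\):"
    r"(?P<line>\d+)<(?P<level>\w+)>[ \t]*"
    # Message runs up to the next record marker or the end of the line.
    r"(?P<msg>.*?)(?=libfabric:\d+:\d+:|$)",
    re.MULTILINE,
)


def parse_libfabric_log(text):
    """Return one dict per log record found in text (fused records are split)."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]


def records_at_level(text, level="warn"):
    """Keep only the records with the given level, e.g. 'warn' or 'info'."""
    return [r for r in parse_libfabric_log(text) if r["level"] == level]
```

For example, `records_at_level(log_text, "warn")` applied to the excerpts above would keep the `cxip_domain()` record reporting `cxip_gen_auth_key failed`, which is the only `<warn>` line among them.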

I am not sure whether these lines help or whether they are the right thing to look at. I can post the full log if necessary.

Is there any way to get OMPI 5.0 working with this libfabric?
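
One thing that could be checked first, going by the `SLINGSHOT_VNIS not found` line in the debug output, is whether that variable is present in the environment of the launched processes. The following is a minimal sketch for that check. `SLINGSHOT_VNIS` is the only variable named in the log; the helper name `missing_env_vars` is hypothetical, and the log excerpt does not establish whether setting the variable would be sufficient or which launcher is expected to provide it.

```python
import os
import socket


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Only SLINGSHOT_VNIS appears in the debug output; extend the list if the
# full log names further variables.
CHECKED_VARS = ["SLINGSHOT_VNIS"]

if __name__ == "__main__":
    missing = missing_env_vars(CHECKED_VARS)
    status = "missing: " + ", ".join(missing) if missing else "all set"
    print(f"{socket.gethostname()}: {status}")
```

Saved under a name of your choice and started through the same launcher as the benchmark, this prints one line per process, so differences between launch methods or nodes would show up directly.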
