Using co-locales on single-node ofi runs may result in "Function not implemented" errors #28373

@bradcray

Today, we're finding that single-node co-locale runs such as -nl 1x4 can fail with errors of the form:

internal error: 0: comm-ofi.c:2209: OFI error: fi_domain(ofi_fabric, ofi_info, &ofi_domain, ((void*)0)): Function not implemented

By contrast, -nl 1, -nl 4, -nl 2x4, and -nl 4x4 all work fine, so the error seems specific to single-node co-locale runs.

That said, the behavior also seems to depend on the version of libfabric used. Specifically, we're seeing this error when using libfabric versions:

  • 1.22.0
  • 2.2.0rc1

But things work fine when using:

  • 1.20.1
  • 2.3.1
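
For triage, the observations above can be encoded as a small lookup. This is only a sketch in Python, not part of Chapel or libfabric: the helper names `parse_nl` and `matches_reported_failure` are hypothetical, it handles only the `N` and `NxM` forms of `-nl` shown in this report, and it assumes that any `1xM` value behaves like the `1x4` case that was actually tested. Versions not listed above are reported as unknown rather than guessed.

```python
import re

# libfabric versions taken directly from this report; nothing else is known.
REPORTED_FAILING = {"1.22.0", "2.2.0rc1"}
REPORTED_WORKING = {"1.20.1", "2.3.1"}


def parse_nl(arg):
    """Parse an -nl value such as '4' or '1x4' into (nodes, locales_per_node).

    A plain integer has no co-locale part, so locales_per_node is None.
    Only the two forms that appear in the report are accepted.
    """
    m = re.fullmatch(r"(\d+)(?:x(\d+))?", arg)
    if m is None:
        raise ValueError(f"unrecognized -nl value: {arg!r}")
    nodes = int(m.group(1))
    per_node = int(m.group(2)) if m.group(2) else None
    return nodes, per_node


def matches_reported_failure(nl_arg, libfabric_version):
    """Return True or False when the report covers this case, else None.

    Assumption: every single-node co-locale value (1xM) is treated like the
    tested 1x4 case. Versions outside the two sets above yield None.
    """
    if libfabric_version not in REPORTED_FAILING | REPORTED_WORKING:
        return None  # version not mentioned in the report
    nodes, per_node = parse_nl(nl_arg)
    single_node_colocale = nodes == 1 and per_node is not None
    return single_node_colocale and libfabric_version in REPORTED_FAILING
```

For example, `matches_reported_failure("1x4", "1.22.0")` matches the failing pattern, while the same `-nl` value with 2.3.1, or `-nl 2x4` with 1.22.0, does not.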
