Today, we're finding that single-node co-locale runs such as -nl 1x4 can fail with errors of the form:
internal error: 0: comm-ofi.c:2209: OFI error: fi_domain(ofi_fabric, ofi_info, &ofi_domain, ((void*)0)): Function not implemented
whereas -nl 1, -nl 4, -nl 2x4, and -nl 4x4 all work fine. That is, the error seems specific to single-node co-locale runs.
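To make the failing pattern explicit, here is a small Python sketch that classifies the -nl values mentioned above. It assumes that NxM means N nodes with M co-locales per node, and that a bare N means N nodes with one locale each. The helper names (parse_nl, is_single_node_colocale) are hypothetical and are not part of Chapel or its launchers; the outcomes in the table are copied from this report, not measured by the script.

```python
def parse_nl(spec):
    """Split a -nl argument into (nodes, locales_per_node).

    Assumption: "NxM" means N nodes with M co-locales each,
    and a bare "N" means N nodes with one locale per node.
    """
    nodes, sep, per_node = spec.partition("x")
    return int(nodes), (int(per_node) if sep else 1)


def is_single_node_colocale(spec):
    """True for runs that place several co-locales on exactly one node."""
    nodes, per_node = parse_nl(spec)
    return nodes == 1 and per_node > 1


# Outcomes as described in this report (not produced by this script).
observed = {"1": "ok", "4": "ok", "1x4": "error", "2x4": "ok", "4x4": "ok"}

# The configurations the hypothesis predicts should fail.
predicted_failures = [s for s in observed if is_single_node_colocale(s)]

if __name__ == "__main__":
    for spec, outcome in observed.items():
        nodes, per_node = parse_nl(spec)
        print(f"-nl {spec}: nodes={nodes} per_node={per_node} observed={outcome}")
```

Under these assumptions, the only configuration flagged is the single-node co-locale one, which matches the observed failure.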
That said, the behavior also seems to depend on the version of libfabric used. Specifically, we're seeing this error when using libfabric versions:
But things work fine when using: