
Segfault on Cray HPE system #12913

@angainor

Hi,

I compiled Open MPI v5.0.5 on LUMI (a Cray HPE SS11 system with AMD CPUs and GPUs). I used the PrgEnv-gnu/8.5.0 environment and configured the build as follows:

./configure --prefix=/users/makrotki/software/openmpi5 --with-ofi=/opt/cray/libfabric/1.15.2.0/
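A quick sanity check of such a build is to confirm that the OFI components were actually compiled in and to see which libfabric providers are visible on the node. The sketch below assumes that `ompi_info` sits under the install prefix and that `fi_info` is shipped under `bin/` of the libfabric directory passed to `--with-ofi`; both locations are assumptions and may differ on a given system. `check_tool` is a hypothetical helper introduced only for this example.

```shell
#!/bin/sh
# Sketch: check that the OFI components were built and list libfabric providers.
# PREFIX and LIBFABRIC default to the paths from the configure line above;
# the bin/ locations below are assumptions, adjust them as needed.
PREFIX="${PREFIX:-/users/makrotki/software/openmpi5}"
LIBFABRIC="${LIBFABRIC:-/opt/cray/libfabric/1.15.2.0}"

# Hypothetical helper: report whether an executable exists at the given path.
check_tool() {
    if [ -x "$1" ]; then
        echo "found: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# List the Open MPI components whose description mentions OFI.
if check_tool "$PREFIX/bin/ompi_info"; then
    "$PREFIX/bin/ompi_info" | grep -i ofi || echo "no OFI components listed"
fi

# List the providers that libfabric itself reports.
if check_tool "$LIBFABRIC/bin/fi_info"; then
    "$LIBFABRIC/bin/fi_info" -l
fi
```

If either tool is reported as missing, set `PREFIX` or `LIBFABRIC` in the environment to the correct location before running the script.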

I ran some OSU benchmarks and things generally look good: point-to-point tests yield the same performance as Cray MPI. However, I stumbled upon a segfault in MPI_Init. With a single compute node allocated through Slurm, I ran:

~/software/openmpi5/bin/mpirun -np 2 ./osu_barrier
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007955
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007955:08519] *** Process received signal ***
[nid007955:08519] Signal: Segmentation fault (11)
[nid007955:08519] Signal code: Address not mapped (1)
[nid007955:08519] Failing at address: 0x140074656e7a
[nid007955:08519] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14f3d4b66910]
[nid007955:08519] [ 1] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3d0a6)[0x14f3cbe4e0a6]
[nid007955:08519] [ 2] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3cfeb)[0x14f3cbe4dfeb]
[nid007955:08519] [ 3] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x4d7ba)[0x14f3cbe5e7ba]
[nid007955:08519] [ 4] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(fi_fabric+0xa2)[0x14f3cbe2a172]
[nid007955:08519] [ 5] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(+0xa3db4)[0x14f3cbfb4db4]
[nid007955:08519] [ 6] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(mca_btl_base_select+0x14d)[0x14f3cbfa1ddd]
[nid007955:08519] [ 7] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x14f3d503d0c2]
[nid007955:08519] [ 8] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x14f3d503ae54]
[nid007955:08519] [ 9] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x27d34a)[0x14f3d51c634a]
[nid007955:08519] [10] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_pml_base_select+0x1ce)[0x14f3d51c287e]
[nid007955:08519] [11] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x9a92a)[0x14f3d4fe392a]
[nid007955:08519] [12] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x14f3d4fe4081]
[nid007955:08519] [13] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x14f3d4fdb8b6]
[nid007955:08519] [14] /users/makrotki/software/openmpi5/lib/libmpi.so.40(MPI_Init+0x5e)[0x14f3d500d46e]
[nid007955:08519] [15] ./osu_barrier[0x40675d]
[nid007955:08519] [16] ./osu_barrier[0x402810]
[nid007955:08519] [17] /lib64/libc.so.6(__libc_start_main+0xef)[0x14f3d498e24d]
[nid007955:08519] [18] ./osu_barrier[0x402d7a]
[nid007955:08519] *** End of error message ***

With 16 ranks it sometimes works and sometimes segfaults, but with 2 ranks it always segfaults. Note that I always see this message:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007972
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------

regardless of how many ranks I use.

The segfault goes away when I disable the ofi MTL:

~/software/openmpi5/bin/mpirun -mca mtl ^ofi -np 2 ./osu_barrier

# OSU MPI Barrier Latency Test v7.4
# Avg Latency(us)
             0.21
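The printed error points at mtl_ofi_component.c, while the backtrace passes through mca_btl_base_select and fi_fabric, so it may help to exclude the ofi MTL and the ofi BTL separately and to raise the framework verbosity. The following is a minimal sketch of such a comparison, not a verified diagnosis. `run_case` and the `DRY_RUN` switch are hypothetical names introduced for this example; `--mca mtl ^ofi`, `--mca btl ^ofi`, `mtl_base_verbose` and `btl_base_verbose` are standard Open MPI MCA settings. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: run the same benchmark with different OFI components excluded.
# MPIRUN and BENCH default to the paths used in this report.
MPIRUN="${MPIRUN:-$HOME/software/openmpi5/bin/mpirun}"
BENCH="${BENCH:-./osu_barrier}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

# Hypothetical helper: $1 is a label, the remaining arguments are extra mpirun options.
run_case() {
    label="$1"
    shift
    # Build the full command line in the positional parameters.
    set -- "$MPIRUN" "$@" -np 2 "$BENCH"
    echo "[$label] $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
        echo "[$label] exit status: $?"
    fi
}

run_case "baseline"
run_case "no mtl ofi" --mca mtl '^ofi'
run_case "no btl ofi" --mca btl '^ofi'
run_case "verbose" --mca mtl_base_verbose 100 --mca btl_base_verbose 100
```

Run it with `DRY_RUN=0` inside the Slurm allocation to execute the four cases. Comparing which exclusion removes the segfault, together with the verbose output, should indicate which component makes the failing libfabric call.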

Is this a known problem?
