Closed
Description
Hi,
I compiled Open MPI v5.0.5 on LUMI (an HPE Cray system with Slingshot-11 (SS11) and AMD CPUs and GPUs). I used the PrgEnv-gnu/8.5.0 environment and configured the build with:
./configure --prefix=/users/makrotki/software/openmpi5 --with-ofi=/opt/cray/libfabric/1.15.2.0/
I ran some OSU benchmarks and things generally look good: point-to-point tests yield the same performance as Cray MPI. However, I ran into a segfault in MPI_Init. For this run I allocated a single compute node through Slurm and ran:
~/software/openmpi5/bin/mpirun -np 2 ./osu_barrier
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid007955
Location: mtl_ofi_component.c:1007
Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007955:08519] *** Process received signal ***
[nid007955:08519] Signal: Segmentation fault (11)
[nid007955:08519] Signal code: Address not mapped (1)
[nid007955:08519] Failing at address: 0x140074656e7a
[nid007955:08519] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14f3d4b66910]
[nid007955:08519] [ 1] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3d0a6)[0x14f3cbe4e0a6]
[nid007955:08519] [ 2] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3cfeb)[0x14f3cbe4dfeb]
[nid007955:08519] [ 3] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x4d7ba)[0x14f3cbe5e7ba]
[nid007955:08519] [ 4] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(fi_fabric+0xa2)[0x14f3cbe2a172]
[nid007955:08519] [ 5] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(+0xa3db4)[0x14f3cbfb4db4]
[nid007955:08519] [ 6] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(mca_btl_base_select+0x14d)[0x14f3cbfa1ddd]
[nid007955:08519] [ 7] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x14f3d503d0c2]
[nid007955:08519] [ 8] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x14f3d503ae54]
[nid007955:08519] [ 9] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x27d34a)[0x14f3d51c634a]
[nid007955:08519] [10] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_pml_base_select+0x1ce)[0x14f3d51c287e]
[nid007955:08519] [11] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x9a92a)[0x14f3d4fe392a]
[nid007955:08519] [12] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x14f3d4fe4081]
[nid007955:08519] [13] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x14f3d4fdb8b6]
[nid007955:08519] [14] /users/makrotki/software/openmpi5/lib/libmpi.so.40(MPI_Init+0x5e)[0x14f3d500d46e]
[nid007955:08519] [15] ./osu_barrier[0x40675d]
[nid007955:08519] [16] ./osu_barrier[0x402810]
[nid007955:08519] [17] /lib64/libc.so.6(__libc_start_main+0xef)[0x14f3d498e24d]
[nid007955:08519] [18] ./osu_barrier[0x402d7a]
[nid007955:08519] *** End of error message ***
I also tried with 16 ranks, where it sometimes works and sometimes segfaults. With 2 ranks it always segfaults. Note that I always see this message:
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid007972
Location: mtl_ofi_component.c:1007
Error: Function not implemented (38)
--------------------------------------------------------------------------
regardless of how many ranks I use.
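One possible way to get more detail on the failing fi_domain call is to raise the logging level on both the libfabric and the Open MPI side before rerunning. The following is only a sketch and was not part of the runs reported here. It assumes that the Cray libfabric build honors `FI_LOG_LEVEL`, that an `fi_info` utility is on the PATH, and that the install paths match the ones used above. The `MPIRUN` and `FI_INFO` variables are hypothetical helpers introduced for illustration.

```shell
# Sketch only: these settings were not used for the output shown in this report.

# libfabric's own logging (assumption: honored by the Cray-provided build).
export FI_LOG_LEVEL=debug

# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables;
# this raises the verbosity of the MTL framework.
export OMPI_MCA_mtl_base_verbose=100

# Hypothetical helper variables; adjust to the local installation.
FI_INFO="${FI_INFO:-fi_info}"
MPIRUN="${MPIRUN:-$HOME/software/openmpi5/bin/mpirun}"

# List the providers that libfabric can see on this node, if fi_info exists.
if command -v "$FI_INFO" >/dev/null 2>&1; then
    "$FI_INFO" -l
fi

# Rerun the failing case only when both binaries are actually present.
if [ -x "$MPIRUN" ] && [ -x ./osu_barrier ]; then
    "$MPIRUN" -np 2 ./osu_barrier
else
    echo "mpirun or osu_barrier not found; skipping the run"
fi
```

The provider list and the debug log may help show which provider returns "Function not implemented" for fi_domain.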
The segfault goes away when I exclude the ofi MTL component:
~/software/openmpi5/bin/mpirun -mca mtl ^ofi -np 2 ./osu_barrier
# OSU MPI Barrier Latency Test v7.4
# Avg Latency(us)
0.21
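The same exclusion can also be expressed through the environment instead of the command line, which may be convenient in a batch script. This is a minimal sketch, assuming the same install path as above. `MPIRUN` is a hypothetical helper variable, and the guard simply skips the run if the binaries are missing.

```shell
# Equivalent to passing "-mca mtl ^ofi" to mpirun: Open MPI picks up MCA
# parameters from environment variables named OMPI_MCA_<parameter>.
export OMPI_MCA_mtl='^ofi'

# Hypothetical helper variable; adjust to the local installation.
MPIRUN="${MPIRUN:-$HOME/software/openmpi5/bin/mpirun}"

if [ -x "$MPIRUN" ] && [ -x ./osu_barrier ]; then
    "$MPIRUN" -np 2 ./osu_barrier
else
    echo "mpirun or osu_barrier not found; OMPI_MCA_mtl=$OMPI_MCA_mtl"
fi
```

The single quotes around `^ofi` keep the caret from being interpreted by shells that treat it specially.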
Is this a known problem?