Crash on MPI_Init with Open MPI 5.0.0, Intel Fortran, and -init=snan #12113

@mathomp4

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 5.0.0 and 4.1.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Both were installed from tarball. 4.1.5 was installed via:

mkdir build-intel-2021.6.0-SLES15 && cd build-intel-2021.6.0-SLES15
../configure --disable-wrapper-rpath --disable-wrapper-runpath --with-pmi --with-slurm \
   --enable-mpi1-compatibility --with-pmix --without-verbs \
   CC=icc CXX=icpc FC=ifort \
   --prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/4.1.5-SLES15/intel-2021.6.0 |& tee configure.intel-2021.6.0.log

and 5.0.0 was installed with:

mkdir build-intel-2021.6.0-SLES15 && cd build-intel-2021.6.0-SLES15
../configure --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
   --enable-mpi1-compatibility --with-pmix \
   CC=icc CXX=icpc FC=ifort \
   --prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/5.0.0-SLES15/intel-2021.6.0 |& tee configure.intel-2021.6.0.log

Please describe the system on which you are running

  • Operating system/version: SLES 15 SP4
  • Computer hardware: AMD Milan cluster
  • Network type: Infiniband

Details of the problem

Given this program:

program a
   use mpi
   implicit none
   integer :: ierror
   call mpi_init(ierror)
   call MPI_Finalize(ierror)
end program

I can trigger a crash in MPI_Init with Open MPI 5.0.0 when compiling with -init=snan. For example:

$ mpifort -V && mpirun -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

mpirun (Open MPI) 5.0.0

Report bugs to https://www.open-mpi.org/community/help/
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
just_init.exe      000000000040A40B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014ACCCEA5910  Unknown               Unknown  Unknown
libxml2.so.2.9.14  000014ACC70DE92F  xmlXPathInit          Unknown  Unknown
libxml2.so.2.9.14  000014ACC7093C03  Unknown               Unknown  Unknown
libxml2.so.2.9.14  000014ACC708EE3B  xmlCheckVersion       Unknown  Unknown
libhwloc.so.15.6.  000014ACCAB49862  Unknown               Unknown  Unknown
libhwloc.so.15.6.  000014ACCAB3E32B  Unknown               Unknown  Unknown
libhwloc.so.15.6.  000014ACCAB31327  Unknown               Unknown  Unknown
libopen-pal.so.80  000014ACCC03E3B9  opal_hwloc_base_g     Unknown  Unknown
libopen-pal.so.80  000014ACCC0222D0  Unknown               Unknown  Unknown
libopen-pal.so.80  000014ACCBFAD447  mca_base_framewor     Unknown  Unknown
libopen-pal.so.80  000014ACCC01BF04  Unknown               Unknown  Unknown
libopen-pal.so.80  000014ACCBFAE5FB  mca_base_framewor     Unknown  Unknown
libopen-pal.so.80  000014ACCBFAE5FB  mca_base_framewor     Unknown  Unknown
libmpi.so.40.40.0  000014ACCD0C45D9  Unknown               Unknown  Unknown
libmpi.so.40.40.0  000014ACCD0C4476  ompi_mpi_instance     Unknown  Unknown
libmpi.so.40.40.0  000014ACCD0B61C0  ompi_mpi_init         Unknown  Unknown
libmpi.so.40.40.0  000014ACCD0EF62D  MPI_Init              Unknown  Unknown
libmpi_mpifh.so.4  000014ACCD49F9B7  PMPI_Init_f08         Unknown  Unknown
just_init.exe      000000000040952E  MAIN__                      5  just_init.F90
just_init.exe      00000000004094E2  Unknown               Unknown  Unknown
libc-2.31.so       000014ACCCCC824D  __libc_start_main     Unknown  Unknown
just_init.exe      00000000004093FA  Unknown               Unknown  Unknown

However, the same thing with 4.1.5 works:

$ mpifort -V && mpirun -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

mpirun (Open MPI) 4.1.5

Report bugs to http://www.open-mpi.org/community/help/
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
$

If I don't use -init=snan, all is well with Open MPI 5.0.0:

$ mpirun -V
mpirun (Open MPI) 5.0.0

Report bugs to https://www.open-mpi.org/community/help/
$ mpifort -g -O0 -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
$

I also tried Intel MPI 2021.10.0 and that works:

$ mpiifort -V && mpirun -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.
$ mpiifort -g -O0 -init=snan -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
$

For completeness, I also built Open MPI 5.0.0 with ifx, icx, and icpx (instead of ifort, icc, and icpc). Compiling the test program with that ifx-based build crashes as well:

$ mpifort -V && mpirun -V
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230721
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

mpirun (Open MPI) 5.0.0

Report bugs to https://www.open-mpi.org/community/help/
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
libpthread-2.31.s  0000150210F21910  Unknown               Unknown  Unknown
libxml2.so.2.9.14  000015020C67392F  xmlXPathInit          Unknown  Unknown
libxml2.so.2.9.14  000015020C628C03  Unknown               Unknown  Unknown
libxml2.so.2.9.14  000015020C623E3B  xmlCheckVersion       Unknown  Unknown
libhwloc.so.15.6.  000015020EBA4862  Unknown               Unknown  Unknown
libhwloc.so.15.6.  000015020EB9932B  Unknown               Unknown  Unknown
libhwloc.so.15.6.  000015020EB8C327  Unknown               Unknown  Unknown
libopen-pal.so.80  00001502100C2E72  opal_hwloc_base_g     Unknown  Unknown
libopen-pal.so.80  00001502100AAF49  Unknown               Unknown  Unknown
libopen-pal.so.80  000015021003804A  mca_base_framewor     Unknown  Unknown
libopen-pal.so.80  00001502100A5ED6  Unknown               Unknown  Unknown
libopen-pal.so.80  0000150210038C61  mca_base_framewor     Unknown  Unknown
libopen-pal.so.80  0000150210038C61  mca_base_framewor     Unknown  Unknown
libmpi.so.40.40.0  0000150211536B12  ompi_mpi_instance     Unknown  Unknown
libmpi.so.40.40.0  000015021152A8EA  ompi_mpi_init         Unknown  Unknown
libmpi.so.40.40.0  000015021156A070  MPI_Init              Unknown  Unknown
libmpi_mpifh.so.4  00001502118CF3F8  PMPI_Init_f08         Unknown  Unknown
just_init.exe      0000000000409ACF  a                           5  just_init.F90
just_init.exe      0000000000409A8D  Unknown               Unknown  Unknown
libc-2.31.so       0000150210D4224D  __libc_start_main     Unknown  Unknown
just_init.exe      00000000004099BA  Unknown               Unknown  Unknown

This build uses ifx 2023.2.0 (instead of ifort 2021.6.0), which is the latest Intel compiler I have access to.

The thing is, as far as I know, -init=snan only initializes REAL and COMPLEX variables:

[no]snan       Determines whether the compiler initializes to signaling NaN all uninitialized variables of intrinsic type REAL or COMPLEX that are saved, local, automatic, or allocated variables.

and my code has a single integer and no reals! It's like somehow I'm...corrupting the code with that flag? 🤷🏼
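One possible explanation, offered here as an unconfirmed assumption rather than a diagnosis: floating-point trap settings are process-wide (per-thread) CPU state, not a property of individual variables. If -init=snan also causes the "invalid" floating-point exception to be unmasked at program start, then any library that deliberately computes a NaN during its own setup (for example 0.0/0.0) would receive SIGFPE, no matter which types the Fortran program declares. The backtrace ending in xmlXPathInit would be consistent with that, but whether the flag behaves this way and whether libxml2 performs such a division are both assumptions. The C sketch below only demonstrates the general mechanism on x86-64 Linux; the names nan_division_traps, on_sigfpe, and recover_point are hypothetical and not taken from Open MPI, hwloc, or libxml2.

```c
/* Sketch only: shows that unmasking the "invalid" FP exception makes an
 * ordinary 0.0/0.0 raise SIGFPE. Assumes x86-64 Linux (SSE MXCSR register).
 * All names here are illustrative, not from any library in the backtrace. */
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <xmmintrin.h>

static sigjmp_buf recover_point;

/* SIGFPE handler: jump back out instead of letting the process abort. */
static void on_sigfpe(int signo)
{
    (void)signo;
    siglongjmp(recover_point, 1);
}

/* Computes 0.0/0.0 once.
 * Returns 1 if the division raised SIGFPE, 0 if it quietly produced NaN.
 * unmask_invalid != 0 enables trapping on invalid operations first. */
int nan_division_traps(int unmask_invalid)
{
    struct sigaction sa, old_sa;
    unsigned int saved_csr = _mm_getcsr();   /* remember the current FP control state */
    volatile double zero = 0.0;              /* volatile: prevent constant folding */
    volatile double result;
    volatile int trapped = 0;

    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGFPE, &sa, &old_sa);

    if (sigsetjmp(recover_point, 1) == 0) {
        if (unmask_invalid)
            _mm_setcsr(saved_csr & ~_MM_MASK_INVALID);  /* clear the mask bit: trap on invalid */
        result = zero / zero;                /* the kind of operation a library might use to build a NaN */
        (void)result;
    } else {
        trapped = 1;                         /* reached via siglongjmp from the handler */
    }

    _mm_setcsr(saved_csr);                   /* restore the original FP control state */
    sigaction(SIGFPE, &old_sa, NULL);        /* restore the previous SIGFPE disposition */
    return trapped;
}
```

Calling nan_division_traps(0) models the default environment, where the division silently yields NaN; nan_division_traps(1) models a process started with invalid-operation trapping enabled. If the assumption above holds, explicitly choosing a non-trapping floating-point exception mode at compile time would be one way to test it.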
