
MPI_Init crashes with ffpe-trap=zero #12400

@mathomp4


Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI v5.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from a source tarball. Configure command was:

../configure --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
   --with-hwloc=internal --with-libevent=internal --with-pmix=internal \
   CC=gcc CXX=g++ FC=gfortran \
   --prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/5.0.2-SLES15/gcc-13.2.0

The GNU compiler version was 13.2.0.

Please describe the system on which you are running

  • Operating system/version: SLES15.4
  • Computer hardware: AMD Milan EPYC cluster (SCU17 at NASA NCCS).
  • Network type: Infiniband, though my issue is triggered on a single node.

Details of the problem

So, I've been bedeviled by a code of mine failing with Open MPI 5. If I build everything with Open MPI 4.1.6, it works; with Open MPI 5.0.2, it fails (and with Open MPI 5.0.0 as well). In my code it dies on:

call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierror)

So as a test, I made a reproducer:

program main
   use mpi

   implicit none
   integer :: ierror, provided

   call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierror)
   if (provided < MPI_THREAD_MULTIPLE) then
      print *, "We do not support MPI_THREAD_MULTIPLE"
      call MPI_Abort(MPI_COMM_WORLD, -1, ierror)
   else
      print *, "We support MPI_THREAD_MULTIPLE"
   end if
   call MPI_Finalize(ierror)
end program main

So, if I compile with just some simple flags:

$ mpifort -g -O0 -fbacktrace -o thread_multiple.exe thread_multiple.F90
$ mpirun -np 1 ./thread_multiple.exe
 We support MPI_THREAD_MULTIPLE

So I was baffled, but then I decided to try all the command-line flags we normally pass in, and one of them turns out to do something not good:

$ mpifort -g -O0 -fbacktrace -ffpe-trap=zero -o thread_multiple.exe thread_multiple.F90
$ mpirun -np 1 ./thread_multiple.exe

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x15139a7f4dbf in ???
#1  0x151393517946 in ???
#2  0x1513934ccc02 in ???
#3  0x1513934c7e3a in ???
#4  0x15139ab43b9d in hwloc_libxml_backend_init
	at /discover/swdev/gmao_SIteam/MPI/src/openmpi-5.0.2/build-gcc-13.2.0-SLES15/3rd-party/hwloc-2.7.1/hwloc/topology-xml-libxml.c:363
#5  0x151398a8ccb7 in hwloc_xml_component_instantiate
	at /discover/swdev/gmao_SIteam/MPI/src/openmpi-5.0.2/build-gcc-13.2.0-SLES15/3rd-party/hwloc-2.7.1/hwloc/topology-xml.c:3669
#6  0x151398a80424 in hwloc_disc_component_force_enable
	at /discover/swdev/gmao_SIteam/MPI/src/openmpi-5.0.2/build-gcc-13.2.0-SLES15/3rd-party/hwloc-2.7.1/hwloc/components.c:700
#7  0x151399b7d26b in ???
#8  0x151399b6525c in ???
#9  0x151399ad7c51 in ???
#10  0x151399b6129a in ???
#11  0x151399ad8784 in ???
#12  0x151399ad8784 in ???
#13  0x15139aece57a in ???
#14  0x15139aecf273 in ???
#15  0x15139aec2786 in ???
#16  0x15139aef5bdf in ???
#17  0x15139b22e97c in ???
#18  0x400b85 in MAIN__
	at /home/mathomp4/MPITests/ThreadMultiple/thread_multiple.F90:7
#19  0x400cb2 in main
	at /home/mathomp4/MPITests/ThreadMultiple/thread_multiple.F90:2
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 28236 on node borgj033 exited on
signal 8 (Floating point exception).
--------------------------------------------------------------------------

So then I thought, let's get really boring:

$ cat boring_init.F90
program main
   use mpi

   implicit none
   integer :: ierror

   call MPI_Init(ierror)
   call MPI_Finalize(ierror)
end program main

and:

$ mpifort -g -O0 -fbacktrace -ffpe-trap=zero -o boring_init.exe boring_init.F90
$ mpirun -np 1 ./boring_init.exe

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x152ae4a87dbf in ???
#1  0x152ae18ca946 in ???
#2  0x152ae187fc02 in ???
#3  0x152ae187ae3a in ???
#4  0x152ae4dd6b9d in hwloc_libxml_backend_init
	at /discover/swdev/gmao_SIteam/MPI/src/openmpi-5.0.2/build-gcc-13.2.0-SLES15/3rd-party/hwloc-2.7.1/hwloc/topology-xml-libxml.c:363
#5  0x152ae2d1fcb7 in hwloc_xml_component_instantiate
	at /discover/swdev/gmao_SIteam/MPI/src/openmpi-5.0.2/build-gcc-13.2.0-SLES15/3rd-party/hwloc-2.7.1/hwloc/topology-xml.c:3669
#6  0x152ae2d13424 in hwloc_disc_component_force_enable
	at /discover/swdev/gmao_SIteam/MPI/src/openmpi-5.0.2/build-gcc-13.2.0-SLES15/3rd-party/hwloc-2.7.1/hwloc/components.c:700
#7  0x152ae3e1026b in ???
#8  0x152ae3df825c in ???
#9  0x152ae3d6ac51 in ???
#10  0x152ae3df429a in ???
#11  0x152ae3d6b784 in ???
#12  0x152ae3d6b784 in ???
#13  0x152ae516157a in ???
#14  0x152ae5162273 in ???
#15  0x152ae5155786 in ???
#16  0x152ae5188a9c in ???
#17  0x152ae54c1917 in ???
#18  0x4009f9 in MAIN__
	at /home/mathomp4/MPITests/ThreadMultiple/boring_init.F90:1
#19  0x400a46 in main
	at /home/mathomp4/MPITests/ThreadMultiple/boring_init.F90:2
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 28610 on node borgj033 exited on
signal 8 (Floating point exception).
--------------------------------------------------------------------------

So...yeah. The line it points to in the hwloc source is:

 357  static int
 358  hwloc_libxml_backend_init(struct hwloc_xml_backend_data_s *bdata,
 359                            const char *xmlpath, const char *xmlbuffer, int xmlbuflen)
 360  {
 361    xmlDoc *doc = NULL;
 362
 363    LIBXML_TEST_VERSION;
 364    hwloc_libxml2_init_once();
 365
 366    errno = 0; /* set to 0 so that we know if libxml2 changed it */

and from what I can see that bit of code is 12 years old.
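
For what it's worth, my understanding is that -ffpe-trap=zero arms the divide-by-zero trap for the whole process at startup, so any runtime floating-point division by zero, even one buried inside a library that MPI_Init pulls in, raises SIGFPE. A trivial illustration (hypothetical file name fpe_demo.F90, nothing MPI-related):

program fpe_demo
   implicit none
   real :: zero, x

   zero = 0.0
   ! Without -ffpe-trap=zero this just produces +Infinity;
   ! with it, the division traps and the program dies with SIGFPE here.
   x = 1.0 / zero
   print *, x
end program fpe_demo

So presumably something in libxml2's initialization path does a floating-point division by zero that nobody normally notices.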

Is there perhaps something I should pass when building the embedded hwloc to work around this?

I mean, for now, I'll try our model without -ffpe-trap=zero and see what happens.
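
One other thought, sketched below and untested: compile without -ffpe-trap=zero and instead turn on halting for divide-by-zero from the Fortran side after MPI_Init_thread returns, using the standard ieee_exceptions module. That should keep the trap active for our model's own numerics while letting hwloc/libxml2 initialize untrapped.

program main
   use mpi
   use, intrinsic :: ieee_exceptions, only: ieee_set_halting_mode, ieee_divide_by_zero

   implicit none
   integer :: ierror, provided

   ! Let MPI (and hwloc/libxml2 underneath it) initialize with no FP traps armed.
   call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierror)

   ! Now arm the divide-by-zero trap, roughly what -ffpe-trap=zero does,
   ! but only from this point on.
   call ieee_set_halting_mode(ieee_divide_by_zero, .true.)

   ! ... rest of the model ...

   call MPI_Finalize(ierror)
end program main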
