mpirun exiting basic program with Signal 27 (Profiling timer expired) #13496

@dsatagaj

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5 (was previously using v4.1.6 with no issues)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Source/Distribution tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: RHEL8
  • Computer hardware: Intel XEON(R) Gold 6254
  • Network type: ethernet

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I am upgrading from openmpi-4.1.6 to openmpi-5.0.5. The build succeeds and I can run the tools; however, every tool exits on signal 27 (Profiling timer expired), even when running basic test programs. For example, here is the output of a simple mpirun command that runs hostname:

[email protected] ~/openmpi-5.0.5]$ mpirun -np 4 hostname
dsat.org
dsat.org
------------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 1202879 on node dsat exited on
signal 27 (Profiling timer expired).
------------------------------------------------------------------------------

mpirun built from openmpi-4.1.6 runs this program (and my own programs) successfully.
I configured openmpi-4.1.6 with the following parameters:

./configure --disable-shared --enable-static --without-memory-manager --with-hwloc=internal --disable-io-romio --disable-hwloc-pci --without-verbs --enable-mpi-thread-multiple --enable-heterogenous --enable-mpi-cxx --enable-cxx-exceptions --enable-mpi-fortran=none

This is what I am attempting to configure openmpi-5.0.5 with:

./configure --disable-shared --enable-static --without-memory-manager --with-hwloc=internal --with-libevent=internal --with-pmix=internal --with-prrte=internal --disable-io-romio --disable-mpi-fortran

I have also tried configuring with a few different options to work around the problem, without success:

./configure --disable-shared --without-ft --enable-static --without-memory-manager --with-hwloc=internal --with-libevent=internal --with-pmix=internal --with-prrte=internal --disable-io-romio --disable-mpi-fortran --without-ucx --without-libfabric

I have confirmed that nothing on the system is profiling these processes and sending SIGPROF to the child processes. That, combined with the fact that mpirun built from openmpi-4.1.6 runs successfully, suggests to me that the issue is internal to openmpi-5.0.5, whether it is a bug or a configuration issue.

In order to debug this problem, I inserted the following patch into odls_default_module.c:

--- /openmpi-5.0.5/3rd-party/prrte/src/mca/odls/default/odls_default_module.c.orig 2025-11-06 09:11:24.00000000 -0500
+++ /openmpi-5.0.5/3rd-party/prrte/src/mca/odls/default/odls_default_module.c 2025-11-06 09:13:28.00000000 -0500

@@ -138,6 +138,15 @@
 #include "src/mca/odls/default/odls_default.h"
 #include "src/prted/pmix/pmix_server.h"

+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/time.h>
+#include <unistd.h>
+#ifdef __GLIBC__
+extern char *program_invocation_short_name;
+#endif
+
 /*
  * Module functions (function pointers used in a struct)
  */
@@ -631,6 +640,26 @@
     if (pid == 0) {
         close(p[0]);
+        /* --- DSAT DEBUG/FIX - print and clear inherited ITIMER_PROF --- */
+        const char *dsat_env = getenv("DSAT_ITIMER_PATCH");
+        if (dsat_env != NULL) {
+#ifdef ITIMER_PROF
+           struct itimerval t;
+           getitimer(ITIMER_PROF, &t);
+           struct sigaction sa;
+           sigaction(SIGPROF, NULL, &sa);
+           fprintf(stderr, "PID: %d, PPID: %d, executable: %s, inherited ITIMER_PROF: value [sec=%ld usec=%ld] interval [sec=%ld usec=%ld], SIGPROF handler=%p, flags=%d\n", getpid(), getppid(), program_invocation_short_name, t.it_value.tv_sec, t.it_value.tv_usec, t.it_interval.tv_sec, t.it_interval.tv_usec, sa.sa_handler, sa.sa_flags);
+           /* clear the timer so SIGPROF does not fire */
+           struct itimerval zero = {0};
+           setitimer(ITIMER_PROF, &zero, NULL);
+
+           /* ignore SIGPROF to be safe */
+           signal(SIGPROF, SIG_IGN);
+#endif
+        }
+        /* END DSAT DEBUG/FIX PATCH --- */
         do_child(cd, p[1]);
         /*Does not return */
     }

With this patch included, the program completes successfully (i.e., it ignores the SIGPROF). I get the following output from the same command, run as 'env DSAT_ITIMER_PATCH=1 mpirun -np 4 hostname':

[email protected] ~/openmpi-5.0.5]$ env DSAT_ITIMER_PATCH=1 mpirun -np 4 hostname
PID= 1207942: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
PID= 1207943: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
PID= 1207944: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
PID= 1207945: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
dsat.org
dsat.org
dsat.org
dsat.org

I took it one step further and ran the program under GDB. I get similar output, and when I inspect the SIGPROF handler value, I see that it resolves to the following symbol:

(gdb) info symbol 0x6c3d60
evsig_handler in section .text of /home/dsat/openmpi-5.0.5/build/openmpi/bin/prte

To summarize my investigation: it seems that libevent or prte is using profiling timers internally for timing. Somehow the handler is not being cleared correctly for the child processes, so SIGPROF is still being delivered to them, even when they run simple programs. Does anyone have insight into what is going on, whether it is a bug, or whether there are configuration options that could be used to fix it?
