Description
Please submit all the information below so that we can understand the working environment that is the context for your question.
- If you have a problem building or installing Open MPI, be sure to read this.
- If you have a problem launching MPI or OpenSHMEM applications, be sure to read this.
- If you have a problem running MPI or OpenSHMEM applications (i.e., after launching them), be sure to read this.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.5 (was previously using v4.1.6 with no issues)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Source/Distribution tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
- Operating system/version: RHEL8
- Computer hardware: Intel XEON(R) Gold 6254
- Network type: ethernet
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
I am upgrading from openmpi-4.1.6 to openmpi-5.0.5. The build succeeds and I can run the tools; however, all tools die with signal 27 (Profiling timer expired), even when running basic test programs. For example, here is the output of a simple mpirun command for hostname:
[email protected] ~/openmpi-5.0.5]$ mpirun -np 4 hostname
dsat.org
dsat.org
------------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 1202879 on node dsat exited on
signal 27 (Profiling timer expired).
------------------------------------------------------------------------------
mpirun built with openmpi-4.1.6 runs this command (and my own programs) successfully.
I was configuring openmpi-4.1.6 with the following parameters:
./configure --disable-shared --enable-static --without-memory-manager --with-hwloc=internal --disable-io-romio --disable-hwloc-pci --without-verbs --enable-mpi-thread-multiple --enable-heterogenous --enable-mpi-cxx --enable-cxx-exceptions --enable-mpi-fortran=none
This is what I am attempting to configure openmpi-5.0.5 with:
./configure --disable-shared --enable-static --without-memory-manager --with-hwloc=internal --with-libevent=internal --with-pmix=internal --with-prrte=internal --disable-io-romio --disable-mpi-fortran
I have also attempted configuration with a few different options in an attempt to solve this problem (with no success):
./configure --disable-shared --without-ft --enable-static --without-memory-manager --with-hwloc=internal --with-libevent=internal --with-pmix=internal --with-prrte=internal --disable-io-romio --disable-mpi-fortran --without-ucx --without-libfabric
I have confirmed that nothing on the system is doing profiling that would cause SIGPROF to be sent to the child processes. That, combined with the fact that mpirun built from openmpi-4.1.6 runs successfully, indicates to me that the issue is internal to openmpi-5.0.5, whether it is a bug or a configuration issue.
In order to debug this problem, I inserted the following patch into odls_default_module.c:
--- /openmpi-5.0.5/3rd-party/prrte/src/mca/odls/default/odls_default_module.c.orig 2025-11-06 09:11:24.00000000 -0500
+++ /openmpi-5.0.5/3rd-party/prrte/src/mca/odls/default/odls_default_module.c 2025-11-06 09:13:28.00000000 -0500
@@ -138,6 +138,15 @@
#include "src/mca/odls/default/odls_default.h"
#include "src/prted/pmix/pmix_server.h"
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/time.h>
+#include <unistd.h>
+#ifdef __GLIBC__
+extern char *program_invocation_short_name;
+#endif
+
/*
* Module functions (function pointers used in a struct)
*/
@@ -631,6 +640,26 @@
if (pid == 0) {
close(p[0]);
+ /* --- DSAT DEBUG/FIX - print and clear inherited ITIMER_PROF --- */
+ const char *dsat_env = getenv("DSAT_ITIMER_PATCH");
+ if (dsat_env != NULL) {
+#ifdef ITIMER_PROF
+ struct itimerval t;
+ getitimer(ITIMER_PROF, &t);
+ struct sigaction sa;
+ sigaction(SIGPROF, NULL, &sa);
+ fprintf(stderr, "PID: %d, PPID: %d, executable: %s, inherited ITIMER_PROF: value [sec=%ld usec=%ld] interval [sec=%ld usec=%ld], SIGPROF handler=%p, flags=%d\n", getpid(), getppid(), program_invocation_short_name, t.it_value.tv_sec, t.it_value.tv_usec, t.it_interval.tv_sec, t.it_interval.tv_usec, sa.sa_handler, sa.sa_flags);
+ /* clear the timer so SIGPROF does not fire */
+ struct itimerval zero = {0};
+ setitimer(ITIMER_PROF, &zero, NULL);
+
+ /* ignore SIGPROF to be safe */
+ signal(SIGPROF, SIG_IGN);
+#endif
+ }
+ /* END DSAT DEBUG/FIX PATCH --- */
do_child(cd, p[1]);
/*Does not return */
}
With this patch included, the program completes successfully (i.e., it ignores the SIGPROF). I get the following output from the same command, run as 'env DSAT_ITIMER_PATCH=1 mpirun -np 4 hostname':
[email protected] ~/openmpi-5.0.5]$ env DSAT_ITIMER_PATCH=1 mpirun -np 4 hostname
PID= 1207942: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
PID= 1207943: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
PID= 1207944: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
PID= 1207945: PPID: 1207939: executable: prterun: inherited ITIMER_PROF: value [sec=0 usec=0] interval [sec=0 usec=0], SIGPROF handler=0x6c3d60, flags=335544320
dsat.org
dsat.org
dsat.org
dsat.org
I took it one step further and ran the program under GDB. I get similar output, and when I inspect the SIGPROF handler value, I see that it resolves to the following symbol:
(gdb) info symbol 0x6c3d60
evsig_handler in section .text of /home/dsat/openmpi-5.0.5/build/openmpi/bin/prte
The summary of my investigation is this: libevent or prte appears to be using profiling timers internally for timing. Somehow the SIGPROF handling is not being cleared correctly for the child processes, so the signal still reaches them, even when running simple programs. Does anyone have insight into what is going on, whether it is a bug, or whether there are configuration options that would fix it?