Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
5.0.8
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
907b1ccaeec61a1197f0ee5264d4fef20b257b84 3rd-party/openpmix (v5.0.8)
222f03fbb98b71abd293aa205b38fa9a38e57965 3rd-party/prrte (v3.0.11)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)
Please describe the system on which you are running
- Operating system/version: RHEL 9.4 (Linux 5.14.0-427.42.1.el9_4.aarch64)
- Computer hardware: ARM Neoverse-N1
- Network type: no network used to reproduce
Details of the problem
First of all, sorry if this report belongs in the PRRTE GitHub issues; I wasn't sure and decided to open it here first. I'll open it there if that is more appropriate.
With Open MPI 5, when an MPI application with spawned processes hits the mpirun timeout, the spawned processes do not seem to be killed and the application does not stop. Most of the time the application is finally killed after exactly 1 hour, although I have seen cases where it seemed never to be killed.
This seems to be a regression, as I have never been able to reproduce it with any Open MPI 4 version.
I am using this simple test to reproduce the issue:
spawn_timeout_reprod.c
Compiled with: mpicc spawn_timeout_reprod.c -o spawn_timeout_reprod
Launched with: time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
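For readers who cannot open the attachment, here is a minimal sketch of what a reproducer of this kind could look like. It is a hypothetical reconstruction inferred from the messages in the output, not the attached spawn_timeout_reprod.c itself. The constants NUM_SPAWNED and SLEEP_SECONDS, and the choice to spawn argv[0] over MPI_COMM_SELF, are assumptions.

```c
/*
 * Hypothetical reconstruction of a spawn + sleep reproducer.
 * NOT the attached spawn_timeout_reprod.c; structure and constants are
 * assumptions inferred from the printed messages.
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_SPAWNED   1   /* assumption: matches "Spawning 1 processes" */
#define SLEEP_SECONDS 15  /* assumption: matches "Sleeping for 15 seconds" */

int main(int argc, char **argv)
{
    MPI_Comm parent;
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Launched directly by mpirun: spawn copies of this same binary. */
        printf("Spawning %d processes\n", NUM_SPAWNED);
        fflush(stdout);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_SPAWNED, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
    } else {
        /* Started through MPI_Comm_spawn: reuse the parent intercommunicator. */
        printf("Got spawned\n");
        fflush(stdout);
        intercomm = parent;
    }

    /* Both sides sleep well past the 5 second mpirun timeout. */
    printf("Sleeping for %d seconds\n", SLEEP_SECONDS);
    fflush(stdout);
    sleep(SLEEP_SECONDS);

    /* Collective over the intercommunicator, so parent and child both call it. */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

A program like this can be built and launched with the same mpicc and mpirun commands given above. The sleep is deliberately longer than the timeout, so a run that lasts the full sleep duration means the timeout did not kill the job.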
Below you will find the output of this same test with both Open MPI 4 and Open MPI 5. Note also the difference in the output of the --tag-output and --report-state-on-timeout options (although this is much less problematic): a lot of information about the spawned processes seems to be lost with Open MPI 5.
With Open MPI 4 the test is killed after about 6 seconds, so the 5-second timeout is effective. With Open MPI 5 the test only ends after about 15 seconds, once the sleep finishes, so the timeout is ineffective.
Open MPI 4 output
~/Workdir $ ompi_info | grep Ident
Ident string: 4.1.8rc1
~/Workdir $ time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
[1,0]<stdout>:Spawning 1 processes
[1,0]<stdout>:Sleeping for 15 seconds
[2,0]<stdout>:Got spawned
[2,0]<stdout>:Sleeping for 15 seconds
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:
Timeout: 5 seconds
The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
DATA FOR JOB: [51378,0]
Num apps: 1 Num procs: 1 JobState: ALL DAEMONS REPORTED Abort: False
Num launched: 0 Num reported: 1 Num terminated: 0
Procs:
Rank: 0 Node: login1 PID: 2596154 State: RUNNING ExitCode 0
DATA FOR JOB: [51378,1]
Num apps: 1 Num procs: 1 JobState: SYNC REGISTERED Abort: False
Num launched: 1 Num reported: 1 Num terminated: 0
Procs:
Rank: 0 Node: login1 PID: 2596157 State: SYNC REGISTERED ExitCode 0
DATA FOR JOB: [51378,2]
Num apps: 1 Num procs: 1 JobState: SYNC REGISTERED Abort: False
Num launched: 1 Num reported: 1 Num terminated: 0
Procs:
Rank: 0 Node: login1 PID: 2596160 State: SYNC REGISTERED ExitCode 0
real 0m6.070s
user 0m0.025s
sys 0m0.037s
Open MPI 5 output
~/Workdir $ ompi_info | grep Ident
Ident string: 5.0.8rc3
~/Workdir $ time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
[1,0]<stdout>: Spawning 1 processes
Got spawned
Sleeping for 15 seconds
[1,0]<stdout>: Sleeping for 15 seconds
[1,WILDCARD]<stderr>: --------------------------------------------------------------------------
[1,WILDCARD]<stderr>: The user-provided time limit for job execution has been reached:
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>: Timeout: 5 seconds
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>: The job will now be aborted. Please check your code and/or
[1,WILDCARD]<stderr>: adjust/remove the job execution time limit (as specified by --timeout
[1,WILDCARD]<stderr>: command line option or MPIEXEC_TIMEOUT environment variable).
[1,WILDCARD]<stderr>: --------------------------------------------------------------------------
[1,WILDCARD]<stderr>: DATA FOR JOB: prterun-login1-2597499@1
[1,WILDCARD]<stderr>: Num apps: 1 Num procs: 1 JobState: SYNC REGISTERED Abort: False
[1,WILDCARD]<stderr>: Num launched: 1 Num reported: 1 Num terminated: 0
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>: Procs:
[1,WILDCARD]<stderr>: Rank: 0 Node: login1 PID: 2597502 State: SYNC REGISTERED ExitCode 0
[1,WILDCARD]<stderr>:
real 0m15.214s
user 0m0.047s
sys 0m0.023s