Regression: Spawned processes are not killed on timeout #13451

@AntonDaumen

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.8

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

907b1ccaeec61a1197f0ee5264d4fef20b257b84 3rd-party/openpmix (v5.0.8)
222f03fbb98b71abd293aa205b38fa9a38e57965 3rd-party/prrte (v3.0.11)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: RHEL 9.4 (Linux 5.14.0-427.42.1.el9_4.aarch64)
  • Computer hardware: ARM Neoverse-N1
  • Network type: no network used to reproduce

Details of the problem

First of all, sorry if this report belongs in the PRRTE GitHub issues; I wasn't sure and decided to open it here first. I'll open it there if that is more appropriate.

With Open MPI 5, when an MPI application with spawned processes hits a timeout, the spawned processes don't seem to be killed and the application doesn't stop. Most of the time the application is finally killed after exactly 1 hour, although I have seen cases where it seemed like the application was never killed.

This seems to be a regression, as I have never been able to reproduce it with any Open MPI 4 version.

I am using this simple test to reproduce the issue:
spawn_timeout_reprod.c
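
For readers without access to the attachment, here is a minimal sketch of what such a reproducer might look like. It is not the attached file: it is reconstructed only from the messages visible in the output below, and the constant names, the use of `argv[0]` as the spawn command, and the choice of `MPI_COMM_SELF` are assumptions.

```c
/* Hypothetical reconstruction, NOT the attached spawn_timeout_reprod.c.
 * The parent spawns one copy of itself, then parent and child both sleep
 * longer than the 5 s given to mpirun --timeout. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define SLEEP_SECONDS 15 /* assumed: matches "Sleeping for 15 seconds" */
#define NUM_CHILDREN  1  /* assumed: matches "Spawning 1 processes" */

int main(int argc, char **argv)
{
    MPI_Comm parent;
    MPI_Comm intercomm = MPI_COMM_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* No parent communicator: this is the process started by mpirun. */
        printf("Spawning %d processes\n", NUM_CHILDREN);
        fflush(stdout);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_CHILDREN, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
    } else {
        /* A parent communicator exists: this process was spawned. */
        printf("Got spawned\n");
        fflush(stdout);
    }

    /* Both sides outlive the timeout so that mpirun has to kill them. */
    printf("Sleeping for %d seconds\n", SLEEP_SECONDS);
    fflush(stdout);
    sleep(SLEEP_SECONDS);

    MPI_Finalize();
    return 0;
}
```

With a program of this shape, a working timeout should end the run shortly after 5 seconds, while a run that lasts the full 15 seconds means the processes were left to finish on their own.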

Compiled with: mpicc spawn_timeout_reprod.c -o spawn_timeout_reprod
Launched with: time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod

Below you will find both the Open MPI 4 and the Open MPI 5 output of this same test. Note the difference in the output of the --tag-output and --report-state-on-timeout options (although this is much less problematic): a lot of information about the spawned processes seems to be lost with Open MPI 5.

With Open MPI 4 the test is killed after around 6 seconds, so the timeout is effective. With Open MPI 5 the test only ends after 15 seconds, once the sleep completes, so the timeout is ineffective.

Open MPI 4 output

~/Workdir $ ompi_info | grep Ident
            Ident string: 4.1.8rc1

~/Workdir $ time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
[1,0]<stdout>:Spawning 1 processes
[1,0]<stdout>:Sleeping for 15 seconds
[2,0]<stdout>:Got spawned
[2,0]<stdout>:Sleeping for 15 seconds
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 5 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
DATA FOR JOB: [51378,0]
    Num apps: 1 Num procs: 1    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: login1    PID: 2596154    State: RUNNING  ExitCode 0

DATA FOR JOB: [51378,1]
    Num apps: 1 Num procs: 1    JobState: SYNC REGISTERED   Abort: False
    Num launched: 1 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: login1    PID: 2596157    State: SYNC REGISTERED  ExitCode 0

DATA FOR JOB: [51378,2]
    Num apps: 1 Num procs: 1    JobState: SYNC REGISTERED   Abort: False
    Num launched: 1 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: login1    PID: 2596160    State: SYNC REGISTERED  ExitCode 0

real    0m6.070s
user    0m0.025s
sys     0m0.037s

Open MPI 5 output

~/Workdir $ ompi_info | grep Ident
            Ident string: 5.0.8rc3

~/Workdir $ time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
[1,0]<stdout>: Spawning 1 processes
Got spawned
Sleeping for 15 seconds
[1,0]<stdout>: Sleeping for 15 seconds
[1,WILDCARD]<stderr>: --------------------------------------------------------------------------
[1,WILDCARD]<stderr>: The user-provided time limit for job execution has been reached:
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>:   Timeout: 5 seconds
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>: The job will now be aborted.  Please check your code and/or
[1,WILDCARD]<stderr>: adjust/remove the job execution time limit (as specified by --timeout
[1,WILDCARD]<stderr>: command line option or MPIEXEC_TIMEOUT environment variable).
[1,WILDCARD]<stderr>: --------------------------------------------------------------------------
[1,WILDCARD]<stderr>: DATA FOR JOB: prterun-login1-2597499@1
[1,WILDCARD]<stderr>:   Num apps: 1 Num procs: 1    JobState: SYNC REGISTERED   Abort: False
[1,WILDCARD]<stderr>:   Num launched: 1 Num reported: 1 Num terminated: 0
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>:   Procs:
[1,WILDCARD]<stderr>:       Rank: 0 Node: login1    PID: 2597502    State: SYNC REGISTERED  ExitCode 0
[1,WILDCARD]<stderr>:

real    0m15.214s
user    0m0.047s
sys     0m0.023s
