Commit 873cebb

Merge pull request #1727 from jsquyres/pr/mpirun-timeout-and-friends
mpirun.1in: add descriptions of new options
2 parents ceb2912 + cf27ec3 commit 873cebb

3 files changed: 61 additions, 12 deletions

contrib/completion/mpirun.zsh

Lines changed: 3 additions & 0 deletions
@@ -158,6 +158,7 @@ _mpirun() {
 '(-do-not-launch --do-not-launch)'{-do-not-launch,--do-not-launch}'[Perform all necessary operations to prepare to launch the application, but do not actually launch it]' \
 '(-do-not-resolve --do-not-resolve)'{-do-not-resolve,--do-not-resolve}'[Do not attempt to resolve interfaces]' \
 '(-enable-recovery --enable-recovery)'{-enable-recovery,--enable-recovery}'[Enable recovery from process failure (default: disabled)]' \
+'(-get-stack-traces --get-stack-traces)'{-get-stack-traces,--get-stack-traces}'[Upon timeout, obtain stack traces from all still-alive MPI processes (default: disabled)]' \
 '*'{-gmca,--gmca}'[Pass global MCA parameters that are applicable to all contexts (arg0 is the parameter name; arg1 is the parameter value)]:mca variable name:->mca_variable_name:mca variable value:->mca_variable_value' \
 '(- *)'{-h,--help}'[Help message]' \
 '*'{-H,-host,--host}'[List of hosts to invoke processes on]:hostnames:' \
@@ -192,6 +193,7 @@ _mpirun() {
 '(-report-child-jobs-separately --report-child-jobs-separately)'{-report-child-jobs-separately,--report-child-jobs-separately}'[Return the exit status of the primary job only]' \
 '(-report-events --report-events)'{-report-events,--report-events}'[Report events to a tool listening at the specified URI]:URI:' \
 '(-report-pid --report-pid)'{-report-pid,--report-pid}'[Printout pid on stdout (-), stderr (+), or a file (anything else)]:report file:_report_file' \
+'(-report-state-upon-timeout --report-state-upon-timeout)'{-report-state-upon-timeout,--report-state-upon-timeout}'[Upon timeout, print run-time status of each process]' \
 '(-report-uri --report-uri)'{-report-uri,--report-uri}'[Printout URI on stdout (-), stderr (+), or a file (anything else)]:report file:_report_file' \
 '(-rf --rankfile)'{-rf,--rankfile}'[Provide a rankfile file]:rank file:_files' \
 '(-s --preload-binary)'{-s,--preload-binary}'[Preload the binary on the remote machine before starting the remote process.]' \
@@ -202,6 +204,7 @@ _mpirun() {
 '(-staged --staged)'{-staged,--staged}'[Used staged execution if inadequate resources are present (cannot support MPI jobs)]' \
 '(-stdin --stdin)'{-stdin,--stdin}'[Specify procs to receive stdin \[rank, all, none\] (default: 0, indicating rank 0)]:rank list:' \
 '(-tag-output --tag-output)'{-tag-output,--tag-output}'[Tag all output with \[job,rank\]]' \
+'(-timeout --timeout)'{-timeout,--timeout}'[Timeout, in seconds, for the entire job]' \
 '(-timestamp-output --timestamp-output)'{-timestamp-output,--timestamp-output}'[Timestamp all application process output]' \
 '(-use-hwthread-cpus --use-hwthread-cpus)'{-use-hwthread-cpus,--use-hwthread-cpus}'[Use hardware threads as independent cpus]' \
 '(-use-regexp --use-regexp)'{-use-regexp,--use-regexp}'[Use regular expressions for launch]' \

orte/tools/orterun/help-orterun.txt

Lines changed: 4 additions & 5 deletions
@@ -642,14 +642,13 @@ number of processes to run:
 Please correct this value and try again.
 #
 [orterun:timeout]
-The user-provided time limit for job execution has been
-reached:
+The user-provided time limit for job execution has been reached:
 
-MPIEXEC_TIMEOUT: %d seconds
+Timeout: %d seconds
 
 The job will now be aborted. Please check your code and/or
-adjust/remove the job execution time limit (as specified by
-MPIEXEC_TIMEOUT in your environment or --timeout on the command line).
+adjust/remove the job execution time limit (as specified by --timeout
+command line option or MPIEXEC_TIMEOUT environment variable).
 #
 [orterun:conflict-env-set]
 ERROR: You have attempted to pass environment variables to Open MPI

orte/tools/orterun/orterun.1in

Lines changed: 54 additions & 7 deletions
@@ -1,5 +1,5 @@
 .\" -*- nroff -*-
-.\" Copyright (c) 2009-2014 Cisco Systems, Inc. All rights reserved.
+.\" Copyright (c) 2009-2016 Cisco Systems, Inc. All rights reserved.
 .\" Copyright (c) 2008-2009 Sun Microsystems, Inc. All rights reserved.
 .\" $COPYRIGHT$
 .\"
@@ -529,12 +529,41 @@ MCA parameter.
 .
 .
 .TP
+.B --get-stack-traces
+When paired with the
+.B --timeout
+option,
+.I mpirun
+will obtain and print out stack traces from all launched processes
+that are still alive when the timeout expires. Note that obtaining
+stack traces can take a little time and produce a lot of output,
+especially for large process-count jobs.
+.
+.
+.TP
 .B -debugger\fR,\fP --debugger
 Sequence of debuggers to search for when \fI--debug\fP is used (i.e.
 a synonym for \fIorte_base_user_debugger\fP MCA parameter).
 .
 .
 .TP
+.B --timeout \fR<seconds>
+The maximum number of seconds that
+.I mpirun
+(also known as
+.I mpiexec\fR,\fI oshrun\fR,\fI orterun\fR,\fI
+etc.)
+will run. After this many seconds,
+.I mpirun
+will abort the launched job and exit with a non-zero exit status.
+Using
+.B --timeout
+can be also useful when combined with the
+.B --get-stack-traces
+option.
+.
+.
+.TP
 .B -tv\fR,\fP --tv
 Launch processes under the TotalView debugger.
 Deprecated backwards compatibility flag. Synonym for \fI--debug\fP.
@@ -661,6 +690,14 @@ without clutter from mpirun itself.
 Disable the automatic --prefix behavior
 .
 .
+.TP
+.B --report-state-on-timeout
+When paired with the
+.B --timeout
+command line option, report the run-time subsystem state of each
+process when the timeout expires.
+.
+.
 .P
 There may be other options listed with \fImpirun --help\fP.
 .
@@ -669,12 +706,9 @@ There may be other options listed with \fImpirun --help\fP.
 .
 .TP
 .B MPIEXEC_TIMEOUT
-The maximum number of seconds that
-.I mpirun
-.RI ( mpiexec )
-will run. After this many seconds,
-.I mpirun
-will abort the launched job and exit.
+Synonym for the
+.B --timeout
+command line option.
 .
 .
 .\" **************************
@@ -1541,6 +1575,19 @@ In the event that one or more processes exit before calling MPI_FINALIZE, the
 return value of the MPI_COMM_WORLD rank of the process that \fImpirun\fP first notices died
 before calling MPI_FINALIZE will be returned. Note that, in general, this will
 be the first process that died but is not guaranteed to be so.
+.
+.PP
+If the
+.B --timeout
+command line option is used and the timeout expires before the job
+completes (thereby forcing
+.I mpirun
+to kill the job)
+.I mpirun
+will return an exit status equivalent to the value of
+.B ETIMEDOUT
+(which is typically 110 on Linux and OS X systems).
+
 .
 .\" **************************
 .\" See Also Section
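
Note (illustration only, not part of the commit): the man page text added above says that mpirun exits with a status equivalent to ETIMEDOUT when --timeout expires. A wrapper script could use that to tell a timeout apart from other failures. The sketch below is a minimal Python example under that assumption. The helper names build_mpirun_command and job_timed_out are hypothetical, and the check compares against errno.ETIMEDOUT symbolically because the numeric value is platform-dependent.

```python
import errno


def build_mpirun_command(app_argv, timeout_s, stack_traces=False):
    """Build an argv list for mpirun using the options described in this commit.

    Hypothetical helper: it only assembles the command line and does not
    launch anything.
    """
    cmd = ["mpirun", "--timeout", str(timeout_s)]
    if stack_traces:
        # Per the man page text, this option takes effect when paired with --timeout.
        cmd.append("--get-stack-traces")
    return cmd + list(app_argv)


def job_timed_out(exit_status):
    """Return True if an exit status matches this platform's ETIMEDOUT value.

    Assumes mpirun behaves as the man page describes and that the status
    is not altered by an intermediate shell or launcher.
    """
    return exit_status == errno.ETIMEDOUT
```

A caller might pass the resulting list to subprocess.run(cmd) and hand its returncode to job_timed_out; whether a given Open MPI build actually returns this status should be checked against its own mpirun(1) page.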
