Commit 49d65d0

Merge pull request #5299 from chu11/issue5210_shell_signal_race
shell: document signal race
2 parents a1c02ff + bd60cc5 commit 49d65d0

2 files changed: +21 -0 lines changed

src/shell/log.c

Lines changed: 1 addition & 0 deletions
@@ -364,6 +364,7 @@ int shell_log_init (flux_shell_t *shell, const char *progname)
     logger.level = FLUX_SHELL_NOTICE;
     logger.fp_level = FLUX_SHELL_NOTICE;
     logger.active = 0;
+    logger.exception_logged = 0;
     logger.fp = stderr;
     logger.rank = -1;
     if (progname && !(logger.prog = strdup (progname)))

src/shell/signals.c

Lines changed: 20 additions & 0 deletions
@@ -15,7 +15,27 @@
  *
  * SIGINT - forward to all local tasks
  * SIGTERM - forward
+ * SIGALRM - forward
  *
+ * Notes:
+ *
+ * By setting up the signal watchers during "shell.init", there is the
+ * potential for inconsistent exit codes if a signal is received before all
+ * tasks have started. For example, this could be seen with something
+ * like:
+ *
+ * jobid=`flux submit -n1000 foo.sh`
+ * flux job raise --type=foo --severity=0 $jobid
+ *
+ * i.e. raise sends SIGTERM to job/shell immediately after starting,
+ * but due to the large task count of 1000, the signal is received
+ * before tasks are all set up. Some tasks could receive SIGTERM while
+ * some (to be created ones) do not.
+ *
+ * Note that the shell should always return an error, but the error
+ * may not be consistent. This situation is extremely rare and only
+ * seen in testing situations such as the above. So we elect to not
+ * fix this race.
  */
 #define FLUX_SHELL_PLUGIN_NAME "signals"
 
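
The race described in the signals.c comment can be illustrated with a small, deterministic model. This is only a sketch and not flux-core code: struct task, start_tasks(), forward_signal() and task_exit_code() are hypothetical names invented for illustration, and the model assumes that a task which never sees the signal exits with code 0 while a signaled task exits with 128 + signum.

```c
#include <assert.h>
#include <signal.h>   /* SIGTERM, for callers of task_exit_code() */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one local task; not a flux-core type. */
struct task {
    bool started;   /* true once the task has been created */
    bool signaled;  /* true if a forwarded signal reached it */
};

/* Mark tasks in the half-open range [first, last) as started. */
static void start_tasks (struct task *tasks, size_t first, size_t last)
{
    assert (tasks != NULL && first <= last);
    for (size_t i = first; i < last; i++)
        tasks[i].started = true;
}

/* Forward a signal to every task started so far.  Tasks that have
 * not been started yet are skipped, which models the race window.
 * Returns the number of tasks that received the signal.
 */
static size_t forward_signal (struct task *tasks, size_t ntasks)
{
    size_t count = 0;

    assert (tasks != NULL);
    for (size_t i = 0; i < ntasks; i++) {
        if (tasks[i].started) {
            tasks[i].signaled = true;
            count++;
        }
    }
    return count;
}

/* Assumed exit code convention for this model: 128 + signum for a
 * signaled task, 0 for a task that ran to completion.
 */
static int task_exit_code (const struct task *t, int signum)
{
    return t->signaled ? 128 + signum : 0;
}
```

If start_tasks() covers only the first half of a task array before forward_signal() runs, and the second half is started afterwards, task_exit_code() reports 128 + SIGTERM for the early tasks and 0 for the late ones. That mixed set of per-task results is the kind of inconsistency the comment refers to.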
