Commit 6ef0d5b

job-exec: drain nodes with unkillable processes
Problem: when sdexec's stop timer terminates a subprocess exec RPC, the node is not drained and a subsequent job could be co-located with the unkillable processes. Add specific handling for an EDEADLK error that drains the node and then behaves the same as if an EHOSTUNREACH error were received.
1 parent a7b550d commit 6ef0d5b

1 file changed: +36 -1 lines changed
src/modules/job-exec/exec.c

Lines changed: 36 additions & 1 deletion
@@ -347,7 +347,42 @@ static void error_cb (struct bulk_exec *exec, flux_subprocess_t *p, void *arg)
      * create flux_cmd_t
      */
     if (cmd) {
-        if (errnum == EHOSTUNREACH) {
+        if (errnum == EDEADLK) {
+            /* EDEADLK from sdexec means that unkillable processes were left
+             * on the node and it must be drained. A "finished" response
+             * will not have been received, so after draining, treat this
+             * like EHOSTUNREACH.
+             */
+            char ranks[16];
+            snprintf (ranks, sizeof (ranks), "%d", rank);
+            (void) jobinfo_drain_ranks (job,
+                                        ranks,
+                                        "unkillable processes from job %s",
+                                        idf58 (job->id));
+            bool critical = is_critical_rank (job, shell_rank);
+
+            /* Always notify rank 0 shell of a lost shell.
+             */
+            lost_shell (job,
+                        critical,
+                        shell_rank,
+                        "shell exited with unkillable processes"
+                        " on %s (shell rank %d)",
+                        hostname,
+                        shell_rank);
+
+            /* Raise a fatal error and terminate job immediately if
+             * the lost shell was critical.
+             */
+            if (critical)
+                jobinfo_fatal_error (job,
+                                     0,
+                                     "shell exited with unkillable processes"
+                                     " on %s (rank %d)",
+                                     hostname,
+                                     rank);
+        }
+        else if (errnum == EHOSTUNREACH) {
             bool critical = is_critical_rank (job, shell_rank);
 
             /* Always notify rank 0 shell of a lost shell.
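The control flow this change adds can be sketched in isolation. The following is a minimal, self-contained illustration and not the job-exec code: `struct demo_job`, `demo_drain`, `demo_lost_shell`, and `demo_handle_error` are hypothetical stand-ins for `struct jobinfo`, `jobinfo_drain_ranks`, `lost_shell` plus `jobinfo_fatal_error`, and the errno dispatch in `error_cb`. It only shows that EDEADLK drains the rank first and then follows the same lost-shell path as EHOSTUNREACH.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for job state; not the real struct jobinfo. */
struct demo_job {
    char drained[16];   /* rank string handed to the drain stub, "" if none */
    int lost_shells;    /* count of lost-shell notifications */
    int fatal_errors;   /* count of fatal errors raised */
};

/* Stub for draining a rank: records the rank formatted as a string.
 * A 16-byte buffer is ample, since a 32-bit int prints as at most
 * 11 characters plus the terminating NUL.
 */
static void demo_drain (struct demo_job *job, int rank)
{
    snprintf (job->drained, sizeof (job->drained), "%d", rank);
}

/* Stub for the lost-shell path: always notify, and raise a fatal
 * error only when the lost shell was critical.
 */
static void demo_lost_shell (struct demo_job *job, bool critical)
{
    job->lost_shells++;
    if (critical)
        job->fatal_errors++;
}

/* Dispatch on errnum. Returns true if it was handled as a lost shell. */
static bool demo_handle_error (struct demo_job *job,
                               int errnum,
                               int rank,
                               bool critical)
{
    if (errnum == EDEADLK) {
        /* Unkillable processes remain: drain first, then lost shell */
        demo_drain (job, rank);
        demo_lost_shell (job, critical);
        return true;
    }
    if (errnum == EHOSTUNREACH) {
        /* Node unreachable: lost shell only, no drain in this path */
        demo_lost_shell (job, critical);
        return true;
    }
    return false;
}
```

In this sketch the two branches differ only in the drain step, which mirrors the commit message's description of EDEADLK as "drain, then behave like EHOSTUNREACH".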
