Commit 44b310a

flux-perilog-run: do not drain ranks when job is canceled
Problem: When a user cancels a job while the prolog is running, the prolog is canceled with a SIGTERM. This causes the flux-exec tasks spawned by the flux-perilog-run --exec-per-rank option to be terminated, which causes those ranks to be drained. Thus, a user can cause nodes to be drained by canceling their job.

If an exec-per-rank process is terminated by a signal but wasn't canceled by flux-perilog-run itself (i.e. proc.canceled is not True), then assume the prolog was terminated and do not drain ranks. That is, only drain ranks when the subprocess returncode is greater than 0 and at most 128. This excludes both a shell-reported signal (128+n) and an actual signal (-n).

Fixes #5734
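The decision described in the commit message can be sketched as a small standalone function. This is an illustration only: `classify_returncode` and its string labels are hypothetical names that do not appear in flux-perilog-run, and the sketch assumes the same branch order as the diff (canceled first, then the 1 to 128 range, then everything else).

```python
import signal


def classify_returncode(rc: int, canceled: bool) -> str:
    """Hypothetical helper mirroring the branch order described above."""
    if canceled:
        # flux-perilog-run canceled the process itself; the diff records
        # such ranks in timeout_ids.
        return "timeout"
    if 0 < rc <= 128:
        # Ordinary non-zero exit code: the rank is treated as failed and
        # will be drained.
        return "drain"
    # rc == 0 (success), rc < 0 (killed by a signal), or rc > 128
    # (shell-reported signal, 128+n): no rank is drained.
    return "no_action"


if __name__ == "__main__":
    sigterm = int(signal.SIGTERM)
    cases = [
        (1, False),
        (128, False),
        (-sigterm, False),
        (128 + sigterm, False),
        (0, False),
        (-sigterm, True),
    ]
    for rc, canceled in cases:
        print(rc, canceled, classify_returncode(rc, canceled))
```

Note that a successful exit (rc == 0) also lands in the final branch, which matches the `else: pass` arm of the diff.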
1 parent 35c2c10 commit 44b310a

1 file changed, 7 additions and 1 deletion

src/cmd/flux-perilog-run.py

Lines changed: 7 additions & 1 deletion
@@ -175,8 +175,14 @@ async def run_per_rank(name, jobid, args):
         if proc.canceled:
             timeout_ids.set(rank)
             rc = 128 + signal.SIGTERM
-        elif rc != 0:
+        elif rc > 0 and rc <= 128:
+            # process failed with non-zero exit code. Add this rank to
+            # the failed set which will be drained.
             fail_ids.set(rank)
+        else:
+            # process was signaled (returncode < 0) or shell reported it
+            # was signaled (128+n). Do nothing in this case.
+            pass
         if rc > returncode:
             returncode = rc
 
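The two signal encodings named in the diff comments (a negative returncode versus 128+n) can be observed with Python's standard `subprocess` module. This is a sketch under assumptions: it uses the stdlib rather than the subprocess interface flux-perilog-run actually uses, it requires a POSIX `sh`, and the function names are hypothetical.

```python
import signal
import subprocess


def direct_signal_returncode() -> int:
    # The shell sends SIGTERM to itself, so Python observes the signal
    # directly and reports it as a negative returncode (-n).
    return subprocess.run(["sh", "-c", "kill -TERM $$"]).returncode


def shell_reported_returncode() -> int:
    # An inner shell dies from SIGTERM; the outer shell survives and exits
    # with the status it observed, which by shell convention is 128+n.
    script = "sh -c 'kill -TERM $$'; exit $?"
    return subprocess.run(["sh", "-c", script]).returncode


if __name__ == "__main__":
    print("direct:", direct_signal_returncode())
    print("via shell:", shell_reported_returncode())
    print("SIGTERM number:", int(signal.SIGTERM))
```

On a typical Linux system, where SIGTERM is 15, the first value should be negative and the second above 128, so both fall outside the 1 to 128 range that the new `elif` treats as a failure.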