flux-perilog-run: do not drain ranks when job is canceled

grondo · grondo · commit 44b310ab45d1 · 2024-02-14T23:54:18.000Z
Problem: When a user cancels a job while the prolog is running, the prolog is canceled with SIGTERM. This terminates the flux-exec tasks spawned by the flux-perilog-run --exec-per-rank option, which in turn causes those ranks to be drained. Thus, a user can cause nodes to be drained simply by canceling their job.

If an exec-per-rank process is terminated by a signal but was not canceled by flux-perilog-run itself (i.e. proc.canceled is not True), assume the prolog was terminated and do not drain ranks. That is, only drain ranks when the subprocess returncode is greater than 0 and at most 128, which excludes both a shell-reported signal (128+n) and an actual signal (-n).

Fixes #5734
diff --git a/src/cmd/flux-perilog-run.py b/src/cmd/flux-perilog-run.py
@@ -175,8 +175,14 @@ async def run_per_rank(name, jobid, args):
         if proc.canceled:
             timeout_ids.set(rank)
             rc = 128 + signal.SIGTERM
-        elif rc != 0:
+        elif rc > 0 and rc <= 128:
+            #  process failed with non-zero exit code. Add this rank to
+            #  the failed set which will be drained.
             fail_ids.set(rank)
+        else:
+            #  process exited 0, was signaled (returncode < 0), or shell
+            #  reported it was signaled (128+n). Do nothing in this case.
+            pass
         if rc > returncode:
             returncode = rc
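
The branch logic in this hunk can be sketched as a standalone function for illustration. This is a minimal sketch and not code from flux-perilog-run: the name classify_returncode and the string labels it returns are hypothetical, and it assumes Python subprocess conventions, where a process killed by signal n reports returncode -n and a shell reporting a signaled child exits with 128+n.

```python
import signal


def classify_returncode(rc, canceled=False):
    """Mirror the if/elif/else chain in the hunk above (hypothetical helper).

    Returns one of:
      "timeout": the process was canceled by the caller itself
      "drain":   the process exited with an ordinary failure code (1..128)
      "ignore":  the process succeeded (0), was signaled (rc < 0), or a
                 shell reported that it was signaled (rc > 128)
    """
    if canceled:
        return "timeout"
    if 0 < rc <= 128:
        return "drain"
    return "ignore"


if __name__ == "__main__":
    cases = [
        ("exit 1", 1, False),
        ("killed by SIGTERM", -signal.SIGTERM, False),
        ("shell-reported SIGTERM", 128 + signal.SIGTERM, False),
        ("success", 0, False),
        ("canceled by caller", -signal.SIGTERM, True),
    ]
    for label, rc, canceled in cases:
        print(f"{label}: {classify_returncode(rc, canceled)}")
```

Under these assumptions, only an ordinary non-zero exit code marks a rank for draining; a job cancellation that delivers SIGTERM to the per-rank tasks falls into the "ignore" case.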