Conversation

@agoscinski (Contributor) commented May 9, 2025

PR #6793 introduced the cancellation of earlier kill actions. This created a problem: if two kill commands are issued in sequence, the second kill action cancels the first one, which had triggered the cancellation of the scheduler job within the EBM (exponential backoff mechanism). The second kill command, however, did not retrigger the cancellation of the scheduler job. This bug appeared because the killing logic lives in two places. More information can be found in PR #6868, which fixes this properly by refactoring the kill action. This PR only serves as a fast temporary fix with workarounds to get it ready for release v2.7.0; the proper solution takes more time.

Before this PR, once the kill command failed through the EBM, the scheduler job could no longer be cancelled by a kill. Since we now have the force-kill option to bypass the EBM, we can reschedule the cancellation of the scheduler job and still kill a process gracefully.
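
To make the race concrete, here is a minimal asyncio sketch of the failure mode; all names are hypothetical stand-ins, not the aiida-core API:

    # Hypothetical sketch of the bug: the second kill cancels the first kill's
    # scheduler-job cancellation task but never reschedules it.
    import asyncio

    async def cancel_scheduler_job():
        """Stands in for the EBM-wrapped cancellation of the scheduler job."""
        await asyncio.sleep(10)  # pretend the EBM is still retrying
        print('scheduler job cancelled')

    async def main():
        first_kill = asyncio.ensure_future(cancel_scheduler_job())
        await asyncio.sleep(0)  # let the first kill action start
        # The second kill cancels the earlier kill action ...
        first_kill.cancel()
        # ... but, before this PR, no new cancellation was scheduled, so the
        # scheduler job is never cancelled.
        await asyncio.sleep(0.1)
        print(f'cancellation task cancelled: {first_kill.cancelled()}')

    asyncio.run(main())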

except TransportTaskException as exception:
    raise plumpy.process_states.PauseInterruption(f'Pausing after failed transport task: {exception}')
except plumpy.process_states.KillInterruption as exception:
    await self._kill_job(node, transport_queue)
agoscinski (author):

So we no longer trigger the cancellation of the scheduler job from within the ProcessState. That made the kill command hard to manage, since killing happened in two places.

codecov bot commented May 9, 2025

Codecov Report

❌ Patch coverage is 90.62500% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.60%. Comparing base (43176cb) to head (7818465).
⚠️ Report is 51 commits behind head on main.

Files with missing lines | Patch % | Lines
src/aiida/engine/processes/process.py | 90.63% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6870      +/-   ##
==========================================
+ Coverage   78.60%   78.60%   +0.01%     
==========================================
  Files         567      567              
  Lines       43097    43127      +30     
==========================================
+ Hits        33871    33895      +24     
- Misses       9226     9232       +6     


@aiidateam aiidateam deleted a comment from khsrali May 9, 2025
@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from a8789b9 to 133d884 on May 9, 2025 17:04
Comment on lines 359 to 361
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.
agoscinski (author):

Please remove as soon as read

Suggested change
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.

khsrali (reviewer):

here also
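
For context, a hedged sketch of the error-propagation issue described in that comment; SwallowingAction is a hypothetical stand-in for a CancellableAction that silently catches errors (it is not plumpy's actual class):

    import asyncio

    class SwallowingAction:
        """Hypothetical stand-in for an action wrapper that silently catches any error."""

        def __init__(self, coro):
            self._coro = coro

        async def run(self):
            try:
                await self._coro
            except Exception:
                pass  # the error is swallowed; the caller cannot tell the cancellation failed

    async def cancel_scheduler_job():
        raise RuntimeError('transport failed while cancelling the scheduler job')

    async def main():
        # Wrapped in the action, the failure is invisible:
        await SwallowingAction(cancel_scheduler_job()).run()
        print('wrapped action: no exception seen')
        # As a bare task, the failure propagates and can be handled:
        try:
            await asyncio.ensure_future(cancel_scheduler_job())
        except RuntimeError as exc:
            print(f'bare task: caught {exc}')

    asyncio.run(main())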

Comment on lines 413 to 414
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868
agoscinski (author):

Suggested change
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868

khsrali (reviewer):

here

Comment on lines 342 to 345
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by cancelling the last killing action.
agoscinski (author):

Suggested change
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by cancelling the last killing action.

khsrali (reviewer):

I don't know, maybe you wanted to remove this?
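
The workaround follows the pattern this comment refers to; a heavily hedged illustration with hypothetical names, not the actual aiida-core/plumpy code:

    import asyncio

    class Process:
        """Hypothetical sketch: a new kill cancels the previous kill action, as plumpy
        does, but also resends the cancellation of the scheduler job."""

        def __init__(self):
            self._killing = None  # last pending kill action

        async def _cancel_scheduler_job(self):
            await asyncio.sleep(0.05)  # stands in for the EBM-wrapped cancellation
            print('scheduler job cancelled')

        def kill(self):
            # Cancel the earlier kill action (what the parent class in plumpy does) ...
            if self._killing is not None and not self._killing.done():
                self._killing.cancel()
            # ... and reschedule the scheduler-job cancellation so it is not lost with it.
            self._killing = asyncio.ensure_future(self._cancel_scheduler_job())

    async def main():
        process = Process()
        process.kill()
        process.kill()  # the second kill cancels the first but resends the cancellation
        await asyncio.sleep(0.1)

    asyncio.run(main())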

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

@agoscinski agoscinski mentioned this pull request May 9, 2025
agoscinski added a commit that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from 6aad3b0 to 133d884 on May 20, 2025 17:43
Comment on lines 303 to 309
# kill should start the EBM and not succeed in the EBM
run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])
await_condition(lambda: not node.is_killed, timeout=kill_timeout)

# kill should restart the EBM and not succeed in the EBM
run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])
await_condition(lambda: not node.is_killed, timeout=kill_timeout)
khsrali (reviewer):

why is this repeated two times?

agoscinski (author):

The second time it is restarted. I added a sentence to make it clearer what is tested by executing the command twice:

    # this tests if the old task is cancelled and restarted successfully

# (e.g. in scenarios that transport is working again)
"""Kill a process that is waiting after failed EBM during upload. It should be possible to kill it normally.
A process that failed upload (e.g. in scenarios that transport is working again) and is then killed with
khsrali (reviewer):

Please provide a comment on why specifically upload is important. This could be the EBM of another kill, right?

Note, as of before, a process that failed via the EBM would be marked as Excepted, therefore it was no longer possible to kill it gracefully.

One should still be able to kill gracefully once the connection is back. We need a test that covers that.

agoscinski (author) commented Jun 2, 2025:

> Please provide a comment on why specifically upload is important.

It is only important that it is not the killing itself but the transport task that fails. I renamed the function and updated the docstring:

    """Kill a process that is waiting after failed EBM during a transport task.

    It should be possible to kill it normally. A process that failed a transport task temporary should be killable.
    (e.g. the upload failed because of the interent connection problems that are fixed during the kill).
    """

> Note, as of before, a process that failed via the EBM would be marked as Excepted, therefore it was no longer possible to kill it gracefully.

Yes, this has been changed in this PR. You can now still kill gracefully when the EBM fails. Since we now have the force option, there is a clear distinction between a graceful and a non-graceful kill. The previous in-between behaviour when the EBM failed never made sense.

> One should still be able to kill gracefully once the connection is back. We need a test that covers that.

This test covers exactly this case: the EBM fails, a graceful kill is sent, and it works. This could be tested for every process state in which the transport can fail, but that might be a bit overkill.
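
To illustrate the graceful/force distinction with a toy EBM; everything below is a hypothetical sketch, not the aiida-core implementation:

    import asyncio

    async def with_exponential_backoff(coro_factory, max_attempts=3, initial=0.01):
        """Toy EBM: retry a coroutine factory with exponentially growing waits."""
        delay = initial
        for attempt in range(max_attempts):
            try:
                return await coro_factory()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                await asyncio.sleep(delay)
                delay *= 2

    async def cancel_scheduler_job():
        raise ConnectionError('transport is down')

    async def kill(force=False):
        if not force:
            # graceful kill: cancel the scheduler job through the EBM first
            await with_exponential_backoff(cancel_scheduler_job)
        # force kill: bypass the EBM and terminate directly
        print('process terminated')

    async def main():
        try:
            await kill()  # the graceful kill fails while the transport is down ...
        except ConnectionError:
            print('graceful kill failed in the EBM')
        await kill(force=True)  # ... but a force kill still goes through

    asyncio.run(main())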

@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch 4 times, most recently from ec2ba6e to 16162b2 on June 2, 2025 12:51
@agoscinski agoscinski requested a review from khsrali June 2, 2025 12:55
@khsrali (Contributor) left a comment:

Thanks a lot @agoscinski
Looks good to me, a few minor comments.


try:
    self.loop.run_until_complete(self._cancelling_scheduler_job)
except Exception as exc:
    self.node.logger.error(f'While cancelling job error was raised: {exc!s}')
khsrali (reviewer):

maybe `!s` is redundant, as you already have an f-string

agoscinski (author):

removed `!s`

Comment on lines +235 to +234
It should be possible to kill it normally. A process that failed upload (e.g. in scenarios that transport is working
again) and is then killed
khsrali (reviewer):

👍

await_condition(lambda: not node.is_killed, timeout=kill_timeout)

# kill should restart the EBM and not succeed in the EBM
# this tests if the old task is cancelled and restarted successfully
khsrali (reviewer):

but how is "the old task is cancelled" being tested, exactly?

agoscinski (author):

I added a log message in the process that is now checked here. We cannot directly check whether the task is cancelled, since the daemon owns the asyncio task queue.
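
For illustration, a sketch of what such a log-based check could look like, reusing the test helpers quoted earlier in this PR; the exact log message and the `Log.collection.get_logs_for` query are assumptions, not the test's actual code:

    from aiida.orm import Log

    # Issue the second kill; the process is expected to log when the old task is cancelled.
    run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])

    # The cancellation itself happens inside the daemon's asyncio task queue, so the test
    # waits for the (hypothetical) log message to show up on the node instead.
    await_condition(
        lambda: any('old kill task was cancelled' in log.message for log in Log.collection.get_logs_for(node)),
        timeout=kill_timeout,
    )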

agoscinski (author) left a comment:

Removed the PR comments and adapted the test to check the cancellation by checking the log. See the last commit.


@agoscinski agoscinski requested a review from khsrali June 4, 2025 18:54
@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from 61d9160 to 0fba44f on June 4, 2025 18:54
agoscinski and others added 6 commits June 4, 2025 21:00
@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from 0fba44f to 7818465 on June 4, 2025 19:00
@agoscinski agoscinski merged commit e768b70 into aiidateam:main Jun 4, 2025
11 checks passed
@agoscinski agoscinski deleted the fix/kill-continue-cancel-job branch June 4, 2025 19:33
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
The original design of `wait` and `timeout` was to distinguish between actions that return immediately and actions that are scheduled. This mechanism was however never used and resulted in a misinterpretation in the force-kill PR aiidateam#6793, introducing a bug fixed in PR aiidateam#6870. The mechanism of `wait` and `timeout` was also never correctly implemented. In this PR we rectify the logic and simplify it by handling immediate and scheduled actions the same way.

Related commits in aiida-core 8388018, cd0d15c and plumpy 1b6ecb8

One can specify a `timeout <= 0` to express that the action should not wait for a response, while one can specify `timeout == float('inf')` (the default value) to wait for a response without a time limit.
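
A sketch of these timeout semantics; `perform_action` is a hypothetical helper, not the actual plumpy/aiida-core API:

    import asyncio

    async def perform_action(action_future, timeout=float('inf')):
        """Dispatch an action: `timeout <= 0` means fire-and-forget,
        `timeout == float('inf')` (the default) means wait without a time limit."""
        if timeout <= 0:
            return None  # do not wait for a response
        if timeout == float('inf'):
            return await action_future  # wait indefinitely
        return await asyncio.wait_for(action_future, timeout=timeout)

    async def main():
        async def respond():
            await asyncio.sleep(0.01)
            return 'killed'
        print(await perform_action(asyncio.ensure_future(respond())))

    asyncio.run(main())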
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 11, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 11, 2025