Regular killing reschedules a cancel of scheduler job #6870
Conversation
    except TransportTaskException as exception:
        raise plumpy.process_states.PauseInterruption(f'Pausing after failed transport task: {exception}')
    except plumpy.process_states.KillInterruption as exception:
        await self._kill_job(node, transport_queue)
So we no longer trigger the cancelation of the scheduler job in the ProcessState. That made the kill command hard to manage, as killing happened in two places.
Codecov Report ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
##             main    #6870      +/-   ##
==========================================
+ Coverage   78.60%   78.60%    +0.01%
==========================================
  Files         567      567
  Lines       43097    43127       +30
==========================================
+ Hits        33871    33895       +24
- Misses       9226     9232        +6
a8789b9 to 133d884 (Compare)
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.
Please remove as soon as read
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.
here also
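To illustrate the point in the comment above, here is a minimal, self-contained asyncio sketch. `SwallowingAction` and `cancel_scheduler_job` are toy stand-ins invented for this example, not aiida-core's actual `CancellableAction` or task functions: a plain task propagates the failure of the cancelation to whoever awaits it, while a wrapper that catches everything hides it.

```python
import asyncio

async def cancel_scheduler_job():
    # Hypothetical stand-in for the scheduler-job cancelation; here it always fails.
    raise RuntimeError('cancelation failed')

class SwallowingAction:
    """Toy stand-in for an action that silently catches any error."""

    def __init__(self, coro):
        self.task = asyncio.ensure_future(self._run(coro))

    async def _run(self, coro):
        try:
            await coro
        except Exception:
            pass  # the error is swallowed; the caller cannot observe it

async def main():
    # A plain task propagates the failure to whoever awaits it.
    failed = False
    try:
        await asyncio.ensure_future(cancel_scheduler_job())
    except RuntimeError:
        failed = True

    # The wrapped action completes without error even though the coroutine failed.
    action = SwallowingAction(cancel_scheduler_job())
    await action.task
    return failed

plain_task_failed = asyncio.run(main())
```

This is why a plain task (rather than the existing `_killing` action) is needed when the caller must learn whether the cancelation of the scheduler job failed.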
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868
here
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by canceling the last killing action.
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by canceling the last killing action.
I don't know, maybe you wanted to remove this?
Squashed commit at 2025-05-09 21:53: PR #6793 introduced the cancelation of earlier kill actions. This had the problem that if two kill commands are sent in sequence, the second kill action cancels the first one, which had triggered the cancelation of the scheduler job within the EBM. The second kill command, however, did not retrigger the cancelation of the scheduler job. This bug appeared because the killing logic lives in two places. More information can be found in PR #6868, which fixes this properly by refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the kill command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we now have a force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
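The failure mode described in the commit message can be reproduced with a small asyncio model. All names here (`JobKiller`, `demo`) are illustrative, not aiida-core's real classes: each kill cancels the previous kill task, and unless the scheduler-job cancelation is rescheduled, a second kill leaves the job alive.

```python
import asyncio

class JobKiller:
    """Toy model of the bug described above; not aiida-core's real classes.

    Each kill command cancels the previous kill task. Before the fix, the
    scheduler-job cancelation started by the first kill was cancelled along
    with it and never retriggered; rescheduling it restores graceful killing.
    """

    def __init__(self):
        self._killing = None
        self.scheduler_job_cancelled = False

    async def _cancel_scheduler_job(self):
        await asyncio.sleep(0.05)  # stands in for the slow EBM round-trip
        self.scheduler_job_cancelled = True

    def kill(self, reschedule):
        if self._killing is not None:
            self._killing.cancel()  # a second kill cancels the first one...
            if not reschedule:
                return  # ...and, before the fix, never retriggered the cancelation
        self._killing = asyncio.ensure_future(self._cancel_scheduler_job())

async def demo(reschedule):
    killer = JobKiller()
    killer.kill(reschedule)
    await asyncio.sleep(0.01)  # the second kill arrives while the first is still pending
    killer.kill(reschedule)
    await asyncio.sleep(0.1)
    return killer.scheduler_job_cancelled

buggy_result = asyncio.run(demo(reschedule=False))  # job is never cancelled
fixed_result = asyncio.run(demo(reschedule=True))   # cancelation is rescheduled
```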
6aad3b0 to 133d884 (Compare)
    # kill should start EBM and be not successful in EBM
    run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])
    await_condition(lambda: not node.is_killed, timeout=kill_timeout)

    # kill should restart EBM and be not successful in EBM
    run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])
    await_condition(lambda: not node.is_killed, timeout=kill_timeout)
why is this repeated two times?
The second time it is restarted. I added a sentence to make it clearer what is tested by executing the command twice:
# this tests if the old task is cancelled and restarted successfully
    # (e.g. in scenarios that transport is working again)
    """Kill a process that is waiting after failed EBM during upload. It should be possible to kill it normally.
    A process that failed upload (e.g. in scenarios that transport is working again) and is then killed with
Please provide a comment on why specifically upload is important?
This could be the EBM of another kill, right?
Note, as before, a process that failed via the EBM would be marked as Excepted,
therefore it was not possible anymore to kill it gracefully.
One should still be able to kill gracefully once the connection is back.
We need a test that covers that.
Please provide a comment on why specifically upload is important?
It is only important that it is not the killing itself but the transport task that fails. I renamed the function and updated the docstring:
"""Kill a process that is waiting after a failed EBM during a transport task.
It should be possible to kill it normally. A process that temporarily failed a transport task should be killable
(e.g. the upload failed because of internet connection problems that are fixed during the kill).
"""
Note, as before, a process that failed via the EBM would be marked as Excepted, therefore it was not possible anymore to kill it gracefully
Yes, this has been changed in this PR. You can now still kill gracefully when the EBM fails. Since we now have the force option, there is a clear distinction between a graceful and a non-graceful kill; the previous in-between behaviour when the EBM failed never made sense.
One should still be able to kill gracefully once the connection is back. We need a test that covers that.
This test covers exactly that case: if the EBM fails, send a graceful kill and it works. This could be tested for every state of the process in which the transport could fail, but that might be a bit overkill.
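For context, `await_condition` used in the tests above is a polling helper from aiida-core's test suite. A minimal sketch of what such a helper does (the real implementation may differ in signature and behaviour):

```python
import time

def await_condition(condition, timeout=10.0, interval=0.1):
    """Poll ``condition`` until it returns truthy or ``timeout`` elapses.

    Minimal sketch only; the real helper in aiida-core's test
    infrastructure may differ in detail.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if condition():
            return
        time.sleep(interval)
    raise AssertionError(f'condition not met within {timeout} seconds')

# Usage, analogous to `await_condition(lambda: not node.is_killed, ...)`:
state = {'killed': False}
state['killed'] = True  # in the real test the daemon flips this state
await_condition(lambda: state['killed'], timeout=1.0)
```

The helper either returns once the condition holds or fails the test with an `AssertionError` after the timeout, which is why the tests can assert on daemon-side state transitions without direct access to the daemon's asyncio tasks.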
ec2ba6e to 16162b2 (Compare)
Thanks a lot @agoscinski
Looks good to me, a few minor comments.
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by canceling the last killing action.
I don't know, maybe you wanted to remove this?
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.
here also
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868
here
    try:
        self.loop.run_until_complete(self._cancelling_scheduler_job)
    except Exception as exc:
        self.node.logger.error(f'While cancelling job error was raised: {exc!s}')
maybe `!s` is redundant as you already use an f-string
removed !s
    It should be possible to kill it normally. A process that failed upload (e.g. in scenarios that transport is working
    again) and is then killed
👍
    await_condition(lambda: not node.is_killed, timeout=kill_timeout)

    # kill should restart EBM and be not successful in EBM
    # this tests if the old task is cancelled and restarted successfully
but how is "the old task is cancelled" being tested, exactly?
I added a log message in the process that is now checked here. We cannot directly check whether it is cancelled, as the daemon owns the asyncio task queue.
Removed PR comment and adapted test to check cancelation by checking the log. See last commit
61d9160 to 0fba44f (Compare)
Co-authored-by: Ali Khosravi <[email protected]>
0fba44f to 7818465 (Compare)
The original design of `wait` and `timeout` was to distinguish between actions that return immediately and actions that are scheduled. This mechanism was however never used and resulted in a misinterpretation in the force-kill PR #6793, introducing a bug fixed in PR #6870. The mechanism of `wait` and `timeout` was also never correctly implemented. In this PR we rectify the logic and simplify it by handling immediate and scheduled actions the same way. Related commits in aiida-core: 8388018, cd0d15c; and in plumpy: 1b6ecb8. One can specify `timeout <= 0` to express that the action should not wait for a response, while one can specify `timeout == float('inf')` (the default value) to wait for a response without a time limit.
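Under the rectified semantics described above, the `timeout` value alone selects the behaviour. A rough sketch under these assumptions (`kill_with_timeout` and `send_kill` are hypothetical names standing in for the actual RPC call, not aiida-core's or plumpy's API):

```python
import asyncio

async def kill_with_timeout(send_kill, timeout=float('inf')):
    """Sketch of the rectified `timeout` semantics.

    timeout <= 0            -> fire the action, do not wait for a response
    timeout == float('inf') -> wait for the response with no time limit (default)
    otherwise               -> wait at most `timeout` seconds

    `send_kill` is a hypothetical coroutine function standing in for the RPC.
    """
    if timeout <= 0:
        asyncio.ensure_future(send_kill())  # fire and forget
        return None
    if timeout == float('inf'):
        return await send_kill()
    return await asyncio.wait_for(send_kill(), timeout)

async def fake_send_kill():
    await asyncio.sleep(0)
    return 'killed'

result_wait = asyncio.run(kill_with_timeout(fake_send_kill))               # waits for the response
result_nowait = asyncio.run(kill_with_timeout(fake_send_kill, timeout=0))  # returns immediately
```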
PR #6793 introduced the cancelation of earlier kill actions. This had the problem that if two kill commands are sent in sequence, the second kill action cancels the first one, which had triggered the cancelation of the scheduler job within the EBM. The second kill command, however, did not retrigger the cancelation of the scheduler job. This bug appeared because the killing logic lives in two places. More information can be found in PR #6868, which fixes this properly by refactoring the kill action. This PR only serves as a fast temporary fix with workarounds to get it ready for release v2.7.0; the proper solution takes more time.
Before this PR, when the kill command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we now have a force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.