Conversation

@agoscinski (Contributor) commented May 9, 2025

PR #6793 introduced the cancellation of earlier kill actions. This created a problem: if two kill commands are issued in sequence, the second kill action cancels the first one, which had triggered the cancellation of the scheduler job within the EBM (exponential backoff mechanism). The second kill command, however, did not retrigger the cancellation of the scheduler job. This bug appeared because the killing logic lives in two places. More information can be found in PR #6868, which fixes this properly by refactoring the kill action. This PR only serves as a fast temporary fix with workarounds to get it ready for release v2.7.0; the proper solution takes more time.

Before this PR, once the kill command failed through the EBM, the scheduler job could no longer be cancelled by a kill. Since we now have the force-kill option to bypass the EBM, we can reschedule the cancellation of the scheduler job and still kill a process gracefully.
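
To make the race concrete, here is a minimal asyncio sketch of the failure mode; all names are hypothetical stand-ins, not the aiida-core API:

    # Hypothetical sketch of the bug: the second kill cancels the first kill's
    # scheduler-job cancellation task but never reschedules it.
    import asyncio

    async def cancel_scheduler_job():
        """Stands in for the EBM-wrapped cancellation of the scheduler job."""
        await asyncio.sleep(10)  # pretend the EBM is still retrying
        print('scheduler job cancelled')

    async def main():
        first_kill = asyncio.ensure_future(cancel_scheduler_job())
        await asyncio.sleep(0)  # let the first kill action start
        # The second kill cancels the earlier kill action ...
        first_kill.cancel()
        # ... but, before this PR, no new cancellation was scheduled, so the
        # scheduler job is never cancelled.
        await asyncio.sleep(0.1)
        print(f'cancellation task cancelled: {first_kill.cancelled()}')

    asyncio.run(main())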

except TransportTaskException as exception:
    raise plumpy.process_states.PauseInterruption(f'Pausing after failed transport task: {exception}')
except plumpy.process_states.KillInterruption as exception:
    await self._kill_job(node, transport_queue)
agoscinski (author):

So we no longer trigger the cancellation of the scheduler job from within the ProcessState. That made the kill command hard to manage, since killing happened in two places.

codecov bot commented May 9, 2025

Codecov Report

❌ Patch coverage is 90.62500% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.60%. Comparing base (43176cb) to head (7818465).
⚠️ Report is 51 commits behind head on main.

Files with missing lines | Patch % | Lines
src/aiida/engine/processes/process.py | 90.63% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6870      +/-   ##
==========================================
+ Coverage   78.60%   78.60%   +0.01%     
==========================================
  Files         567      567              
  Lines       43097    43127      +30     
==========================================
+ Hits        33871    33895      +24     
- Misses       9226     9232       +6     


@aiidateam aiidateam deleted a comment from khsrali May 9, 2025
@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from a8789b9 to 133d884 on May 9, 2025 17:04
Comment on lines 359 to 361
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.
agoscinski (author):

Please remove as soon as read

Suggested change
# PR_COMMENT: We cannot reuse _killing because of type issues, it is a CancellableAction.
# We can wrap a task around a CancellableAction but the CancellableAction silently catches any
# error, while here we need to know if the cancelation of the scheduler job failed.

khsrali (reviewer):

here also
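
For context, a hedged sketch of the error-propagation issue described in that comment; SwallowingAction is a hypothetical stand-in for a CancellableAction that silently catches errors (it is not plumpy's actual class):

    import asyncio

    class SwallowingAction:
        """Hypothetical stand-in for an action wrapper that silently catches any error."""

        def __init__(self, coro):
            self._coro = coro

        async def run(self):
            try:
                await self._coro
            except Exception:
                pass  # the error is swallowed; the caller cannot tell the cancellation failed

    async def cancel_scheduler_job():
        raise RuntimeError('transport failed while cancelling the scheduler job')

    async def main():
        # Wrapped in the action, the failure is invisible:
        await SwallowingAction(cancel_scheduler_job()).run()
        print('wrapped action: no exception seen')
        # As a bare task, the failure propagates and can be handled:
        try:
            await asyncio.ensure_future(cancel_scheduler_job())
        except RuntimeError as exc:
            print(f'bare task: caught {exc}')

    asyncio.run(main())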

Comment on lines 413 to 414
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868
agoscinski (author):

Suggested change
# PR_COMMENT This is a copy of the function in engine/processes/calcjobs/tasks.py
# and will be merged to one place in PR #6868

khsrali (reviewer):

here

Comment on lines 342 to 345
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by cancelling the last killing action.
agoscinski (author):

Suggested change
# PR_COMMENT Because we need to overwrite the logic of the cancelation of the self._killing task of the
# scheduler job, we need to copy this logic of the parent class in plumpy. We need to adapt the
# cancelation of the last sent killing action to also resend the kill/cancelation of the scheduler
# job, as we stop this cancelation by cancelling the last killing action.

khsrali (reviewer):

I don't know, maybe you wanted to remove this?
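
The workaround follows the pattern this comment refers to; a heavily hedged illustration with hypothetical names, not the actual aiida-core/plumpy code:

    import asyncio

    class Process:
        """Hypothetical sketch: a new kill cancels the previous kill action, as plumpy
        does, but also resends the cancellation of the scheduler job."""

        def __init__(self):
            self._killing = None  # last pending kill action

        async def _cancel_scheduler_job(self):
            await asyncio.sleep(0.05)  # stands in for the EBM-wrapped cancellation
            print('scheduler job cancelled')

        def kill(self):
            # Cancel the earlier kill action (what the parent class in plumpy does) ...
            if self._killing is not None and not self._killing.done():
                self._killing.cancel()
            # ... and reschedule the scheduler-job cancellation so it is not lost with it.
            self._killing = asyncio.ensure_future(self._cancel_scheduler_job())

    async def main():
        process = Process()
        process.kill()
        process.kill()  # the second kill cancels the first but resends the cancellation
        await asyncio.sleep(0.1)

    asyncio.run(main())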

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

@agoscinski agoscinski mentioned this pull request May 9, 2025
agoscinski added a commit that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from 6aad3b0 to 133d884 on May 20, 2025 17:43
Comment on lines 303 to 309
# kill should start the EBM and not succeed in the EBM
run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])
await_condition(lambda: not node.is_killed, timeout=kill_timeout)

# kill should restart the EBM and not succeed in the EBM
run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])
await_condition(lambda: not node.is_killed, timeout=kill_timeout)
khsrali (reviewer):

why is this repeated two times?

agoscinski (author):

The second time it is restarted. I added a sentence to make it clearer what is tested by executing the command twice:

    # this tests if the old task is cancelled and restarted successfully

# (e.g. in scenarios that transport is working again)
"""Kill a process that is waiting after failed EBM during upload. It should be possible to kill it normally.
A process that failed upload (e.g. in scenarios that transport is working again) and is then killed with
khsrali (reviewer):

Please provide a comment on why specifically upload is important. This could be the EBM of another kill, right?

Note, as of before, a process that failed via the EBM would be marked as Excepted, therefore it was no longer possible to kill it gracefully.

One should still be able to kill gracefully once the connection is back. We need a test that covers that.

agoscinski (author) commented Jun 2, 2025:

> Please provide a comment on why specifically upload is important.

It is only important that it is not the killing itself but the transport task that fails. I renamed the function and updated the docstring:

    """Kill a process that is waiting after failed EBM during a transport task.

    It should be possible to kill it normally. A process that failed a transport task temporary should be killable.
    (e.g. the upload failed because of the interent connection problems that are fixed during the kill).
    """

> Note, as of before, a process that failed via the EBM would be marked as Excepted, therefore it was no longer possible to kill it gracefully.

Yes, this has been changed in this PR. You can now still kill gracefully when the EBM fails. Since we now have the force option, there is a clear distinction between a graceful and a non-graceful kill. The previous in-between behaviour when the EBM failed never made sense.

> One should still be able to kill gracefully once the connection is back. We need a test that covers that.

This test covers exactly this case: the EBM fails, a graceful kill is sent, and it works. This could be tested for every process state in which the transport can fail, but that might be a bit overkill.
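
To illustrate the graceful/force distinction with a toy EBM; everything below is a hypothetical sketch, not the aiida-core implementation:

    import asyncio

    async def with_exponential_backoff(coro_factory, max_attempts=3, initial=0.01):
        """Toy EBM: retry a coroutine factory with exponentially growing waits."""
        delay = initial
        for attempt in range(max_attempts):
            try:
                return await coro_factory()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                await asyncio.sleep(delay)
                delay *= 2

    async def cancel_scheduler_job():
        raise ConnectionError('transport is down')

    async def kill(force=False):
        if not force:
            # graceful kill: cancel the scheduler job through the EBM first
            await with_exponential_backoff(cancel_scheduler_job)
        # force kill: bypass the EBM and terminate directly
        print('process terminated')

    async def main():
        try:
            await kill()  # the graceful kill fails while the transport is down ...
        except ConnectionError:
            print('graceful kill failed in the EBM')
        await kill(force=True)  # ... but a force kill still goes through

    asyncio.run(main())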

@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch 4 times, most recently from ec2ba6e to 16162b2 on June 2, 2025 12:51
@agoscinski agoscinski requested a review from khsrali June 2, 2025 12:55
@khsrali (Contributor) left a comment:

Thanks a lot @agoscinski
Looks good to me, a few minor comments.


try:
    self.loop.run_until_complete(self._cancelling_scheduler_job)
except Exception as exc:
    self.node.logger.error(f'While cancelling job error was raised: {exc!s}')
khsrali (reviewer):

maybe `!s` is redundant, as you already have an f-string

agoscinski (author):

removed `!s`

Comment on lines +235 to +234
It should be possible to kill it normally. A process that failed upload (e.g. in scenarios that transport is working
again) and is then killed
khsrali (reviewer):

👍

await_condition(lambda: not node.is_killed, timeout=kill_timeout)

# kill should restart the EBM and not succeed in the EBM
# this tests if the old task is cancelled and restarted successfully
khsrali (reviewer):

but how is "the old task is cancelled" being tested, exactly?

agoscinski (author):

I added a log message in the process that is now checked here. We cannot directly check whether the task is cancelled, since the daemon owns the asyncio task queue.
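
For illustration, a sketch of what such a log-based check could look like, reusing the test helpers quoted earlier in this PR; the exact log message and the `Log.collection.get_logs_for` query are assumptions, not the test's actual code:

    from aiida.orm import Log

    # Issue the second kill; the process is expected to log when the old task is cancelled.
    run_cli_command(cmd_process.process_kill, [str(node.pk), '--wait'])

    # The cancellation itself happens inside the daemon's asyncio task queue, so the test
    # waits for the (hypothetical) log message to show up on the node instead.
    await_condition(
        lambda: any('old kill task was cancelled' in log.message for log in Log.collection.get_logs_for(node)),
        timeout=kill_timeout,
    )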

agoscinski (author) left a comment:

Removed the PR comments and adapted the test to check the cancellation by checking the log. See the last commit.


@agoscinski agoscinski requested a review from khsrali June 4, 2025 18:54
@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from 61d9160 to 0fba44f on June 4, 2025 18:54
agoscinski and others added 6 commits June 4, 2025 21:00
@agoscinski agoscinski force-pushed the fix/kill-continue-cancel-job branch from 0fba44f to 7818465 on June 4, 2025 19:00
@agoscinski agoscinski merged commit e768b70 into aiidateam:main Jun 4, 2025
11 checks passed
@agoscinski agoscinski deleted the fix/kill-continue-cancel-job branch June 4, 2025 19:33
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
The original design of `wait` and `timeout` was to distinguish between actions that return immediately and actions that are scheduled. This mechanism was however never used and resulted in a misinterpretation in the force-kill PR aiidateam#6793, introducing a bug fixed in PR aiidateam#6870. The mechanism of `wait` and `timeout` was also never correctly implemented. In this PR we rectify the logic and simplify it by handling immediate and scheduled actions the same way.

Related commits in aiida-core 8388018, cd0d15c and plumpy 1b6ecb8

One can specify a `timeout <= 0` to express that the action should not wait for a response, while one can specify `timeout == float('inf')` (the default value) to wait for a response without a time limit.
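
A sketch of these timeout semantics; `perform_action` is a hypothetical helper, not the actual plumpy/aiida-core API:

    import asyncio

    async def perform_action(action_future, timeout=float('inf')):
        """Dispatch an action: `timeout <= 0` means fire-and-forget,
        `timeout == float('inf')` (the default) means wait without a time limit."""
        if timeout <= 0:
            return None  # do not wait for a response
        if timeout == float('inf'):
            return await action_future  # wait indefinitely
        return await asyncio.wait_for(action_future, timeout=timeout)

    async def main():
        async def respond():
            await asyncio.sleep(0.01)
            return 'killed'
        print(await perform_action(asyncio.ensure_future(respond())))

    asyncio.run(main())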
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 11, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 11, 2025