🐛 FIX: Task.cancel should not set state as EXCEPTED
#4792
Task.cancel should not set state as EXCEPTED
Codecov Report
@@ Coverage Diff @@
## develop #4792 +/- ##
===========================================
- Coverage 79.58% 79.57% -0.01%
===========================================
Files 515 515
Lines 36931 36941 +10
===========================================
+ Hits 29387 29391 +4
- Misses 7544 7550 +6
After discussing with @muhrin , the source of the issue may lie in AiiDA directly using Aiida uses aiida-core/aiida/engine/daemon/runner.py Lines 31 to 32 in 285ca45
aiida-core/aiida/engine/transports.py Lines 122 to 123 in b07841a
aiida-core/aiida/engine/processes/calcjobs/manager.py Lines 290 to 291 in b07841a
In particular the first one seems problematic. @muhrin writes
Concerning keeping track of tasks associated with a
I.e. I assume we will need to make sure to use
Perhaps this (closing the processes) should even go inside aiida-core/aiida/engine/runners.py Lines 161 to 170 in da872b2
It seems to me that the runners currently don't keep track of the processes they are running, so one would need to add this accounting.
@ltalirz I think you are one step behind me here lol:
This is what I already said in the original issue description: #4648 (comment)
I'm already working on this, and on a way to gracefully stop processes: aiidateam/plumpy#213
I see, sorry, I hadn't seen aiidateam/plumpy#213. Do we still need these changes in plumpy then (https://github.com/aiidateam/plumpy/pull/214/files)? Hm... maybe re-raising makes sense nevertheless...
I think we might as well, because these changes just make sure the same thing is happening for python 3.7 and 3.8+, i.e. re-raising if
There was no explicit accounting of what processes a runner was running because once stopped it cannot be started again. Typically the Python interpreter as a whole would then shut down anyway, because the daemon worker or interactive shell was stopping, and so the tasks would be requeued automatically at some point by RabbitMQ.
Thanks for the explanation @sphuber! I guess we nevertheless all agree that it would be useful if the runners could clean up after themselves, even if just to make it easier to pinpoint leaking tasks (thinking e.g. of the test suite) and to avoid unexpected side effects of closing a runner. As for this PR, I guess we can still proceed; the idea being that the fixes will be re-evaluated once the more controlled shutdown is in place.
Just to note, this is the commit that added this extra complexity in the stopping of the daemon worker: 281241c
Fine to approve once the tests pass (maybe restarting will be enough here) |
yeh just the stupid pymatgen issue.
Anyhow, first I will now release a new version of plumpy with the minimal fix aiidateam/plumpy#214 and update to that version in this PR, which should then close #4648 and unblock the 1.6 release.
Then I can devote some more time to aiidateam/plumpy#213 and a more complete mechanism for stopping processes (intertwined with this issue of not excepting processes when a daemon worker loses connection with RMQ).
@chrisjsewell By the way, do you think it would be possible to test this? It would be great if we had some insurance against the bug resurfacing |
ermm you mean like this 😉: 6f041a5 (which did fail before the changes)
Looks great, thanks a lot! Maybe you want to clean up the PR description now and have it close #4648?
See #4648 (comment) for the diagnosis of the issue.
Currently, stopping the daemon in Python 3.7 excepts all processes.
This is due to the code in `shutdown_runner`, which cancels all asyncio tasks running on the loop, including process continue and transport tasks.
Cancelling a task raises an `asyncio.CancelledError`. In Python 3.8+ this exception only inherits from `BaseException`, and so is not caught by any `except Exception` "checkpoints" in plumpy/aiida-core. In Python <= 3.7, however, the exception is equal to `concurrent.futures.CancelledError`, and so it was caught by one of `Process.step`, `Running.execute` or `ProcessLauncher.handle_continue_exception`, and the process was set to an excepted state.
Ideally, in the long term, we will alter `shutdown_runner` to not use such a "brute-force" mechanism.
But in the short term this commit directly fixes the issue, by re-raising the `asyncio.CancelledError` exception.
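
The re-raise pattern described above can be illustrated with a small standalone sketch. It is not the plumpy or aiida-core code: `run_step`, `cancel_running_step`, `demo` and the `state` dictionary are hypothetical names used only for illustration. The point is that an explicit `except asyncio.CancelledError: raise` placed before `except Exception` keeps a cancelled task from being recorded as excepted, regardless of which base class the exception has in the running Python version.

```python
import asyncio


async def run_step(state):
    """Hypothetical stand-in for a process step (not the plumpy implementation)."""
    state["value"] = "running"
    try:
        await asyncio.sleep(3600)  # stands in for long-running work
        state["value"] = "finished"
    except asyncio.CancelledError:
        # Re-raise first: on Python <= 3.7 this class is an Exception subclass
        # and would otherwise be caught by the generic handler below.
        raise
    except Exception:
        state["value"] = "excepted"


async def cancel_running_step():
    state = {}
    task = asyncio.ensure_future(run_step(state))
    await asyncio.sleep(0)  # yield once so the step starts running
    task.cancel()           # what a brute-force shutdown does to every task
    try:
        await task
    except asyncio.CancelledError:
        pass  # the cancellation propagates to whoever awaits the task
    return state["value"], task.cancelled()


def demo():
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(cancel_running_step())
    finally:
        loop.close()
```

Calling `demo()` cancels the step while it is waiting; the state is never switched to `"excepted"` and the task itself is reported as cancelled.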