Skip to content

Conversation

@chrisjsewell
Copy link
Member

@chrisjsewell chrisjsewell commented Mar 4, 2021

See #4648 (comment) for the diagnosis of the issue.

Currently, stopping the daemon in python 3.7 excepts all processes.
This is due to the code in shutdown_runner,
which cancels all asyncio tasks running on the loop,
including process continue and transport tasks.

Cancelling a task raises an asyncio.CancellErrror.
In python 3.8+ this exception only inherits from BaseException,
and so is not caught by any except Exception "checkpoints" in plumpy/aiida-core.
In python <= 3.7 however, the exception is equal to concurrent.futures.CancelledError,
and so it was caught by one of:
Process.step, Running.execute or ProcessLauncher.handle_continue_exception
and the process was set to an excepted state.

Ideally in the long-term, we will alter shutdown_runner,
to not use such a "brute-force" mechanism.
But in the short-term term this commit directly fixes the issue,
by re-raising the asyncio.CancelledError exception.

@chrisjsewell chrisjsewell changed the title 🐛 FIX: Task.cancel` should not set state as EXCEPTED 🐛 FIX: Task.cancel should not set state as EXCEPTED Mar 4, 2021
@codecov
Copy link

codecov bot commented Mar 5, 2021

Codecov Report

Merging #4792 (87be417) into develop (a01b3ff) will decrease coverage by 0.02%.
The diff coverage is 45.46%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4792      +/-   ##
===========================================
- Coverage    79.58%   79.57%   -0.01%     
===========================================
  Files          515      515              
  Lines        36931    36941      +10     
===========================================
+ Hits         29387    29391       +4     
- Misses        7544     7550       +6     
Flag Coverage Δ
django 74.29% <45.46%> (-<0.01%) ⬇️
sqlalchemy 73.19% <45.46%> (-<0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/engine/transports.py 93.34% <33.34%> (-3.15%) ⬇️
aiida/manage/external/rmq.py 45.17% <33.34%> (-0.39%) ⬇️
aiida/engine/processes/calcjobs/tasks.py 71.38% <60.00%> (-0.33%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a01b3ff...87be417. Read the comment docs.

@ltalirz
Copy link
Member

ltalirz commented Mar 8, 2021

After discussing with @muhrin , the source of the issue may lie in AiiDA directly using .cancel() rather than going through the processes' built-in interruption mechanism.

Aiida uses .cancel() in three places:

for task in tasks:
task.cancel()

elif open_callback_handle is not None:
open_callback_handle.cancel()

if not request.done():
request.cancel()

In particular the first one seems problematic.

@muhrin writes

By looping over all the tasks in the event loop (with no knowledge or care of which process they belong to) it's really ripping the rug out from under the Process and the process should not be expected to deal with such a scenario gracefully

from an architectural perspective I would think that shutdown_runner would rather be looping over all Processees and interrupting them, and they in turn would do any cleanup of outstanding coroutines, etc.

So the gist of what I'm suggesting is:

  1. Make sure close() (in AiiDA's Process subclass) frees all resources (i.e. interrupts all the tasks it created)
  2. In shutdown_runner loop over all processes and call .close()

Concerning keeping track of tasks associated with a Process

There isn't a general mechanism that automatically captures this and so the process may have to do it on a case-by-case basis. For example transport tasks will probably be important to look at and handle manually..

I.e. I assume we will need to make sure to use Process.add_cleanup

@ltalirz
Copy link
Member

ltalirz commented Mar 8, 2021

Perhaps this (closing the processes) should even go inside Runner.close()

def close(self) -> None:
"""Close the runner by stopping the loop."""
assert not self._closed
self.stop()
reset_event_loop_policy()
self._closed = True
def instantiate_process(self, process: TYPE_RUN_PROCESS, *args, **inputs):
from .utils import instantiate_process
return instantiate_process(self, process, *args, **inputs)

It seems to me that the runners currently don't keep track of the processes they are running, so one would need to add this accounting.
@sphuber is that correct? how would one loop over all processes that belong to a runner?

@chrisjsewell
Copy link
Member Author

@ltalirz I think you are one step behind me here lol:

the source of the issue may lie in AiiDA directly using .cancel() rather than going through the processes' built-in interruption mechanism. In particular the first one seems problematic.

This is what I already said in the original issues description: #4648 (comment)

It seems to me that the runners currently don't keep track of the processes they are running, so one would need to add this accounting. @sphuber is that correct? how would one loop over all processes that belong to a runner?

I'm already working on this, and a way to gracefully stop processes: aiidateam/plumpy#213

@ltalirz
Copy link
Member

ltalirz commented Mar 8, 2021

I see - sorry, I hadn't seen aiidateam/plumpy#213

Do we still need these changes in plumpy then https://github.com/aiidateam/plumpy/pull/214/files ?

Hm... maybe re-raising makes sense nevertheless...

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 8, 2021

Do we still need the changes in plumpy then aiidateam/plumpy#214 (files) ?
Hm... maybe re-raising makes sense nevertheless...

I think we might as well, because these changes just make sure the same thing is happening for python 3.7 and 3.8+, i.e. re-raising if asyncio.CancelledError is already happening for python 3.8+

@sphuber
Copy link
Contributor

sphuber commented Mar 8, 2021

There was no explicit accounting of what processes a runner was running because once stopped it cannot be started again. Typically then the Python interpreter as a whole would shut down anyway, because the daemon worker or interactive shell was stopping and so the tasks would be requeued automatically at some point by RabbitMQ.

@ltalirz
Copy link
Member

ltalirz commented Mar 8, 2021

Thanks for the explanation @sphuber !

I guess we nevertheless all agree that it would be useful if the runners could clean up after themselves, even if just to make it easier to pinpoint leaking tasks (or thinking e.g. of the test suite) and avoiding unexpected side effects of closing a runner.

As for this PR, I guess we can still proceed; the idea being that the fixes will be re-evaluated once the more controlled shutdown is in place.

@chrisjsewell
Copy link
Member Author

Just to note this is the commit that added this extra complexity in the stopping of the daemon worker: 281241c
Before this the daemon workers were just immediately killed

@chrisjsewell chrisjsewell requested a review from ltalirz March 9, 2021 01:41
@ltalirz
Copy link
Member

ltalirz commented Mar 9, 2021

Fine to approve once the tests pass (maybe restarting will be enough here)

@chrisjsewell
Copy link
Member Author

Fine to approve once the tests pass (maybe restarting will be enough here)

yeh just the stupid pymatgen issue

Anyhow, first I will now release a new version of plumpy, with the minimal fix aiidateam/plumpy#214 and update to that version in this PR, which should then close #4648, and unblock the 1.6 release

Then I can devote some more time to aiidateam/plumpy#213 and a more complete mechanism for stopping processes (intertwined with this issue of not excepting processes when a daemon worker loses connection with RMQ)

@ltalirz
Copy link
Member

ltalirz commented Mar 9, 2021

@chrisjsewell By the way, do you think it would be possible to test this?
I guess one can manually create a runner, start a task and then shut the runner down and check what happened to the process (?).

It would be great if we had some insurance against the bug resurfacing

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 9, 2021

By the way, do you think it would be possible to test this?

ermm you mean like this 😉: 6f041a5 (which did fail before the changes)

@ltalirz
Copy link
Member

ltalirz commented Mar 9, 2021

Looks great, thanks a lot!

Maybe you want to clean up the PR description now and have it close #4648 ?

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 9, 2021

Maybe you want to clean up the PR description now and have it close #4648 ?

yep will do 👍 (its already closing #4648) just waiting for the plumpy feedstock PR to appear

@chrisjsewell chrisjsewell marked this pull request as ready for review March 9, 2021 22:51
@chrisjsewell chrisjsewell merged commit b3f7080 into aiidateam:develop Mar 9, 2021
@chrisjsewell chrisjsewell deleted the fix-task-cancel branch March 9, 2021 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet