Closed
Description
Edit: this issue only occurs in Python 3.7, so until the issue is fully resolved, a "fix" is to upgrade to Python 3.8.
Steps to reproduce
Steps to reproduce the behavior:
- Submit a job to the daemon
- Start the daemon, if not already running
- Stop or restart the daemon
Expected behavior
The daemon stops in a reasonably short time, and the job is "frozen" and will safely continue when the daemon restarts.
Actual problematic behaviour
Instead, when stopping and/or restarting the daemon, I get a TIMEOUT.
Then, I get things like this from verdi process list:
  PK  Created    Process label    Process State    Process status
----  ---------  ---------------  ---------------  -------------------------------------
 184  20h ago    PwCalculation    ⨯ Excepted       Transport task update was interrupted
 189  1h ago     PwCalculation    ⨯ Excepted       Transport task submit was interrupted
and verdi process report shows this:
$ verdi process report 184
*** 184: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 3 LOG MESSAGES:
+-> ERROR at 2021-01-08 18:53:56.557646+00:00
| Traceback (most recent call last):
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 171, in exponential_backoff_retry
| result = await coro()
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 177, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 87, in with_interrupt
| result = await next(wait_iter)
| File "/usr/lib/python3.7/asyncio/tasks.py", line 560, in _wait_for_one
| return f.result() # May raise f.exception().
| concurrent.futures._base.CancelledError
+-> REPORT at 2021-01-08 18:53:36.744091+00:00
| [184|PwCalculation|on_except]: Traceback (most recent call last):
| File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1072, in step
| next_state = await self._run_task(self._state.execute)
| File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 498, in _run_task
| result = await coro(*args, **kwargs)
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 358, in execute
| job_done = await self._launch_task(task_update_job, node, self.process.runner.job_manager)
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 394, in _launch_task
| result = await self._task
| File "/usr/lib/python3.7/asyncio/tasks.py", line 318, in __wakeup
| future.result()
| concurrent.futures._base.CancelledError
+-> ERROR at 2021-01-08 18:53:36.535279+00:00
| Traceback (most recent call last):
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 171, in exponential_backoff_retry
| result = await coro()
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 177, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 87, in with_interrupt
| result = await next(wait_iter)
| File "/usr/lib/python3.7/asyncio/tasks.py", line 556, in _wait_for_one
| f = await done.get()
| File "/usr/lib/python3.7/asyncio/queues.py", line 159, in get
| await getter
| File "/usr/lib/python3.7/asyncio/futures.py", line 263, in __await__
| yield self # This tells Task to wait for completion.
| File "/usr/lib/python3.7/asyncio/tasks.py", line 318, in __wakeup
| future.result()
| File "/usr/lib/python3.7/asyncio/futures.py", line 176, in result
| raise CancelledError
| concurrent.futures._base.CancelledError
Further notes
I think this is actually the main problem that also manifested itself in #4595 and #4345.
I think that many interruptions that should be "safe" are instead treated as exceptions (here in the specific case of stopping the daemon, but possibly also in other cases such as SSH errors, maybe as in #4345).
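One possible explanation for the Python 3.7 dependence, which is not confirmed in this issue: in Python 3.7, asyncio.CancelledError is concurrent.futures.CancelledError (the class shown in the tracebacks above) and derives from Exception, whereas from Python 3.8 it derives from BaseException. A retry loop that catches Exception would therefore treat a cancellation as a failure on 3.7 only. The sketch below illustrates the idea with a hypothetical `retry` helper. It is not aiida-core's `exponential_backoff_retry`, and `slow_update` is only a stand-in for a transport task.

```python
import asyncio


async def retry(coro_factory, max_attempts=3, delay=0.01):
    """Hypothetical retry helper, for illustration only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except asyncio.CancelledError:
            # Cancellation is a "safe" interruption: propagate it untouched.
            # On Python 3.7 CancelledError subclasses Exception, so without
            # this clause the generic handler below would catch it.
            raise
        except Exception:
            # Genuine failures are retried until the attempts run out.
            if attempt == max_attempts:
                raise
            await asyncio.sleep(delay)


async def demo():
    attempts = 0

    async def slow_update():
        # Stand-in for a long-running transport task.
        nonlocal attempts
        attempts += 1
        await asyncio.sleep(10)

    task = asyncio.ensure_future(retry(slow_update))
    await asyncio.sleep(0.05)  # let the task start
    task.cancel()  # stand-in for the daemon shutting down
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled(), attempts


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With the explicit `except asyncio.CancelledError: raise` clause, the task ends up cancelled after a single attempt on both 3.7 and 3.8+, instead of the cancellation being retried or reported as an exception.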