Restarting the daemon excepts all jobs (aiida-core 1.6, python 3.7) #4648

@giovannipizzi

Description

Edit: this issue only occurs in Python 3.7, so until it is fully resolved, a workaround is to upgrade to Python 3.8.


Steps to reproduce

Steps to reproduce the behavior:

  1. Submit a job to the daemon
  2. Start the daemon, if not already running
  3. Stop or restart the daemon

Expected behavior

The daemon stops in a reasonably short time, and the job is "frozen" and will safely continue when the daemon restarts.

Actual problematic behaviour

Instead, when stopping and/or restarting the daemon, I get a TIMEOUT.
Then verdi process list shows entries like this:

  PK  Created    Process label    Process State    Process status
----  ---------  ---------------  ---------------  -------------------------------------
 184  20h ago    PwCalculation    ⨯ Excepted       Transport task update was interrupted
 189  1h ago     PwCalculation    ⨯ Excepted       Transport task submit was interrupted

and verdi process report shows this:

$ verdi process report 184
*** 184: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 3 LOG MESSAGES:
+-> ERROR at 2021-01-08 18:53:56.557646+00:00
 | Traceback (most recent call last):
 |   File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 171, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 177, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 87, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/usr/lib/python3.7/asyncio/tasks.py", line 560, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 | concurrent.futures._base.CancelledError
+-> REPORT at 2021-01-08 18:53:36.744091+00:00
 | [184|PwCalculation|on_except]: Traceback (most recent call last):
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1072, in step
 |     next_state = await self._run_task(self._state.execute)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 498, in _run_task
 |     result = await coro(*args, **kwargs)
 |   File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 358, in execute
 |     job_done = await self._launch_task(task_update_job, node, self.process.runner.job_manager)
 |   File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 394, in _launch_task
 |     result = await self._task
 |   File "/usr/lib/python3.7/asyncio/tasks.py", line 318, in __wakeup
 |     future.result()
 | concurrent.futures._base.CancelledError
+-> ERROR at 2021-01-08 18:53:36.535279+00:00
 | Traceback (most recent call last):
 |   File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 171, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 177, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 87, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/usr/lib/python3.7/asyncio/tasks.py", line 556, in _wait_for_one
 |     f = await done.get()
 |   File "/usr/lib/python3.7/asyncio/queues.py", line 159, in get
 |     await getter
 |   File "/usr/lib/python3.7/asyncio/futures.py", line 263, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/usr/lib/python3.7/asyncio/tasks.py", line 318, in __wakeup
 |     future.result()
 |   File "/usr/lib/python3.7/asyncio/futures.py", line 176, in result
 |     raise CancelledError
 | concurrent.futures._base.CancelledError
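
All three log entries end in `concurrent.futures._base.CancelledError`. One known difference between Python 3.7 and 3.8 is that `asyncio.CancelledError` was moved from being a subclass of `Exception` to a subclass of `BaseException`, so on 3.7 a generic `except Exception` handler also catches task cancellation. Whether that is the actual cause here is an assumption and is not confirmed in this report. A minimal check for the running interpreter (the helper name is made up for illustration):

```python
import asyncio
import sys


def cancelled_error_is_exception():
    """Return True if a generic `except Exception` would also catch task cancellation."""
    return issubclass(asyncio.CancelledError, Exception)


if __name__ == '__main__':
    # Prints the interpreter's (major, minor) version and the result of the check.
    print(sys.version_info[:2], cancelled_error_is_exception())
```

If this returns `True`, any code path that wraps an awaited task in `except Exception` will see a cancellation as an ordinary error.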

Further notes

I think this is actually the main problem that also manifested itself in #4595 and #4345.

I think that many interruptions that should be "safe" are instead treated as exceptions (here in the specific case of stopping the daemon, but also in other cases such as SSH errors, possibly as in #4345).
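
To illustrate what treating an interruption as "safe" could look like, here is a minimal sketch of a retry helper that lets cancellation propagate instead of counting it as a failure. This is not the code in `aiida/engine/utils.py`; `retry_unless_cancelled`, `slow_update`, and `run_demo` are hypothetical names used only for this example.

```python
import asyncio


async def retry_unless_cancelled(coro_factory, max_attempts=3, delay=0.0):
    """Retry `coro_factory()` on errors, but let cancellation propagate untouched."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except asyncio.CancelledError:
            # An interruption (e.g. a daemon shutdown) is not a failure: re-raise at once.
            # This clause must come before `except Exception`, because on Python 3.7
            # CancelledError is a subclass of Exception.
            raise
        except Exception:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(delay)


async def _demo():
    async def slow_update():
        # Stand-in for a long-running transport task.
        await asyncio.sleep(3600)

    task = asyncio.ensure_future(retry_unless_cancelled(slow_update))
    await asyncio.sleep(0)  # give the task a chance to start
    task.cancel()           # simulate the daemon being stopped
    try:
        await task
    except asyncio.CancelledError:
        return 'interrupted'
    return 'finished'


def run_demo():
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(_demo())
    finally:
        loop.close()
```

With this ordering of `except` clauses the caller can distinguish an interruption from a genuine error on both Python 3.7 and 3.8, and decide to leave the process paused rather than marking it as excepted.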

@muhrin @sphuber @unkcpz @chrisjsewell
