Commit 2267feb

user guide: flesh out workflow completion / stall section
1 parent 5a244ef commit 2267feb

5 files changed

+104
-16
lines changed

src/glossary.rst

Lines changed: 5 additions & 1 deletion
@@ -1373,7 +1373,7 @@ Glossary
13731373
Every :term:`task` has a set of standard :term:`outputs <task output>`
13741374
that trigger :term:`task state` changes:
13751375

1376-
- ``:expired```
1376+
- ``:expired``
13771377
- ``:submitted``, or ``:submit-failed``
13781378
- ``:started``
13791379
- ``:succeeded``, or ``:failed``
@@ -1601,6 +1601,10 @@ Glossary
16011601
User intervention is required to fix a stall, e.g. by retriggering
16021602
tasks after fixing the problems that caused them to fail.
16031603

1604+
.. seealso::
1605+
1606+
* :ref:`Cylc User Guide <scheduler stall>`
1607+
16041608
16051609
suicide trigger
16061610
Suicide triggers remove tasks from the :term:`n=0 window <n-window>`.

src/img/gui-stall.png

218 KB

src/reference/changes.rst

Lines changed: 2 additions & 0 deletions
@@ -63,6 +63,8 @@ Technical details:
6363
removed if necessary to allow the tasks to re-run in order.
6464

6565

66+
.. _changes.warning_triangles:
67+
6668
Warning Triangles
6769
^^^^^^^^^^^^^^^^^
6870

src/user-guide/running-workflows/workflow-completion.rst

Lines changed: 95 additions & 15 deletions
@@ -3,28 +3,92 @@
33
Workflow Completion
44
===================
55

6-
A workflow can :term:`shut down <shutdown>` once all
7-
:term:`active tasks <active task>` complete without spawning further
8-
downstream activity - i.e., when :term:`n=0 window <n-window>` empties out.
6+
Once Cylc has run all of the tasks in the :term:`graph` (i.e. once it has
7+
reached the end of the workflow and there are no tasks left), the workflow
8+
will shut down automatically.
9+
10+
A workflow with no tasks left is said to have "completed".
11+
12+
When you restart a workflow, it will restart in the same state it shut down in.
13+
So if you restart a completed workflow (one with no remaining tasks), it will
14+
come back with no tasks. Having no more tasks to run, the workflow will
15+
automatically shut down after the configured
16+
:cylc:conf:`restart timeout <[scheduler][events]restart timeout>`.
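
As a minimal sketch, the restart timeout might be set in ``flow.cylc`` like
this. The ``PT10M`` value (an ISO 8601 duration of ten minutes) is an
illustrative assumption, not a default or a recommendation:

.. code-block:: cylc

   [scheduler]
       [[events]]
           # Illustrative value only: how long a restarted workflow with
           # no tasks left to run stays up before shutting down again.
           restart timeout = PT10M

A longer timeout leaves more time to intervene after restarting a completed
workflow.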
17+
18+
If you want to re-run some tasks in a completed workflow, restart the workflow
19+
then
20+
:ref:`re-trigger the selected tasks <interventions.re-run-multiple-tasks>`
21+
or :ref:`trigger a new flow <interventions.reflow>` to run through the graph
22+
(before the restart timeout passes).
23+
24+
A common pattern is to restart a completed workflow and extend it for a few
25+
cycles. The easiest way to achieve this is to use the
26+
:cylc:conf:`stop after cycle point <[scheduling]stop after cycle point>`
27+
rather than the
28+
:cylc:conf:`final cycle point <[scheduling]final cycle point>`. This prevents
29+
the workflow from completing, making it easier to restart it from where it
30+
left off. For a worked example, see :ref:`examples.extending-workflow`.
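
The pattern above could look something like the following ``[scheduling]``
fragment. The cycle points, the recurrence and the task name ``foo`` are
hypothetical, chosen only for illustration, and the ``[runtime]`` section is
omitted:

.. code-block:: cylc

   [scheduling]
       # Hypothetical cycle points, for illustration only.
       initial cycle point = 2020
       # No final cycle point is set, so the graph itself does not end.
       # The scheduler stops after this cycle without the workflow
       # being "completed".
       stop after cycle point = 2025
       [[graph]]
           P1Y = foo[-P1Y] => foo

Because the workflow stops before reaching the end of the graph, tasks remain
for later cycles and the workflow can be extended on restart.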
31+
932

1033
.. _scheduler stall:
1134

1235
Scheduler Stall
1336
===============
1437

15-
A workflow has stalled if:
38+
If Cylc is unable to make progress through the :term:`graph` (i.e. if the path
39+
through the graph is "blocked"), then the workflow is considered
40+
:term:`stalled <stall>`.
1641

17-
* No tasks are waiting on unsatisfied external events, like clock triggers and xtriggers.
18-
* AND All activity has ceased.
19-
* AND The workflow has not run to completion.
42+
Stalls are usually caused by unexpected task failures.
2043

21-
A workflow which has stalled requires manual intervention to continue.
44+
A stalled workflow has not run to completion but cannot continue without manual
45+
intervention. Typically this involves
46+
:ref:`fixing and rerunning a failed task <interventions.edit-a-tasks-configuration>`.
47+
48+
49+
Stall Conditions
50+
----------------
51+
52+
A workflow has stalled if:
53+
54+
* The workflow has not run to completion (i.e. there are still tasks left
55+
for Cylc to run).
56+
* AND no tasks are waiting on unsatisfied
57+
:ref:`external events <Section External Triggers>` (e.g. clock triggers
58+
and xtriggers).
59+
* AND all activity has ceased (i.e. no preparing, submitted or running tasks).
2260

2361
Stalls are caused by :term:`final status incomplete tasks <output completion>`
2462
and :term:`partially satisfied tasks <prerequisite>`.
2563

2664
These most often result from task failures that the workflow does not
27-
handle automatically by retries or optional branching.
65+
handle automatically by :term:`retries <retry>` or :term:`graph branching`.
66+
67+
68+
Diagnosing Stalls
69+
-----------------
70+
71+
A screenshot of the Cylc GUI displaying a stalled workflow:
72+
73+
.. image:: ../../img/gui-stall.png
74+
:align: center
75+
:width: 90%
76+
77+
|
78+
79+
In the above screenshot:
80+
81+
* The stall was caused by the failure of the task ``2/a``.
82+
* The stall event is recorded in the :term:`workflow log` file (shown on the
83+
right) along with the list of :term:`incomplete tasks <output completion>`
84+
that caused it (``2/a`` did not complete the required outputs: ``succeeded``).
85+
* In the GUI, the :ref:`warning triangle <changes.warning_triangles>`
86+
will light up to notify you of the error; hover over it to see the log
87+
messages.
88+
89+
90+
Stall Timeouts
91+
--------------
2892

2993
A stalled scheduler stays alive for a configurable timeout period
3094
to allow you to intervene, e.g. by manually triggering an incomplete
@@ -34,12 +98,28 @@ If a stalled workflow does eventually shut down, on the stall timeout
3498
or by stop command, it will immediately stall again on restart to await
3599
manual intervention.
36100

37-
.. warning::
101+
Stall timeout behaviour is controlled by the following configurations:
102+
103+
.. admonition:: Configuration
104+
:class: note
105+
106+
:cylc:conf:`[scheduler][events]stall timeout`
107+
The length of time before a stalled workflow will shut down.
108+
:cylc:conf:`[scheduler][events]abort on stall timeout`
109+
Whether the scheduler should shut down immediately with error status if
110+
the stall timeout is reached.
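
For illustration, these two settings might be combined in ``flow.cylc`` as
sketched below. The values shown are assumptions chosen for the example, not
defaults:

.. code-block:: cylc

   [scheduler]
       [[events]]
           # Illustrative value: keep a stalled scheduler alive for one
           # hour (ISO 8601 duration) to allow manual intervention.
           stall timeout = PT1H
           # Illustrative value: shut down with error status once the
           # stall timeout is reached.
           abort on stall timeout = True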
111+
112+
113+
Stall Events
114+
------------
38115

39-
Look in the :term:`scheduler log` to see which tasks caused a stall.
116+
Cylc emits the :ref:`stall <user_guide.workflow_events.stall>` event when a
117+
scheduler stalls.
40118

41-
.. seealso::
119+
.. admonition:: Configuration
120+
:class: note
42121

43-
* :cylc:conf:`[scheduler][events]stall timeout`
44-
* :cylc:conf:`[scheduler][events]abort on stall timeout`
45-
* :cylc:conf:`[scheduler][events]stall handlers`
122+
:cylc:conf:`[scheduler][events]mail events = stall`
123+
Configure emails for stall events.
124+
:cylc:conf:`[scheduler][events]stall handlers`
125+
Configure custom event handlers to run on stall events.
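
A minimal sketch of both settings together. The handler ``notify-stall.sh`` is
a hypothetical script name used only for illustration; it is not provided by
Cylc:

.. code-block:: cylc

   [scheduler]
       [[events]]
           # Send an email when the workflow stalls.
           mail events = stall
           # Hypothetical custom handler to run on the stall event.
           stall handlers = notify-stall.sh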

src/user-guide/writing-workflows/scheduler.rst

Lines changed: 2 additions & 0 deletions
@@ -92,6 +92,8 @@ Some workflow events have related configurations e.g. for setting the timeout.
9292

9393
.. describe:: stall
9494

95+
.. _user_guide.workflow_events.stall:
96+
9597
:Event Handler: `stall handlers`
9698

9799
The workflow :term:`stalled <stall>` (i.e. the scheduler cannot make any
