Commit 58f6abc

Merge pull request #817 from cylc/8.4.x-sync
🤖 Merge 8.4.x-sync into master
2 parents c0f07b9 + 5e38a66 commit 58f6abc

6 files changed (+138, -89 lines changed)

src/glossary.rst

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -55,6 +55,7 @@ Glossary
5555

5656
.. seealso::
5757

58+
* :ref:`Tutorial <tutorial.retries>`
5859
* :ref:`Cylc User Guide <TaskRetries>`
5960

6061

src/tutorial/furthertopics/retries.rst

Lines changed: 16 additions & 5 deletions
@@ -1,6 +1,12 @@
1+
.. _tutorial.retries:
2+
13
Retries
24
=======
35

6+
.. seealso::
7+
8+
:ref:`Cylc User Guide <TaskRetries>`
9+
410
Retries allow us to automatically re-submit tasks whose jobs have failed
511
during submission or execution.
612

@@ -102,13 +108,16 @@ This means that if the ``roll_doubles`` task fails, Cylc expects to
102108
retry running it 5 times before finally failing. Each retry will have
103109
a delay of 6 seconds.
104110

105-
We can apply multiple retry periods with the ``execution retry delays`` setting
106-
by separating them with commas, for example the following line would tell Cylc
107-
to retry a task four times, once after 15 seconds, then once after 10 minutes,
108-
then once after one hour then once after three hours.
111+
We can apply multiple retry periods with the
112+
`execution retry delays <[runtime][<namespace>]execution retry delays>` setting
113+
by separating them with commas, e.g.:
109114

110115
.. code-block:: cylc
111116
117+
# If the task fails, wait 15 seconds, then retry it.
118+
# If the retry fails, wait a further 10 minutes, then retry it again.
119+
# If the second retry fails, wait a further 1 hour, then retry it again.
120+
# If the third retry fails, wait a further 3 hours, then retry it again.
112121
execution retry delays = PT15S, PT10M, PT1H, PT3H
113122
114123
@@ -158,4 +167,6 @@ This time, the task should definitely succeed before the third retry.
158167
Further Reading
159168
---------------
160169

161-
For more information see the `Cylc User Guide`_.
170+
* :ref:`Cylc User Guide <TaskRetries>`
171+
* `[runtime][<namespace>]execution retry delays`.
172+
* `[runtime][<namespace>]submission retry delays`.

src/tutorial/runtime/runtime-configuration.rst

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -186,6 +186,11 @@ Jobs can fail for several reasons:
186186
left. Otherwise they return to the waiting state, to wait on the next try.
187187

188188

189+
.. seealso::
190+
191+
* :ref:`Tutorial <tutorial.retries>`.
192+
* :ref:`User Guide <TaskRetries>`.
193+
189194

190195
.. _tutorial.start_stop_restart:
191196

src/user-guide/running-workflows/index.rst

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,6 @@ Running Workflows
88

99
scheduler-start-up
1010
tasks-jobs-ui
11-
retrying-tasks
1211
tracking-task-state
1312
workflow-completion
1413
reflow
Lines changed: 3 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -1,51 +1,6 @@
1+
:orphan:
2+
13
Retrying Tasks
24
==============
35

4-
.. versionchanged:: 8.0.0
5-
6-
Tasks that fail but are configured to :term:`retry` return to the ``waiting``
7-
state, with a new clock trigger to handle the configured retry delay.
8-
9-
.. note::
10-
11-
A task that is waiting on a retry will already have one or more failed jobs
12-
associated with it.
13-
14-
15-
.. note::
16-
17-
Tasks only enter the ``submit-failed`` state if job submission fails with no
18-
retries left. Otherwise they return to the waiting state, to wait on the
19-
next try.
20-
21-
Tasks only enter the ``failed`` state if job execution fails with no retries
22-
left. Otherwise they return to the waiting state, to wait on the next try.
23-
24-
25-
26-
Aborting a Retry Sequence
27-
-------------------------
28-
29-
To prevent a task from retrying, remove it from the scheduler's
30-
:term:`active window`. For a task ``3/foo`` in workflow ``brew``:
31-
32-
.. code-block:: console
33-
34-
$ cylc remove brew//3/foo
35-
36-
If you *kill* a running task that has more retries configured, it goes to the
37-
``held`` state so you can decide whether to release it and continue the retry
38-
sequence, or remove it.
39-
40-
.. code-block:: console
41-
42-
$ cylc kill brew//3/foo # 3/foo goes to held state post kill
43-
$ cylc release brew//3/foo # release to continue retrying...
44-
$ cylc remove brew//3/foo # ... OR remove the task to stop retries
45-
46-
47-
If you want trigger downstream tasks despite ``3/foo`` being removed before it
48-
could succeed, use ``cylc set`` to artificially mark its
49-
:term:`required outputs <required output>`
50-
as complete (and with the ``--flow`` option, if needed to make a specific
51-
:term:`flow` continue on from there).
6+
This section has moved to :ref:`TaskRetries`.

src/user-guide/writing-workflows/runtime.rst

Lines changed: 113 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -511,60 +511,138 @@ adding empty runtime placeholders instead of allowing implicit tasks:
511511
512512
.. _TaskRetries:
513513

514-
Task Retry On Failure
515-
---------------------
514+
Automatically Retrying Tasks
515+
----------------------------
516516

517-
.. seealso::
517+
.. tutorial:: tutorial.retries
518518

519-
:cylc:conf:`[runtime][<namespace>]execution retry delays`.
519+
Cylc can be configured to automatically resubmit (i.e. retry) jobs which failed
520+
or submit-failed using these task configurations:
520521

521-
Tasks can have a list of :term:`ISO8601 durations <ISO8601 duration>` as retry
522-
intervals. If the job fails the task will return to the ``waiting`` state
523-
with a clock-trigger configured with the next retry delay.
522+
.. cylc-scope:: flow.cylc[runtime][<namespace>]
524523

524+
`execution retry delays`
525+
Configure retries for jobs which failed during execution (failed jobs - |job-failed|).
526+
`submission retry delays`
527+
Configure retries for jobs which failed during submission so never ran
528+
(submit-failed jobs - |job-submit-failed|).
525529

526-
.. note::
530+
Retry delays should be set to a list of
531+
:term:`ISO8601 durations <ISO8601 duration>` that specify how long to wait
532+
before retrying the task, e.g.:
533+
534+
.. code-block:: cylc
535+
536+
[runtime]
537+
[[my-task]]
538+
script = do-something
539+
540+
# If the job fails, wait 30 seconds, then try again
541+
execution retry delays = PT30S
527542
528-
Tasks only enter the ``submit-failed`` state if job submission fails with no
529-
retries left. Otherwise they return to the waiting state, to wait on the
530-
next try.
543+
# If the job submit-fails, wait one minute then try again.
544+
# If the retry submit-fails, wait a further 5 minutes, then try again.
545+
# If the second retry submit-fails, wait a further 15 minutes, then try again.
546+
submission retry delays = PT1M, PT5M, PT15M
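
Where several consecutive retries share the same delay, the list can be
shortened with a multiplier, as in the ``3*PT10S`` setting that the earlier
version of this section used for three retries with a 10 second delay. A
minimal sketch, assuming a hypothetical task named ``my-task``:

.. code-block:: cylc

   [runtime]
       [[my-task]]
           script = do-something

           # Retry up to three times, waiting 10 seconds before each retry.
           # Written out in full this would be: PT10S, PT10S, PT10S
           execution retry delays = 3*PT10S
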
531547
532-
Tasks only enter the ``failed`` state if job execution fails with no retries
533-
left. Otherwise they return to the waiting state, to wait on the next try.
534548
549+
Details
550+
^^^^^^^
551+
552+
For a task with execution / submission retries configured:
553+
554+
* When a job fails or submit-fails, the task will change back into the
555+
``waiting`` state |task-waiting| and a retry will be scheduled.
556+
* The task will not enter the failed or submit-failed state until all retries
557+
have been exhausted. This means that graph triggers
558+
(e.g. ``foo:failed => bar``) and `task events <flow.cylc[runtime][<namespace>][events]>`
559+
(e.g. `[events]failed handlers`) will not be run until the task runs out of
560+
retries (rather than after the first failure / submission-failure) and will
561+
not be run if the retry subsequently succeeds.
562+
* The :ref:`$CYLC_TASK_TRY_NUMBER <Task Job Script Variables>`
563+
environment variable increments with each
564+
automatic submission, allowing you to vary task behaviour between retries.
565+
566+
.. cylc-scope::
567+
568+
.. versionchanged:: 8.0.0
569+
570+
Tasks that fail but are configured to :term:`retry` return to the ``waiting``
571+
state, with a new clock trigger to handle the configured retry delay.
572+
573+
.. note::
574+
575+
A task that is waiting on a retry will already have one or more failed jobs
576+
associated with it.
535577

536578

537-
In the following example, tasks ``bad`` and ``flaky`` each have 3 retries
538-
configured, with a 10 second delay between. On the final try, ``bad`` fails
539-
again and goes to the ``failed`` state, while ``flaky`` succeeds and triggers
540-
task ``whizz`` downstream. The scheduler will then stall because
541-
``bad`` failed (which is a :term:`final status`) with incomplete outputs.
579+
Advanced Example
580+
^^^^^^^^^^^^^^^^
542581

543582
.. code-block:: cylc
544583
545584
[scheduling]
546585
[[graph]]
547586
R1 = """
548-
bad => cheese
549-
flaky => whizz
550-
"""
551-
[runtime]
552-
[[bad]]
553-
# retry 3 times then fail
554-
script = """
555-
sleep 10
556-
false
587+
# If task "a" succeeds in three attempts or fewer, then run the
588+
# task "continue":
589+
a:succeed? => continue
590+
591+
# If task "a" still fails after two retries, then run "recover":
592+
a:fail? => recover
557593
"""
558-
execution retry delays = 3*PT10S
559-
[[flaky]]
560-
# retry 3 times then succeed
594+
595+
[runtime]
596+
[[a]]
561597
script = """
562-
sleep 10
563-
test $CYLC_TASK_TRY_NUMBER -gt 3
598+
if [[ $CYLC_TASK_TRY_NUMBER -eq 1 ]]; then
599+
# this is not an automatic retry
600+
export DEBUG=false
601+
else
602+
# this is a retry -> turn on some extra debugging
603+
export DEBUG=true
604+
fi
605+
do-something
564606
"""
565-
execution retry delays = 3*PT10S
566-
[[cheese, whizz]]
567-
script = "sleep 10"
607+
608+
# Schedule two retries for this task:
609+
# * The first retry will happen one minute after the task fails.
610+
# * The second retry will happen three minutes after the first retry
611+
# fails.
612+
execution retry delays = PT1M, PT3M
613+
614+
[[[events]]]
615+
# These "failed" task events will only be actioned if the task
616+
# has exhausted all of its retries:
617+
mail events = failed
618+
failed handlers = my-task-event-handler
619+
620+
621+
Aborting a Retry Sequence
622+
^^^^^^^^^^^^^^^^^^^^^^^^^
623+
624+
To prevent a task from retrying, remove it from the scheduler's
625+
:term:`active window`, e.g.:
626+
627+
.. code-block:: console
628+
629+
$ cylc remove <workflow>//3/foo # remove task 3/foo, preventing it from retrying
630+
631+
If you *kill* a running task that has more retries configured, it goes to the
632+
``held`` state |task-held| so you can decide whether to release it and continue
633+
the retry sequence, or remove it.
634+
635+
.. code-block:: console
636+
637+
$ cylc kill brew//3/foo # 3/foo goes to held state post kill
638+
$ cylc release brew//3/foo # release to continue retrying...
639+
$ cylc remove brew//3/foo # ... OR remove the task to stop retries
640+
641+
If you want to trigger downstream tasks despite ``3/foo`` being removed before it
642+
could succeed, use ``cylc set`` to artificially mark its
643+
:term:`required outputs <required output>`
644+
as complete (and with the ``--flow`` option, if needed to make a specific
645+
:term:`flow` continue on from there).
568646

569647

570648
.. _user_guide.runtime.task_event_handling:
