@@ -511,60 +511,138 @@ adding empty runtime placeholders instead of allowing implicit tasks:
511511
512512 .. _TaskRetries:
513513
514- Task Retry On Failure
515- ---------------------
514+ Automatically Retrying Tasks
515+ ----------------------------
516516
517- .. seealso ::
517+ .. tutorial:: tutorial.retries
518518
519- :cylc:conf: `[runtime][<namespace>]execution retry delays `.
519+ Cylc can be configured to automatically resubmit (i.e., retry) jobs that failed
520+ or submit-failed, using these task configurations:
520521
521- Tasks can have a list of :term: `ISO8601 durations <ISO8601 duration> ` as retry
522- intervals. If the job fails the task will return to the ``waiting `` state
523- with a clock-trigger configured with the next retry delay.
522+ .. cylc-scope:: flow.cylc[runtime][<namespace>]
524523
524+ `execution retry delays`
525+ Configure retries for jobs which failed during execution (failed jobs - |job-failed|).
526+ `submission retry delays`
527+ Configure retries for jobs which failed during submission so never ran
528+ (submit-failed jobs - |job-submit-failed|).
525529
526- .. note ::
530+ Retry delays should be set to a list of
531+ :term:`ISO8601 durations <ISO8601 duration>` that specify how long to wait
532+ before retrying the task again, e.g.:
533+
534+ .. code-block:: cylc
535+
536+ [runtime]
537+ [[my-task]]
538+ script = do-something
539+
540+ # If the job fails, wait 30 seconds, then try again
541+ execution retry delays = PT30S
527542
528- Tasks only enter the ``submit-failed `` state if job submission fails with no
529- retries left. Otherwise they return to the waiting state, to wait on the
530- next try.
543+ # If the job submit-fails, wait one minute then try again.
544+ # If the retry submit-fails, wait a further 5 minutes, then try again.
545+ # If the second retry submit-fails, wait a further 15 minutes, then try again.
546+ submission retry delays = PT1M, PT5M, PT15M
531547
532- Tasks only enter the ``failed `` state if job execution fails with no retries
533- left. Otherwise they return to the waiting state, to wait on the next try.
534548
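As a rough illustration of how a delay list maps onto retries, the sketch
below expands a list into one delay per retry, in the order the delays would
be used. ``expand_delays`` is a hypothetical helper written for this example,
not part of Cylc, and it assumes the ``N*duration`` multiplier shorthand
(e.g. ``3*PT10S``) means "repeat this delay N times".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cylc): expand a retry-delay list such as
# "2*PT10S, PT1M" into one delay per retry, separated by spaces.
expand_delays() {
    local spec="$1" item count delay i
    local -a items out=()
    # Split on commas without triggering filename globbing on the "*".
    IFS=',' read -ra items <<< "$spec"
    for item in "${items[@]}"; do
        item="${item//[[:space:]]/}"      # strip whitespace around the entry
        if [[ "$item" == *'*'* ]]; then
            count="${item%%\**}"          # text before the "*": repeat count
            delay="${item#*\*}"           # text after the "*": the duration
        else
            count=1
            delay="$item"
        fi
        for ((i = 0; i < count; i++)); do
            out+=("$delay")
        done
    done
    echo "${out[*]}"
}

# Usage: list which delay precedes each retry.
try=2   # try 1 is the original submission
for delay in $(expand_delays "PT1M, PT5M, PT15M"); do
    echo "try ${try}: submitted ${delay} after the previous failure"
    try=$((try + 1))
done
```

The number of entries in the expanded list is the number of retries, so a
list of three delays allows up to four submissions in total.
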
549+ Details
550+ ^^^^^^^
551+
552+ For a task with execution / submission retries configured:
553+
554+ * When a job fails or submit-fails, the task will change back into the
555+ ``waiting`` state |task-waiting| and a retry will be scheduled.
556+ * The task will not enter the failed or submit-failed state until all retries
557+ have been exhausted. This means that graph triggers
558+ (e.g. ``foo:failed => bar``) and `task events <flow.cylc[runtime][<namespace>][events]>`
559+ (e.g. `[events]failed handlers`) will not be run until the task runs out of
560+ retries (rather than after the first failure / submission-failure) and will
561+ not be run if the retry subsequently succeeds.
562+ * The :ref:`$CYLC_TASK_TRY_NUMBER <Task Job Script Variables>`
563+ environment variable increments with each
564+ automatic submission, allowing you to vary task behaviour between retries.
565+
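The last point can be sketched as a small shell function that a job script
might call. ``log_level_for_try`` and the level names are hypothetical, chosen
only for illustration. The sketch assumes the try number starts at 1 for the
original submission, and falls back to 1 when ``CYLC_TASK_TRY_NUMBER`` is
unset (for instance when the script is run by hand outside Cylc).

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a logging level from the try number.
# Uses the first argument if given, else $CYLC_TASK_TRY_NUMBER, else 1.
log_level_for_try() {
    local try="${1:-${CYLC_TASK_TRY_NUMBER:-1}}"
    if (( try <= 1 )); then
        echo "info"     # original submission: normal logging
    else
        echo "debug"    # any retry: more detail to help diagnose the failure
    fi
}

# Usage inside a job script: export the level for the real command to read.
LOG_LEVEL="$(log_level_for_try)"
export LOG_LEVEL
```

Keeping the decision in one function makes it easy to test the script
outside the scheduler by passing the try number as an argument.
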
566+ .. cylc-scope::
567+
568+ .. versionchanged:: 8.0.0
569+
570+ Tasks that fail but are configured to :term:`retry` return to the ``waiting``
571+ state, with a new clock trigger to handle the configured retry delay.
572+
573+ .. note::
574+
575+ A task that is waiting on a retry will already have one or more failed jobs
576+ associated with it.
535577
536578
537- In the following example, tasks ``bad `` and ``flaky `` each have 3 retries
538- configured, with a 10 second delay between. On the final try, ``bad `` fails
539- again and goes to the ``failed `` state, while ``flaky `` succeeds and triggers
540- task ``whizz `` downstream. The scheduler will then stall because
541- ``bad `` failed (which is a :term: `final status `) with incomplete outputs.
579+ Advanced Example
580+ ^^^^^^^^^^^^^^^^
542581
543582 .. code-block:: cylc
544583
545584 [scheduling]
546585 [[graph]]
547586 R1 = """
548- bad => cheese
549- flaky => whizz
550- """
551- [runtime]
552- [[bad]]
553- # retry 3 times then fail
554- script = """
555- sleep 10
556- false
587+ # If task "a" succeeds in three attempts or fewer, then run the
588+ # task "continue":
589+ a:succeed? => continue
590+
591+ # If task "a" still fails after two retries, then run "recover":
592+ a:fail? => recover
557593 """
558- execution retry delays = 3*PT10S
559- [[flaky] ]
560- # retry 3 times then succeed
594+
595+ [runtime]
596+ [[a]]
561597 script = """
562- sleep 10
563- test $CYLC_TASK_TRY_NUMBER -gt 3
598+ if [[ $CYLC_TASK_TRY_NUMBER -eq 1 ]]; then
599+ # this is not an automatic retry
600+ export DEBUG=false
601+ else
602+ # this is a retry -> turn on some extra debugging
603+ export DEBUG=true
604+ fi
605+ do-something
564606 """
565- execution retry delays = 3*PT10S
566- [[cheese, whizz]]
567- script = "sleep 10"
607+
608+ # Schedule two retries for this task:
609+ # * The first retry will happen one minute after the task fails.
610+ # * The second retry will happen three minutes after the first retry
611+ # fails.
612+ execution retry delays = PT1M, PT3M
613+
614+ [[[events]]]
615+ # These "failed" task events will only be actioned if the task
616+ # has exhausted all of its retries:
617+ mail events = failed
618+ failed handlers = my-task-event-handler
619+
620+
621+ Aborting a Retry Sequence
622+ ^^^^^^^^^^^^^^^^^^^^^^^^^
623+
624+ To prevent a task from retrying, remove it from the scheduler's
625+ :term:`active window`, e.g.:
626+
627+ .. code-block:: console
628+
629+ $ cylc remove <workflow>//3/foo # remove task 3/foo, preventing it from retrying
630+
631+ If you *kill* a running task that has more retries configured, it goes to the
632+ ``held`` state |task-held| so you can decide whether to release it and continue
633+ the retry sequence, or remove it.
634+
635+ .. code-block:: console
636+
637+ $ cylc kill brew//3/foo # 3/foo goes to held state post kill
638+ $ cylc release brew//3/foo # release to continue retrying...
639+ $ cylc remove brew//3/foo # ... OR remove the task to stop retries
640+
641+ If you want to trigger downstream tasks despite ``3/foo`` being removed before it
642+ could succeed, use ``cylc set`` to artificially mark its
643+ :term:`required outputs <required output>`
644+ as complete (and with the ``--flow`` option, if needed to make a specific
645+ :term:`flow` continue on from there).
568646
569647
570648 .. _user_guide.runtime.task_event_handling: