@@ -511,60 +511,138 @@ adding empty runtime placeholders instead of allowing implicit tasks:
511
511
512
512
.. _TaskRetries:
513
513
514
- Task Retry On Failure
515
- ---------------------
514
+ Automatically Retrying Tasks
515
+ ----------------------------
516
516
517
- .. seealso::
517
+ .. tutorial:: tutorial.retries
518
518
519
- :cylc:conf:`[runtime][<namespace>]execution retry delays`.
519
+ Cylc can be configured to automatically resubmit (i.e. retry) jobs which failed
520
+ or submit-failed using these task configurations:
520
521
521
- Tasks can have a list of :term:`ISO8601 durations <ISO8601 duration>` as retry
522
- intervals. If the job fails the task will return to the ``waiting`` state
523
- with a clock-trigger configured with the next retry delay.
522
+ .. cylc-scope:: flow.cylc[runtime][<namespace>]
524
523
524
+ `execution retry delays`
525
+ Configure retries for jobs which failed during execution (failed jobs - |job-failed|).
526
+ `submission retry delays`
527
+ Configure retries for jobs which failed during submission and so never ran
528
+ (submit-failed jobs - |job-submit-failed|).
525
529
526
- .. note::
530
+ Retry delays should be set to a list of
531
+ :term:`ISO8601 durations <ISO8601 duration>` that specify how long to wait
532
+ before retrying the task, e.g.:
533
+
534
+ .. code-block:: cylc
535
+
536
+ [runtime]
537
+ [[my-task]]
538
+ script = do-something
539
+
540
+ # If the job fails, wait 30 seconds, then try again
541
+ execution retry delays = PT30S
527
542
528
- Tasks only enter the ``submit-failed`` state if job submission fails with no
529
- retries left. Otherwise they return to the waiting state, to wait on the
530
- next try.
543
+ # If the job submit-fails, wait one minute then try again.
544
+ # If the retry submit-fails, wait a further 5 minutes, then try again.
545
+ # If the second retry submit-fails, wait a further 15 minutes, then try again.
546
+ submission retry delays = PT1M, PT5M, PT15M
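+
+ Where several consecutive delays are identical, the list can be written more
+ compactly. The following is a minimal sketch, assuming the ``N*`` multiplier
+ form used in the previous version of this section (``3*PT10S``) is still
+ accepted. The task name ``my-other-task`` is hypothetical:
+
+ .. code-block:: cylc
+
+    [runtime]
+        [[my-other-task]]
+            script = do-something
+
+            # Retry up to three times, waiting 10 seconds before each retry.
+            execution retry delays = 3*PT10S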
531
547
532
- Tasks only enter the ``failed`` state if job execution fails with no retries
533
- left. Otherwise they return to the waiting state, to wait on the next try.
534
548
549
+ Details
550
+ ^^^^^^^
551
+
552
+ For a task with execution / submission retries configured:
553
+
554
+ * When a job fails or submit-fails, the task will change back into the
555
+ ``waiting`` state |task-waiting| and a retry will be scheduled.
556
+ * The task will not enter the failed or submit-failed state until all retries
557
+ have been exhausted. This means that graph triggers
558
+ (e.g. ``foo:failed => bar``) and `task events <flow.cylc[runtime][<namespace>][events]>`
559
+ (e.g. `[events]failed handlers`) will not be run until the task runs out of
560
+ retries (rather than after the first failure / submission-failure) and will
561
+ not be run at all if a retry subsequently succeeds.
562
+ * The :ref:`$CYLC_TASK_TRY_NUMBER <Task Job Script Variables>`
563
+ environment variable increments with each
564
+ automatic submission, allowing you to vary task behaviour between retries.
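+
+ Because ``failed`` events are held back until the retries run out, it can be
+ useful to react to each individual retry as well. The following is a minimal
+ sketch, assuming the ``retry`` and ``submission retry`` task events (and their
+ ``retry handlers`` / ``submission retry handlers`` settings) are available in
+ your Cylc version. The task name ``my-task`` and the handler command
+ ``notify-retry`` are hypothetical:
+
+ .. code-block:: cylc
+
+    [runtime]
+        [[my-task]]
+            script = do-something
+            execution retry delays = PT1M, PT5M
+            submission retry delays = PT1M
+
+            [[[events]]]
+                # Assumed to run each time the task is scheduled for a retry.
+                retry handlers = notify-retry
+                submission retry handlers = notify-retry
+
+                # Runs only once all execution retries are exhausted.
+                failed handlers = notify-retry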
565
+
566
+ .. cylc-scope::
567
+
568
+ .. versionchanged:: 8.0.0
569
+
570
+ Tasks that fail but are configured to :term:`retry` return to the ``waiting``
571
+ state, with a new clock trigger to handle the configured retry delay.
572
+
573
+ .. note::
574
+
575
+ A task that is waiting on a retry will already have one or more failed jobs
576
+ associated with it.
535
577
536
578
537
- In the following example, tasks ``bad`` and ``flaky`` each have 3 retries
538
- configured, with a 10 second delay between. On the final try, ``bad`` fails
539
- again and goes to the ``failed`` state, while ``flaky`` succeeds and triggers
540
- task ``whizz`` downstream. The scheduler will then stall because
541
- ``bad`` failed (which is a :term:`final status`) with incomplete outputs.
579
+ Advanced Example
580
+ ^^^^^^^^^^^^^^^^
542
581
543
582
.. code-block:: cylc
544
583
545
584
[scheduling]
546
585
[[graph]]
547
586
R1 = """
548
- bad => cheese
549
- flaky => whizz
550
- """
551
- [runtime]
552
- [[bad]]
553
- # retry 3 times then fail
554
- script = """
555
- sleep 10
556
- false
587
+ # If task "a" succeeds in three attempts or fewer, then run the
588
+ # task "continue":
589
+ a:succeed? => continue
590
+
591
+ # If task "a" still fails after two retries, then run "recover":
592
+ a:fail? => recover
557
593
"""
558
- execution retry delays = 3*PT10S
559
- [[flaky]]
560
- # retry 3 times then succeed
594
+
595
+ [runtime]
596
+ [[a]]
561
597
script = """
562
- sleep 10
563
- test $CYLC_TASK_TRY_NUMBER -gt 3
598
+ if [[ $CYLC_TASK_TRY_NUMBER -eq 1 ]]; then
599
+ # this is not an automatic retry
600
+ export DEBUG=false
601
+ else
602
+ # this is a retry -> turn on some extra debugging
603
+ export DEBUG=true
604
+ fi
605
+ do-something
564
606
"""
565
- execution retry delays = 3*PT10S
566
- [[cheese, whizz]]
567
- script = "sleep 10"
607
+
608
+ # Schedule two retries for this task:
609
+ # * The first retry will happen one minute after the task fails.
610
+ # * The second retry will happen three minutes after the first retry
611
+ # fails.
612
+ execution retry delays = PT1M, PT3M
613
+
614
+ [[[events]]]
615
+ # These "failed" task events will only be actioned if the task
616
+ # has exhausted all of its retries:
617
+ mail events = failed
618
+ failed handlers = my-task-event-handler
619
+
620
+
621
+ Aborting a Retry Sequence
622
+ ^^^^^^^^^^^^^^^^^^^^^^^^^
623
+
624
+ To prevent a task from retrying, remove it from the scheduler's
625
+ :term:`active window`, e.g.:
626
+
627
+ .. code-block:: console
628
+
629
+ $ cylc remove <workflow>//3/foo # remove task 3/foo, preventing it from retrying
630
+
631
+ If you *kill* a running task that has more retries configured, it goes to the
632
+ ``held`` state |task-held| so you can decide whether to release it and continue
633
+ the retry sequence, or remove it.
634
+
635
+ .. code-block:: console
636
+
637
+ $ cylc kill brew//3/foo # 3/foo goes to held state post kill
638
+ $ cylc release brew//3/foo # release to continue retrying...
639
+ $ cylc remove brew//3/foo # ... OR remove the task to stop retries
640
+
641
+ If you want to trigger downstream tasks despite ``3/foo`` being removed before it
642
+ could succeed, use ``cylc set`` to artificially mark its
643
+ :term:`required outputs <required output>`
644
+ as complete (adding the ``--flow`` option if you need a specific
645
+ :term:`flow` to continue on from there).
568
646
569
647
570
648
.. _user_guide.runtime.task_event_handling: