Commit 1d93e55

[Wf-Diagnostics] add troubleshooting guide for activity and workflow retries
1 parent a1c85bd commit 1d93e55

3 files changed

+53
-2
lines changed

src/.vuepress/config.js

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -178,6 +178,7 @@ module.exports = {
178178
'08-workflow-troubleshooting/',
179179
'08-workflow-troubleshooting/01-timeouts',
180180
'08-workflow-troubleshooting/02-activity-failures',
181+
'08-workflow-troubleshooting/03-retries',
181182
],
182183
},
183184
{

src/docs/08-workflow-troubleshooting/01-timeouts.md

Lines changed: 19 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,20 +27,37 @@ Optionally you can also increase the number of pollers per worker by providing t
2727
[Link to options in go client](https://pkg.go.dev/go.uber.org/[email protected]/internal#WorkerOptions)
2828
[Link to options in java client](https://github.com/uber/cadence-java-client/blob/master/src/main/java/com/uber/cadence/internal/worker/PollerOptions.java#L124)
2929

30-
## Timeouts without heartbeating enabled
30+
## Timeouts without heartbeat timeout or retry policy configured
3131

3232
Activities time out with a StartToClose or ScheduleToClose timeout if the activity takes longer than the configured timeout.
3333

3434
[Link to description of timeouts](https://cadenceworkflow.io/docs/concepts/activities/#timeouts)
3535

3636
For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in.
3737

38-
Mitigation: Consider enabling heartbeating
38+
Mitigation: Consider configuring heartbeat timeout and a retry policy
3939

4040
[Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23)
41+
[Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries)
4142

4243
For short-running activities, heartbeating is not required, but consider increasing the timeout value to suit the actual activity execution time.
4344

45+
## Timeouts without heartbeat timeout configured but a retry policy configured
46+
47+
Configuring a retry policy is good practice so that activities are retried after timeouts or failures. For long-running activities, the worker can die while the activity is executing, due to regular deployments, host restarts, or failures. Cadence doesn't know about this and waits for the StartToClose or ScheduleToClose timeout to kick in, so the retry is attempted only after that timeout. With a heartbeat timeout configured, the activity times out earlier and is retried on another worker sooner.
48+
49+
Mitigation: Consider configuring heartbeat timeout
50+
51+
[Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23)
52+
53+
## Timeouts with heartbeating enabled but without a retry policy configured
54+
55+
Heartbeat timeouts are used to detect when a worker has died or restarted during a deployment. With a heartbeat timeout configured, the activity times out faster. But without a retry policy, it will not be scheduled again on a healthy worker.
56+
57+
Mitigation: Consider adding retry policy to an activity
58+
59+
[Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries)
60+
4461
## Heartbeat Timeouts after enabling heartbeating
4562

4663
The activity has heartbeating enabled but timed out with a heartbeat timeout. This is because the server did not receive a heartbeat within the interval configured as the heartbeat timeout.
Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
---
2+
layout: default
3+
title: Retries
4+
permalink: /docs/workflow-troubleshooting/retries
5+
---
6+
7+
# Retries
8+
9+
Cadence lets you configure a retry policy so that an activity or a workflow is retried when it fails or times out.
10+
11+
Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/activities/#retries) and [workflow retries](https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries)
12+
13+
## Workflow execution history of retries
14+
15+
One thing to note is how activity retries and workflow retries are shown in the Cadence Web UI. Individual activity retries are not recorded in the workflow execution history; only the last attempt is shown, along with its attempt number.
16+
17+
Moreover, the attempt number starts from 0, so Attempt:0 refers to the original attempt and Attempt:1 refers to the second attempt, that is, the first retry.
18+
19+
For workflow retries, when a workflow fails or times out and is retried, the previous execution is completed with a ContinuedAsNew event and a new execution is started with Attempt 1. The ContinuedAsNew event holds the details of the failure reason.
20+
21+
## Configuration of activity retries and workflow retries
22+
23+
Some of the configurable values can be misconfigured, and as a result the retry policy will not have the intended behaviour. These are listed here.
24+
25+
## MaximumAttempts set to 1
26+
27+
In both activity retries and workflow retries, it is sufficient to specify either a maximum number of attempts or an expiration interval. However, the maximum number of attempts also counts the original attempt. As a result, setting the maximum number of attempts to 1 means the activity or workflow will not be retried.
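The off-by-one between attempts and retries can be made explicit with a short sketch. The helper names `retriesAllowed` and `attemptLabel` are hypothetical and not part of any Cadence client; the sketch only models an explicit positive limit and deliberately reports anything else as out of scope.

```go
package main

import "fmt"

// retriesAllowed returns how many retries follow the original attempt when
// maximumAttempts is an explicit positive limit. ok is false for other values,
// which this sketch does not model.
func retriesAllowed(maximumAttempts int) (retries int, ok bool) {
	if maximumAttempts < 1 {
		return 0, false
	}
	return maximumAttempts - 1, true
}

// attemptLabel maps the zero-based attempt number shown in the UI to a
// human-readable description.
func attemptLabel(attempt int) string {
	if attempt == 0 {
		return "original attempt"
	}
	return fmt.Sprintf("retry %d", attempt)
}

func main() {
	for _, limit := range []int{1, 2, 5} {
		retries, _ := retriesAllowed(limit)
		fmt.Printf("MaximumAttempts=%d allows %d retries\n", limit, retries)
	}
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Printf("Attempt:%d is the %s\n", attempt, attemptLabel(attempt))
	}
}
```

To get one retry after the original attempt, the limit in this model has to be 2, not 1.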
28+
29+
## ExpirationIntervalInSeconds less than InitialIntervalInSeconds
30+
31+
In both activity retries and workflow retries, it is sufficient to specify either a maximum number of attempts or an expiration interval. The first retry waits for InitialIntervalInSeconds before starting, so when the expiration interval is set lower than the initial interval, the retry policy becomes invalid and the activity or workflow will not be retried.
32+
33+
