[Wf-Diagnostics] add troubleshooting guide for activity and workflow retries #201
Merged
14 commits
1d93e55
[Wf-Diagnostics] add troubleshooting guide for activity and workflow …
sankari165 b1395b9
Update 01-timeouts.md
sankari165 54bc161
Update 01-timeouts.md
sankari165 d55d3b6
Update src/docs/08-workflow-troubleshooting/01-timeouts.md
sankari165 798cb2e
Update src/docs/08-workflow-troubleshooting/01-timeouts.md
sankari165 c7466aa
Update src/docs/08-workflow-troubleshooting/03-retries.md
sankari165 228e134
Update src/docs/08-workflow-troubleshooting/03-retries.md
sankari165 61136bb
Update src/docs/08-workflow-troubleshooting/03-retries.md
sankari165 3019dfd
Update src/docs/08-workflow-troubleshooting/03-retries.md
sankari165 b01daa0
Update src/docs/08-workflow-troubleshooting/03-retries.md
sankari165 31b3c4a
Update src/docs/08-workflow-troubleshooting/03-retries.md
sankari165 7323e1d
Update 01-timeouts.md
sankari165 d05690a
Update src/docs/08-workflow-troubleshooting/01-timeouts.md
sankari165 c9cb2e2
Update src/docs/08-workflow-troubleshooting/01-timeouts.md
sankari165
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,28 +27,45 @@ Optionally you can also increase the number of pollers per worker by providing t | |
| [Link to options in go client](https://pkg.go.dev/go.uber.org/[email protected]/internal#WorkerOptions) | ||
| [Link to options in java client](https://github.com/uber/cadence-java-client/blob/master/src/main/java/com/uber/cadence/internal/worker/PollerOptions.java#L124) | ||
|
|
||
| ## Timeouts without heartbeating enabled | ||
| ## No heartbeat timeout or retry policy configured | ||
|
|
||
| Activities time out with StartToClose or ScheduleToClose if the activity takes longer than the configured timeout. | ||
|
|
||
| [Link to description of timeouts](https://cadenceworkflow.io/docs/concepts/activities/#timeouts) | ||
|
|
||
| For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. | ||
| For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. | ||
|
|
||
| Mitigation: Consider enabling heartbeating | ||
| Mitigation: Consider configuring heartbeat timeout and a retry policy | ||
|
|
||
| [Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) | ||
| [Example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) | ||
| [Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) | ||
|
|
||
| For short-running activities, heartbeating is not required, but consider increasing the timeout value to suit the actual activity execution time. | ||
|
|
||
| ## Heartbeat Timeouts after enabling heartbeating | ||
| ## Retry policy configured without setting heartbeat timeout | ||
|
|
||
| Activity has enabled heart beating but the activity timed out with heart beat timeout. This is because the server did not receive a heart beat in the time interval configured as the heart beat timeout. | ||
| Retry policies are configured so that activities can be retried after timeouts or failures. For long-running activities, the worker can die while the activity is executing, e.g. due to regular deployments, host restarts, or failures. Cadence doesn't know about this and waits for the StartToClose or ScheduleToClose timeout to kick in; the retry is attempted only after that timeout. Configuring a heartbeat timeout causes the activity to time out earlier so that it can be retried on another worker. | ||
|
|
||
| Mitigation: Consider configuring heartbeat timeout | ||
|
|
||
| [Example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) | ||
|
|
||
| ## Heartbeat timeout configured without a retry policy | ||
|
|
||
| Heartbeat timeouts are used to detect when a worker has died or restarted. With a heartbeat timeout configured, the activity will time out faster, but without a retry policy it will not be scheduled again on a healthy worker. | ||
|
|
||
| Mitigation: Consider adding a retry policy to the activity | ||
|
|
||
| [Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) | ||
|
|
||
| ## Heartbeat timeout seen after configuring heartbeat timeout | ||
|
|
||
| The activity has a heartbeat timeout configured and timed out with a heartbeat timeout. This is because the server did not receive a heartbeat within the interval configured as the heartbeat timeout. This can happen if the activity is actually not executing or if the activity is not sending periodic heartbeats. The first case is good, since the activity now times out instead of being stuck until StartToClose or ScheduleToClose kicks in. The second case needs a fix. | ||
|
|
||
| Mitigation: Once a heartbeat timeout is configured in the activity options, make sure the activity periodically sends a heartbeat so that the server knows the activity is alive. | ||
|
|
||
| [Example of sending periodic heartbeats](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/fileprocessing/activities.go#L111) | ||
|
|
||
| In the Go client, there is an option to register the activity with auto heartbeating so that heartbeats are sent automatically. | ||
|
|
||
| [Enabling auto heart beat during activity registration example](https://pkg.go.dev/go.uber.org/[email protected]/internal#WorkerOptions) | ||
| [Configuring auto heart beat during activity registration example](https://pkg.go.dev/go.uber.org/[email protected]/internal#WorkerOptions) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| --- | ||
| layout: default | ||
| title: Retries | ||
| permalink: /docs/workflow-troubleshooting/retries | ||
| --- | ||
|
|
||
| # Retries | ||
|
|
||
| Cadence has a retry feature where a retry policy can be configured so that an activity or a workflow will be retried when it fails or times out. | ||
|
|
||
| Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/activities/#retries) and [workflow retries](https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries). | ||
|
|
||
| ## Workflow execution history of retries | ||
|
|
||
| Note how activity retries and workflow retries are shown in the Cadence Web UI. Information about earlier activity attempts is not stored in Cadence; only the last attempt is shown, along with its attempt number. | ||
|
|
||
| The attempt number starts from 0, so `Attempt: 0` refers to the original attempt and `Attempt: 1` refers to the second attempt, i.e. the first retry. | ||
|
|
||
| For workflow retries, when a workflow fails or times out and is retried, the previous execution is completed with a ContinuedAsNew event and a new execution is started with Attempt 1. The ContinuedAsNew event holds the failure reason. | ||
|
|
||
| ## Configuration of activity retries and workflow retries | ||
|
|
||
| Some retry policy values can be misconfigured so that retries do not behave as intended. The known cases are listed here. | ||
|
|
||
| ## MaximumAttempts set to 1 | ||
|
|
||
| In both activity retries and workflow retries it is sufficient to specify a maximum number of attempts or an expiration interval. However, the maximum number of attempts also counts the original attempt. As a result, setting the maximum number of attempts to 1 means the activity or workflow will not be retried. | ||
|
|
||
| ## ExpirationIntervalInSeconds less than InitialIntervalInSeconds | ||
|
|
||
| In both activity retries and workflow retries it is sufficient to specify a maximum number of attempts or an expiration interval. The first retry waits for InitialIntervalInSeconds before starting. When the expiration interval is set lower than the initial interval, the retry policy becomes invalid and the activity or workflow will not be retried. | ||
|
|
||
|
|
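Review note on the two misconfigurations in the retries guide above: the following is a minimal sketch in Go of a check that flags both of them, plus a helper that spells out the zero-based attempt numbering. The `RetryPolicy` struct is a hypothetical stand-in whose field names are borrowed from the headings in the guide. It is not the Cadence client type, and `Warnings` and `AttemptLabel` are illustrative helpers, not library functions.

```go
package main

import "fmt"

// RetryPolicy is a hypothetical, simplified stand-in for a retry policy.
// Field names follow the headings in the guide; this is NOT the client type.
type RetryPolicy struct {
	InitialIntervalInSeconds    int // wait before the first retry
	ExpirationIntervalInSeconds int // zero means not set
	MaximumAttempts             int // zero means not set; counts the original attempt
}

// Warnings returns a message for each misconfiguration described in the guide.
func Warnings(p RetryPolicy) []string {
	var w []string
	// MaximumAttempts includes the original attempt, so 1 leaves no room for a retry.
	if p.MaximumAttempts == 1 {
		w = append(w, "MaximumAttempts is 1: only the original attempt runs, no retries")
	}
	// The first retry waits for the initial interval, so a shorter expiration
	// interval expires before any retry can start.
	if p.ExpirationIntervalInSeconds > 0 && p.ExpirationIntervalInSeconds < p.InitialIntervalInSeconds {
		w = append(w, "ExpirationIntervalInSeconds is less than InitialIntervalInSeconds: no retries")
	}
	return w
}

// AttemptLabel translates the zero-based attempt number into words.
func AttemptLabel(attempt int) string {
	if attempt == 0 {
		return "original attempt"
	}
	return fmt.Sprintf("retry %d", attempt)
}

func main() {
	bad := RetryPolicy{InitialIntervalInSeconds: 10, ExpirationIntervalInSeconds: 5, MaximumAttempts: 1}
	for _, msg := range Warnings(bad) {
		fmt.Println("warning:", msg)
	}
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Printf("Attempt: %d is the %s\n", attempt, AttemptLabel(attempt))
	}
}
```

A policy that should allow two retries would set `MaximumAttempts` to 3 in this model, since the original attempt is counted.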