Commit aadcbce

[Wf-Diagnostics] add troubleshooting guide for timeouts in workflow (#189)
1 parent 41a16f4 commit aadcbce

6 files changed

+9247
-9179
lines changed

src/.vuepress/config.js

Lines changed: 10 additions & 3 deletions
@@ -166,16 +166,23 @@ module.exports = {
166166
'07-operation-guide/05-migration',
167167
],
168168
},
169+
{
170+
title: 'Workflow Troubleshooting',
171+
path: '/docs/08-workflow-troubleshooting/',
172+
children: [
173+
'08-workflow-troubleshooting/01-timeouts',
174+
],
175+
},
169176
{
170177
title: 'Glossary',
171178
path: '../GLOSSARY',
172179
},
173180
{
174181
title: 'About',
175-
path: '/docs/08-about',
182+
path: '/docs/09-about',
176183
children: [
177-
'08-about/',
178-
'08-about/01-license',
184+
'09-about/',
185+
'09-about/01-license',
179186
],
180187
},
181188
],
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
---
2+
layout: default
3+
title: Timeouts
4+
permalink: /docs/workflow-troubleshooting/timeouts
5+
---
6+
7+
# Timeouts
8+
9+
## Missing Pollers
10+
11+
Cadence workers are part of the service that hosts and executes the workflow. There are two types: activity workers and workflow workers. Each worker runs pollers, which are goroutines that poll the Cadence server for tasks: activity workers poll for activity tasks and workflow workers poll for decision tasks. Without pollers, the workflow cannot proceed with its execution.
12+
13+
Mitigation: Make sure these workers are configured with the task lists used by the workflow and its activities, so that the server can dispatch tasks to the Cadence workers.
14+
15+
[Worker setup example](https://github.com/uber-common/cadence-samples/blob/master/cmd/samples/pageflow/main.go#L18)
16+
17+
## Tasklist backlog despite having pollers
18+
19+
If a tasklist has pollers but the backlog continues to grow, it is a supply-demand issue: tasks are being added faster than the workers can handle them. The server wants to dispatch more tasks to the workers, but they cannot keep up.
20+
21+
Mitigation: Increase the number of Cadence workers by horizontally scaling the instances where the workflow is running.
22+
23+
Optionally, you can also increase the number of pollers per worker through the worker options.
24+
25+
[Link to options in go client](https://pkg.go.dev/go.uber.org/[email protected]/internal#WorkerOptions)
26+
[Link to options in java client](https://github.com/uber/cadence-java-client/blob/master/src/main/java/com/uber/cadence/internal/worker/PollerOptions.java#L124)
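
The supply-demand framing in this section can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the Cadence client: the function names and the idea of a fixed per-worker throughput are assumptions made for illustration, and real throughput depends on task duration, poller count, and concurrency limits.

```go
package main

import (
	"fmt"
	"math"
)

// backlogGrowthPerSecond returns how fast the tasklist backlog grows
// (positive) or drains (negative). Hypothetical helper: it assumes every
// worker completes tasksPerWorkerPerSec tasks per second.
func backlogGrowthPerSecond(arrivalPerSec float64, workers int, tasksPerWorkerPerSec float64) float64 {
	return arrivalPerSec - float64(workers)*tasksPerWorkerPerSec
}

// workersNeeded returns the smallest worker count whose combined throughput
// is at least the arrival rate. tasksPerWorkerPerSec must be positive.
func workersNeeded(arrivalPerSec, tasksPerWorkerPerSec float64) int {
	return int(math.Ceil(arrivalPerSec / tasksPerWorkerPerSec))
}

func main() {
	// 100 tasks/s arrive, 4 workers handle 20 tasks/s each.
	fmt.Println(backlogGrowthPerSecond(100, 4, 20)) // 20: the backlog grows by 20 tasks every second
	fmt.Println(workersNeeded(100, 20))             // 5: one more worker stops the growth
}
```

A growth rate that stays positive means the backlog never drains, which is the signal to add workers or pollers.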
27+
28+
## Timeouts without heartbeating enabled
29+
30+
An activity times out with a StartToClose or ScheduleToClose timeout if it takes longer than the configured timeout.
31+
32+
[Link to description of timeouts](https://cadenceworkflow.io/docs/concepts/activities/#timeouts)
33+
34+
For long-running activities, the worker can die while the activity is executing, for example because of regular deployments, host restarts, or failures. Cadence does not know about this and will wait for the StartToClose or ScheduleToClose timeout to kick in.
35+
36+
Mitigation: Consider enabling heartbeating.
37+
38+
[Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23)
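
To see why heartbeating helps for long-running activities, compare how long the server waits before it notices a dead worker. This is a minimal sketch under stated assumptions: `detectedAt` is a hypothetical helper, not a Cadence API, and it assumes the last heartbeat arrived just before the worker died.

```go
package main

import (
	"fmt"
	"time"
)

// detectedAt estimates how long after the activity started the server
// notices that the worker died. A heartbeatTimeout of zero means
// heartbeating is not enabled. Hypothetical helper for illustration only.
func detectedAt(diedAt, startToClose, heartbeatTimeout time.Duration) time.Duration {
	if heartbeatTimeout <= 0 {
		// Without heartbeats, only the StartToClose timeout fires.
		return startToClose
	}
	// With heartbeats, the server gives up one heartbeat timeout after
	// the last heartbeat, but never later than StartToClose.
	if t := diedAt + heartbeatTimeout; t < startToClose {
		return t
	}
	return startToClose
}

func main() {
	died := 2 * time.Minute // worker dies 2 minutes into a 1-hour activity
	fmt.Println(detectedAt(died, time.Hour, 0))              // 1h0m0s
	fmt.Println(detectedAt(died, time.Hour, 30*time.Second)) // 2m30s
}
```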
39+
40+
For short-running activities, heartbeating is not required, but consider increasing the timeout value to match the actual activity execution time.
41+
42+
## Heartbeat Timeouts after enabling heartbeating
43+
44+
The activity has heartbeating enabled but still timed out with a heartbeat timeout. This happens when the server does not receive a heartbeat within the interval configured as the heartbeat timeout.
45+
46+
Mitigation: Once a heartbeat timeout is configured in the activity options, make sure the activity periodically sends a heartbeat so that the server knows the activity is still alive.
47+
48+
[Example to send periodic heart beat](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/fileprocessing/activities.go#L111)
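
The periodic pattern described above can be sketched with a ticker that fires while the work runs. This is an illustration only: `runWithHeartbeat`, `demoHeartbeats`, and the `beat` callback are hypothetical names, and `beat` stands in for whatever heartbeat call the client library provides. The interval should be comfortably shorter than the configured heartbeat timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithHeartbeat runs work in a separate goroutine and calls beat once per
// interval until the work finishes or ctx is cancelled. beat is a
// hypothetical stand-in for the client's heartbeat call.
func runWithHeartbeat(ctx context.Context, interval time.Duration, beat func(), work func() error) error {
	done := make(chan error, 1) // buffered so the worker goroutine never blocks
	go func() { done <- work() }()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			beat()
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demoHeartbeats simulates a 60ms activity that heartbeats every 5ms and
// returns how many heartbeats were sent.
func demoHeartbeats() (int, error) {
	beats := 0
	err := runWithHeartbeat(context.Background(), 5*time.Millisecond,
		func() { beats++ },
		func() error { time.Sleep(60 * time.Millisecond); return nil })
	return beats, err
}

func main() {
	beats, err := demoHeartbeats()
	fmt.Println("error:", err, "heartbeats sent:", beats)
}
```

The exact heartbeat count depends on timer scheduling, so the sketch reports it rather than assuming a fixed number.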
49+
50+
In the Go client, there is an option to register the activity with auto heartbeating so that heartbeats are sent automatically.
51+
52+
[Enabling auto heart beat during activity registration example](https://pkg.go.dev/go.uber.org/[email protected]/internal#WorkerOptions)
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
---
2+
layout: default
3+
title: Overview
4+
permalink: /docs/workflow-troubleshooting
5+
---
6+
7+
# Workflow Troubleshooting Overview
8+
9+
This document is a guide for troubleshooting potential issues in a workflow.
File renamed without changes.
File renamed without changes.