
Commit 5f6a311

Bueller87 and “Kevin” authored
Add Doc: Go Client - Worker auto scaling (#251)
* added Metrics and Compatibility Guide
* clarify critical production problems
* Add more examples at the top, move compat section to top

---------

Co-authored-by: “Kevin” <“[email protected]”>
1 parent 6ff75b2 commit 5f6a311

File tree

6 files changed: +169 -0 lines changed

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
---
layout: default
title: Worker auto scaling
permalink: /docs/go-client/worker-auto-scaling
---

## Overview

### What AutoScaler does

Cadence Worker AutoScaler automatically adjusts your worker configuration to optimize resource utilization and prevent common scaling issues. Instead of manually tuning poller counts, AutoScaler monitors real-time metrics from your workers and the Cadence service to make intelligent scaling decisions.

The AutoScaler addresses these critical production problems:

- **Insufficient throughput capacity**: Automatically scales up pollers when task load increases, ensuring workflows are processed without delays
- **Better resource utilization**: By adjusting poller counts based on actual task demand rather than static configuration, workers utilize their allocated CPU resources more efficiently, preventing unnecessary downscaling by compute autoscalers (e.g. AWS EC2 Auto Scaling)
- **Manual configuration complexity**: Eliminates the need for service owners to understand and tune complex worker parameters

### Key benefits

- **Zero-configuration scaling**: Works out of the box with sensible defaults
- **Improved resource efficiency**: Automatically scales up when needed, scales down when idle
- **Reduced operational overhead**: No more manual tuning of poller counts and execution limits
- **Production reliability**: Prevents scaling-related incidents and workflow processing delays

### How to get started

> To get started, just add the following to your worker options:

```go
worker.Options{
    // ... other options ...
    AutoScalerOptions: worker.AutoScalerOptions{
        Enabled: true,
    },
}
```
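
For context, here is a minimal sketch of how these options fit into a complete worker setup. The domain and task list names are placeholders, and the Cadence service client and logger are assumed to be constructed elsewhere (for example via YARPC):

```go
import (
    "go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
    "go.uber.org/cadence/worker"
    "go.uber.org/zap"
)

// startWorker creates and starts a worker with the AutoScaler enabled.
// "your-domain" and "your-task-list" are placeholder names.
func startWorker(service workflowserviceclient.Interface, logger *zap.Logger) (worker.Worker, error) {
    w := worker.New(service, "your-domain", "your-task-list", worker.Options{
        Logger: logger,
        AutoScalerOptions: worker.AutoScalerOptions{
            Enabled: true,
        },
    })
    if err := w.Start(); err != nil {
        return nil, err
    }
    return w, nil
}
```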

> ⚠️ **Note:** If enabled, the AutoScaler will ignore these options:

```go
worker.Options{
    // ...
    MaxConcurrentActivityTaskPollers: 4,
    MaxConcurrentDecisionTaskPollers: 2,
    // ...
}
```

### Compatibility considerations

**Poller Count Setup**: Before enabling AutoScaler, set the initial poller count (`PollerInitCount`) to the maximum of your previously configured decision and activity poller counts. This prevents AutoScaler from starting with insufficient polling capacity.

> For example:
```go
worker.Options{
    // ... other options ...
    AutoScalerOptions: worker.AutoScalerOptions{
        Enabled:         true,
        PollerMinCount:  2,
        PollerMaxCount:  8,
        PollerInitCount: 4, // max of the previously configured pollers (4 and 2 above)
    },
}
```

## Scenario: Low CPU Utilization

### Problem description

One of the most common production issues with Cadence workers occurs when compute autoscalers incorrectly scale down worker instances due to low CPU utilization. This creates a deceptive situation where workers appear to be underutilized from a resource perspective but are actually performing critical work.

Here's what typically happens: Cadence workers spend most of their time polling the Cadence service for tasks. This polling activity is lightweight and doesn't consume significant CPU resources, leading to consistently low CPU usage metrics (often 5-15%). Compute autoscalers like the Kubernetes HPA (Horizontal Pod Autoscaler) or cloud provider autoscaling groups see these low CPU numbers and interpret them as a signal that fewer worker instances are needed.

When the autoscaler reduces the number of worker instances, several problems emerge:
- **Reduced polling capacity**: Fewer workers mean fewer pollers actively checking for new tasks, which can delay task processing
- **Cascading delays**: As tasks take longer to be picked up, workflow execution times increase, potentially causing timeouts or SLA violations
- **Inefficient resource allocation**: The remaining workers may actually be quite busy processing tasks, but the polling overhead isn't reflected in CPU metrics

This problem is particularly acute in environments where:
- Workflow tasks are I/O intensive rather than CPU intensive
- Workers handle high volumes of short-duration tasks
- There are periods of variable workload where a quick scaling response is crucial

The fundamental issue is that traditional CPU-based autoscaling doesn't account for the unique nature of Cadence worker operations, where "being busy" doesn't necessarily translate to high CPU usage.

### How AutoScaler helps

AutoScaler solves the CPU utilization problem by providing intelligent metrics that better represent actual worker utilization. Instead of relying solely on CPU metrics, AutoScaler monitors Cadence-specific signals to make scaling decisions.

The AutoScaler tracks several key indicators:
- **Poller utilization**: How busy the pollers are, regardless of CPU usage
- **Task pickup latency**: How quickly tasks are being retrieved from task lists
- **Queue depth**: The number of pending tasks waiting to be processed
- **Worker capacity**: The actual capacity of workers to handle more work

When AutoScaler detects that workers are genuinely underutilized (based on Cadence metrics, not just CPU), it can safely reduce poller counts. Conversely, when it detects that workers are busy or task lists are backing up, it increases poller counts to improve task pickup rates.

This approach prevents the common scenario where compute autoscalers scale down workers that appear idle but are actually critical for maintaining workflow performance. AutoScaler provides a more accurate representation of worker utilization that can be used to make better scaling decisions at both the worker configuration level and the compute infrastructure level.

## Scenario: Task List Backlogs

### Understanding task list imbalances

Task list backlogs occur when some task lists receive more traffic than their allocated pollers can handle, while other task lists remain underutilized. This imbalance is common in production environments with multiple workflows, domains, or variable traffic patterns.

The core problem stems from static poller allocation. Traditional Cadence worker configurations assign a fixed number of pollers to each task list, which works well for predictable workloads but fails when:

- **Traffic varies between task lists**: Some task lists get heavy traffic while others remain quiet
- **Workload patterns change over time**: Peak hours create temporary backlogs that resolve slowly
- **Multi-domain deployments**: Traffic spikes in one domain affect resource availability for others
- **Workflow dependencies**: Backlogs in upstream task lists cascade to dependent workflows

These imbalances lead to uneven workflow execution times, SLA violations, and inefficient resource utilization. Manual solutions like increasing worker counts or reconfiguring poller allocation are expensive and reactive.

### AutoScaler's poller management

AutoScaler solves task list backlogs through dynamic poller management that automatically redistributes polling capacity based on real-time demand.

Key capabilities include:

**Automatic backlog detection**: AutoScaler monitors task list metrics like queue depth, pickup latency, and task arrival rates to identify developing backlogs before they become critical.

**Dynamic reallocation**: When a task list needs more capacity, AutoScaler automatically moves pollers from underutilized task lists. This reallocation happens without worker restarts or manual intervention.

**Pattern learning**: The system learns from historical traffic patterns to anticipate regular peak periods and preemptively allocate resources.

**Safety controls**: AutoScaler maintains minimum poller counts for each task list and includes safeguards to prevent resource thrashing or system instability.

This approach ensures that polling capacity is always aligned with actual demand, preventing backlogs while maintaining efficient resource utilization across all task lists.
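
As a rough sketch of how these safeguards map onto the documented options, a worker serving a bursty task list could bound the AutoScaler like this (the specific values are illustrative, not recommendations):

```go
opts := worker.Options{
    AutoScalerOptions: worker.AutoScalerOptions{
        Enabled:         true,
        PollerMinCount:  2,  // floor: the task list always keeps some polling capacity
        PollerMaxCount:  16, // ceiling: limits how far a backlog can scale pollers up
        PollerInitCount: 4,  // starting point before the AutoScaler has gathered metrics
    },
}
```

The min and max counts act as the safety rails described above; within them, the AutoScaler adjusts the poller count based on observed demand.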

## Metrics Guide

### Key metrics to monitor

Monitor these key metrics to understand AutoScaler performance:

#### Decision Poller Quota
- **Description:** Track decision poller count over time
- **Name:** cadence-concurrency-auto-scaler.poller-quota
- **Worker Type:** decisionworker
- **Type:** Heatmap

![Decision Poller Quota](img/dash-decision-poller-quota.png)

#### Activity Poller Quota
- **Description:** Track activity poller count over time
- **Name:** cadence-concurrency-auto-scaler.poller-quota
- **Worker Type:** activityworker
- **Type:** Heatmap

![Activity Poller Quota](img/dash-activity-poller-quota.png)

#### Decision Poller Wait Time
- **Description:** Track decision poller wait time over time
- **Name:** cadence-concurrency-auto-scaler.poller-wait-time
- **Worker Type:** decisionworker
- **Type:** Heatmap

![Decision Poller Wait Time](img/dash-decision-poller-wait-time.png)

#### Activity Poller Wait Time
- **Description:** Track activity poller wait time over time
- **Name:** cadence-concurrency-auto-scaler.poller-wait-time
- **Worker Type:** activityworker
- **Type:** Heatmap

![Activity Poller Wait Time](img/dash-activity-poller-wait-time.png)
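
These metrics are emitted through the worker's metrics scope, so the worker needs one configured to report them. Below is a minimal sketch using tally; it assumes the AutoScaler metrics are reported on the same `MetricsScope` as other client-side worker metrics, and the reporter parameter is a stand-in for whatever stats backend you use (Prometheus, M3, statsd, etc.):

```go
import (
    "time"

    "github.com/uber-go/tally"
    "go.uber.org/cadence/worker"
)

// newWorkerOptions wires a tally scope into the worker options so that
// AutoScaler metrics such as cadence-concurrency-auto-scaler.poller-quota
// can reach your metrics backend. The io.Closer returned by NewRootScope
// is ignored here for brevity.
func newWorkerOptions(reporter tally.StatsReporter) worker.Options {
    scope, _ := tally.NewRootScope(tally.ScopeOptions{Reporter: reporter}, time.Second)
    return worker.Options{
        MetricsScope: scope,
        AutoScalerOptions: worker.AutoScalerOptions{
            Enabled: true,
        },
    }
}
```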

sidebars.ts

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ const sidebars: SidebarsConfig = {
      items: [
        { type: 'doc', id: 'go-client/index' },
        { type: 'doc', id: 'go-client/workers' },
+       { type: 'doc', id: 'go-client/worker-auto-scaling' },
        { type: 'doc', id: 'go-client/create-workflows' },
        { type: 'doc', id: 'go-client/starting-workflows' },
        { type: 'doc', id: 'go-client/activities' },
