---
title: "Adaptive Tasklist Scaler"
subtitle: test
date: 2025-06-30
authors: shaddoll
tags:
- deep-dive
- cadence-operations
- cadence-matching
---

At Uber, we previously relied on a dynamic configuration service to manually control the number of partitions for scalable tasklists. This configuration approach introduced several operational challenges:

- **Error-prone:** Manual updates and deployments were required.
- **Unresponsive:** Adjustments were typically reactive, often triggered by customer reports or observed backlogs.
- **Irreversible:** Once increased, the number of partitions was rarely decreased, both because of the complexity of the two-phase process and because teams preferred to keep headroom for anticipated traffic spikes.

To address these issues, we introduced a new component in the Cadence Matching service: **Adaptive Tasklist Scaler**. This component dynamically monitors tasklist traffic and adjusts partition counts automatically. Since its rollout, we've seen a significant reduction in incidents and operational overhead caused by misconfigured tasklists.

---

## What is a Scalable Tasklist?

A **scalable tasklist** is one that supports multiple partitions. Since Cadence’s Matching service is sharded by tasklist, all requests to a specific tasklist are routed to a single Matching host. To avoid bottlenecks and enhance scalability, tasklists can be partitioned so that multiple Matching hosts handle traffic concurrently.

These partitions are transparent to clients. When a request arrives at the Cadence server for a scalable tasklist, the server selects an appropriate partition. More details can be found in [this document](https://github.com/cadence-workflow/cadence/blob/v1.3.1/docs/scalable_tasklist.md).
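
Since routing by name is what makes partitioning effective, a tiny illustrative sketch may help. The helper and the partition names below are hypothetical, not the actual Cadence code; the real server uses its own membership ring and partition naming scheme, described in the linked document.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// matchingHostFor is a toy stand-in for Matching's routing: requests are
// sharded by name, so different partitions of one tasklist can land on
// different Matching hosts, spreading the load.
func matchingHostFor(partitionName string, hosts []string) string {
	h := fnv.New32a()
	h.Write([]byte(partitionName))
	return hosts[h.Sum32()%uint32(len(hosts))]
}

func main() {
	hosts := []string{"matching-0", "matching-1", "matching-2"}
	for _, p := range []string{"orders-tasklist", "orders-tasklist/1", "orders-tasklist/2"} {
		fmt.Println(p, "->", matchingHostFor(p, hosts))
	}
}
```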

### How Is the Number of Partitions Manually Configured?

The number of partitions for a tasklist is controlled by two dynamic configuration properties:

1. [`matching.numTasklistReadPartitions`](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3350): Specifies the number of **read** partitions.
2. [`matching.numTasklistWritePartitions`](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3344): Specifies the number of **write** partitions.

To prevent misconfiguration, a guardrail is in place to ensure that the number of read partitions is **never less than** the number of write partitions.
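
Conceptually, the guardrail behaves like the small sketch below. The helper name is hypothetical and this is not the actual Cadence code; the point is only that tasks must never be written to a partition that readers are not polling.

```go
// resolvePartitionCounts is a hypothetical helper illustrating the guardrail:
// the effective number of read partitions is never allowed to drop below the
// number of write partitions, otherwise tasks written to the extra write
// partitions would have no readers to drain them.
func resolvePartitionCounts(numRead, numWrite int) (read, write int) {
	if numRead < numWrite {
		numRead = numWrite
	}
	return numRead, numWrite
}
```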

When **increasing** the number of partitions, both properties are typically updated simultaneously. However, due to the guardrail, the order of updates doesn't matter: read and write partitions can be increased in any sequence.

In contrast, **decreasing** the number of partitions is more complex and requires a **two-phase process**:

1. **First**, reduce the number of write partitions.
2. **Then**, wait for any backlog in the decommissioned partitions to drain completely.
3. **Finally**, reduce the number of read partitions.

Because this process is tedious, error-prone, and backlog-sensitive, it is rarely performed in production environments.

---

## How Does Adaptive Tasklist Scaler Work?

The architecture of the adaptive tasklist scaler is shown below:

![adaptive tasklist scaler architecture](./adaptive-tasklist-scaler/architecture.png)

### 1. Migrating Configuration to the Database

The first key change was migrating partition count configuration from dynamic config to the Cadence cluster’s database. This allows the configuration to be updated programmatically.

- The **adaptive tasklist scaler** runs in the root partition only.
- It reads and updates the partition count.
- Updates propagate to non-root partitions via a **push model**, and to pollers and producers via a **pull model**.
- A **version number** is associated with each config. The version only increments through scaler updates, ensuring monotonicity and consistency across components.
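
The version check can be pictured with the sketch below. The type and method names are hypothetical rather than the actual Cadence implementation; it only illustrates how a strictly increasing version keeps non-root partitions, pollers, and producers from ever rolling back to a stale configuration.

```go
// partitionConfig is a hypothetical stand-in for the tasklist partition
// configuration stored in the database.
type partitionConfig struct {
	Version  int64
	NumRead  int
	NumWrite int
}

// maybeUpdate accepts an incoming config only if it is strictly newer than
// the one currently held, so duplicate or out-of-order pushes and pulls are
// ignored instead of overwriting fresher state.
func (c *partitionConfig) maybeUpdate(incoming partitionConfig) bool {
	if incoming.Version <= c.Version {
		return false
	}
	*c = incoming
	return true
}
```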

### 2. Monitoring Tasklist Traffic

The scaler periodically monitors the **write QPS** of each tasklist.

- If QPS exceeds an upscale threshold for a sustained period, the number of **read and write partitions** is increased proportionally.
- If QPS falls below a downscale threshold, only the **write partitions** are reduced initially. The system then waits for drained partitions to clear before reducing the number of **read partitions**, ensuring backlog-free downscaling.
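
The decision itself can be pictured roughly as below. The function name and the exact formula are illustrative assumptions, not the real implementation; in particular, the real scaler only acts after QPS has stayed beyond a threshold for the configured sustained duration, which this sketch omits.

```go
package main

import (
	"fmt"
	"math"
)

// desiredWritePartitions sketches how a target partition count could follow
// from observed traffic: the upscale threshold grows with the current
// partition count, and the downscale threshold sits below it by a hysteresis
// factor so the scaler does not oscillate around a single boundary.
func desiredWritePartitions(qps float64, current int, upscaleRPS, downscaleFactor float64) int {
	upscaleThreshold := float64(current) * upscaleRPS
	downscaleThreshold := upscaleThreshold * downscaleFactor

	switch {
	case qps > upscaleThreshold:
		// Grow proportionally to the observed traffic.
		return int(math.Ceil(qps / upscaleRPS))
	case qps < downscaleThreshold && current > 1:
		// Shrink, but never below one partition. Read partitions are only
		// reduced later, once the dropped write partitions have drained.
		if target := int(math.Ceil(qps / upscaleRPS)); target > 1 {
			return target
		}
		return 1
	default:
		return current
	}
}

func main() {
	// Hypothetical numbers: 3 partitions, 200 QPS allowed per partition.
	fmt.Println(desiredWritePartitions(750, 3, 200, 0.75)) // prints 4
}
```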

---

## Enabling Adaptive Tasklist Scaler

### Prerequisites

To use this feature, upgrade Cadence to [v1.3.0 or later](https://github.com/cadence-workflow/cadence/tree/v1.3.0).

Also, migrate tasklist partition configurations to the database using [this guide](https://github.com/cadence-workflow/cadence/blob/v1.3.0/docs/migration/tasklist-partition-config.md).

### Configuration

The scaler is governed by the following dynamic configuration parameters:

- `matching.enableAdaptiveScaler`: Enables the scaler at the tasklist level.
- `matching.partitionUpscaleSustainedDuration`: Duration that QPS must stay above the threshold before triggering an upscale.
- `matching.partitionDownscaleSustainedDuration`: Duration that QPS must stay below the threshold before triggering a downscale.
- `matching.adaptiveScalerUpdateInterval`: Frequency at which the scaler evaluates and updates partition counts.
- `matching.partitionUpscaleRPS`: QPS threshold per partition that triggers an upscale.
- `matching.partitionDownscaleFactor`: Factor applied to introduce hysteresis, lowering the QPS threshold for downscaling to avoid oscillations.
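
As a purely illustrative example (hypothetical values, not recommended defaults): with `matching.partitionUpscaleRPS` set to 200 and a tasklist currently at 3 partitions, the upscale threshold is 3 × 200 = 600 QPS; with `matching.partitionDownscaleFactor` set to 0.75, the downscale threshold is 600 × 0.75 = 450 QPS. Traffic that stays between 450 and 600 QPS leaves the partition count unchanged, which is the hysteresis band that prevents oscillation.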

---

## Monitoring and Metrics

Several metrics have been introduced to help monitor the scaler’s behavior:

### QPS and Thresholds

- `estimated_add_task_qps_per_tl`: Estimated QPS of task additions per tasklist.
- `tasklist_partition_upscale_threshold`: Upscale threshold for task additions.
- `tasklist_partition_downscale_threshold`: Downscale threshold for task additions.

> The `estimated_add_task_qps_per_tl` value should remain between the upscale and downscale thresholds. If not, the scaler may not be functioning properly.

### Partition Configurations

- `task_list_partition_config_num_read`: Number of current **read** partitions.
- `task_list_partition_config_num_write`: Number of current **write** partitions.
- `task_list_partition_config_version`: Version of the current partition configuration.

These metrics are emitted by various components: root and non-root partitions, pollers, and producers. Their values should align under normal conditions, except immediately after updates.
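
For example, if a poller keeps reporting a `task_list_partition_config_version` lower than the one emitted by the root partition long after an update, the new configuration is likely not propagating to that poller and is worth investigating.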

---

## Status at Uber

We enabled the adaptive tasklist scaler across all Uber clusters in **March 2025**. Since its deployment:

- **Zero incidents** have been reported due to misconfigured tasklists.
- **Operational workload** related to manual scaling has been eliminated.
- **Scalability and resilience** of the Matching service have improved significantly.