Commit c4d09e6
RFC#0192 - ensures workers do not get unnecessarily killed
1 parent 7cf7165 commit c4d09e6

File tree

3 files changed: +135 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -69,3 +69,4 @@ See [mechanics](mechanics.md) for more detail.
  | RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
  | RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
  | RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
+ | RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed

* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: [@JohanLorenzo](https://github.com/JohanLorenzo)

# Summary

Optimize worker pools with `minCapacity >= 1` by introducing minCapacity workers that avoid unnecessary shutdown/restart cycles, preserving caches and reducing task wait times.

## Motivation

Currently, workers in pools with `minCapacity >= 1` exhibit wasteful behavior:

1. **Cache Loss**: Workers shut down after idle timeout (600 seconds for decision pools), losing valuable caches:
   - VCS repositories and history
   - Package manager caches (npm, pip, cargo, etc.)
   - Container images and layers

2. **Provisioning Delays**: New worker provisioning [takes ~75 seconds on average for decision pools](https://taskcluster.github.io/mozilla-history/worker-metrics), during which tasks must wait

3. **Resource Waste**: The current cycle of shutdown → detection → spawn → provision → register wastes compute resources and increases task latency

4. **Violation of `minCapacity` Intent**: `minCapacity >= 1` suggests these pools should always have capacity available, but the current implementation allows temporary capacity gaps

# Details

## Current Behavior Analysis

**Affected Worker Pools:**
- Direct `minCapacity: 1`: [`infra/build-decision`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L220), [`code-review/bot-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L3320)
- Keyed `minCapacity: 1`: [`gecko-1/decision-gcp`, `gecko-3/decision-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L976), and [pools matching `(app-services|glean|mozillavpn|mobile|mozilla|translations)-1`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L1088)

**Current Implementation Issues:**
- Worker-manager enforces `minCapacity` by spawning new workers when capacity drops below the threshold
- Generic-worker shuts down after `idleTimeoutSecs` regardless of `minCapacity` requirements
- A gap exists between worker shutdown and replacement detection/provisioning

## Proposed Solution

### Core Concept: `minCapacity` Workers

Workers fulfilling `minCapacity >= 1` requirements should never self-terminate. This is achieved through a two-phase implementation: Phase 1 prevents minCapacity workers from self-terminating, and Phase 2 makes worker-manager the central authority for all worker termination decisions.

### Phase 1: Prevent MinCapacity Worker Self-Termination

#### 1. Automatic Activation

**Trigger:** Automatically enabled when `minCapacity >= 1` (no configuration flag needed)

Workers spawned when `runningCapacity < minCapacity` receive `idleTimeoutSecs=0` and never self-terminate.

#### 2. Worker Config Injection

Worker-manager sets `idleTimeoutSecs=0` when spawning minCapacity workers. This makes [generic-worker never terminate by itself](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/workers/generic-worker/main.go#L536-L542).
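
To make the spawn-time behavior concrete, here is a minimal TypeScript sketch of the injection logic. The types and the `buildSpawnRequest` helper are hypothetical illustrations rather than worker-manager's actual internals; the one behavior taken from this RFC is that workers spawned while `runningCapacity < minCapacity` receive `idleTimeoutSecs=0`.

```typescript
// Illustrative shapes; the real worker-manager structures differ.
interface WorkerPoolConfig {
  minCapacity: number;
  maxCapacity: number;
  idleTimeoutSecs: number; // pool default, e.g. 600 for decision pools
}

interface SpawnRequest {
  workerConfig: { idleTimeoutSecs: number };
  isMinCapacityWorker: boolean;
}

// Decide the worker config to inject at spawn time (Phase 1).
function buildSpawnRequest(pool: WorkerPoolConfig, runningCapacity: number): SpawnRequest {
  // Workers that exist to satisfy minCapacity must never self-terminate, so
  // they are spawned with idleTimeoutSecs=0, which generic-worker treats as
  // "never terminate by itself".
  const isMinCapacityWorker = pool.minCapacity >= 1 && runningCapacity < pool.minCapacity;
  return {
    workerConfig: { idleTimeoutSecs: isMinCapacityWorker ? 0 : pool.idleTimeoutSecs },
    isMinCapacityWorker,
  };
}
```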

#### 3. Capacity Management

**Removing Excess Capacity:**

When `runningCapacity > minCapacity`, the [worker-manager scanner identifies](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) and terminates excess workers.

**Termination logic** (sketched below):
- Query the [Queue API client](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/main.js#L176-L178) to check whether the worker's latest task is running
- Select the oldest idle workers first (by `worker.created` timestamp)
- Use the existing `removeWorker()` methods to terminate the worker ([Google `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/google.js#L168-L192), [AWS `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/aws.js#L382-L418), [Azure `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/azure/index.js#L1221-L1233))
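
A minimal sketch of that termination pass, assuming simplified `Worker`, `QueueLike`, and `ProviderLike` shapes (the real worker-manager data structures differ):

```typescript
interface Worker {
  workerId: string;
  created: Date;          // spawn time
  capacity: number;       // capacity units this worker provides
  latestTaskId?: string;  // most recent task claimed by this worker
}

interface QueueLike {
  // Returns true if the worker's latest task is still running.
  isTaskRunning(taskId: string): Promise<boolean>;
}

interface ProviderLike {
  removeWorker(worker: Worker): Promise<void>; // cloud-specific termination
}

async function trimExcessCapacity(
  workers: Worker[],
  minCapacity: number,
  queue: QueueLike,
  provider: ProviderLike,
): Promise<void> {
  let runningCapacity = workers.reduce((sum, w) => sum + w.capacity, 0);
  // Oldest workers first, per this RFC's selection rule.
  const byAge = [...workers].sort((a, b) => a.created.getTime() - b.created.getTime());
  for (const worker of byAge) {
    if (runningCapacity <= minCapacity) break;
    // Never terminate a worker that is actively running a task.
    const busy =
      worker.latestTaskId !== undefined && (await queue.isTaskRunning(worker.latestTaskId));
    if (busy) continue;
    await provider.removeWorker(worker);
    runningCapacity -= worker.capacity;
  }
}
```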

## Error Handling and Edge Cases

**Worker Lifecycle Management:**
- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration
- **Graceful transitions**: When possible, only terminate idle workers to preserve active caches
- **Resource allocation**: minCapacity workers are mixed with other workers on the same infrastructure

**Launch Configuration Changes:**

When a launch configuration is changed, removed, or archived, all workers created from the old configuration must be terminated and replaced (see the sketch after this list):
- If a launch configuration is archived (not present in the new configuration), identify all long-running workers created from it
- Terminate these workers via cloud provider APIs after checking for running tasks
- Worker-manager will spawn new workers using the updated launch configuration
- This ensures workers always run with the current configuration and prevents indefinite use of outdated configurations
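
The following TypeScript sketch shows one way to implement this cleanup loop. The `TrackedWorker` and `LaunchConfig` shapes and the callback parameters are assumptions for illustration; the policy itself (terminate only after the running-task check, let worker-manager respawn from the new configuration) is the RFC's.

```typescript
interface LaunchConfig { launchConfigId: string }

interface TrackedWorker {
  workerId: string;
  launchConfigId: string; // configuration this worker was created from
  latestTaskId?: string;
}

async function retireArchivedConfigWorkers(
  workers: TrackedWorker[],
  currentConfigs: LaunchConfig[],
  isTaskRunning: (taskId: string) => Promise<boolean>,
  removeWorker: (w: TrackedWorker) => Promise<void>,
): Promise<void> {
  const liveIds = new Set(currentConfigs.map((c) => c.launchConfigId));
  for (const worker of workers) {
    if (liveIds.has(worker.launchConfigId)) continue; // config still current
    // Only terminate once the worker is no longer running a task; worker-manager
    // then spawns replacements from the updated launch configuration.
    const busy =
      worker.latestTaskId !== undefined && (await isTaskRunning(worker.latestTaskId));
    if (!busy) await removeWorker(worker);
  }
}
```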

## Compatibility Considerations

- Automatic activation when `minCapacity >= 1` (no opt-in flag needed)
- Existing pools continue their current behavior; minCapacity workers automatically stop self-terminating
- No changes to generic-worker's idle timeout mechanism

### Phase 2: Centralized Termination Authority

**Goal:** Make worker-manager the sole authority for all worker termination decisions by removing worker self-termination entirely.

#### Implementation Changes

**1. Remove Worker Idle Timeout Code**

Remove the [idle timeout mechanism](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/workers/generic-worker/main.go#L536-L542) from generic-worker:
Workers run indefinitely until worker-manager terminates them. This means worker-manager stops sending `idleTimeoutSecs` to workers at spawn time.

**2. Centralized Idle Enforcement**

Worker-manager enforces the idle timeout using the existing [`queueInactivityTimeout` from the lifecycle configuration](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/schemas/v1/worker-lifecycle.yml#L33-L50) and through `idleTimeoutSecs` (as specified in Phase 1).

The [scanner polls](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) the Queue API to track worker idle time (sketched below):
- Get the latest task from `worker.recentTasks`
- Call `queue.status(taskId)` to check if the task is running
- Calculate idle time from the task's `resolved` timestamp
- Terminate when idle time exceeds `queueInactivityTimeout`
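
A sketch of that polling logic in TypeScript, with a deliberately simplified task-status shape (the real `queue.status` response nests state and run details); the helper names are illustrative:

```typescript
interface TaskStatus {
  state: 'unscheduled' | 'pending' | 'running' | 'completed' | 'failed' | 'exception';
  resolved?: string; // ISO timestamp of when the latest run resolved
}

interface ScannedWorker {
  workerId: string;
  recentTasks: string[]; // taskIds, newest last
}

async function enforceIdleTimeout(
  worker: ScannedWorker,
  queueInactivityTimeoutMs: number,
  queueStatus: (taskId: string) => Promise<TaskStatus>,
  terminate: (workerId: string) => Promise<void>,
): Promise<void> {
  const latestTaskId = worker.recentTasks[worker.recentTasks.length - 1];
  if (latestTaskId === undefined) return; // never claimed a task; out of scope here
  const status = await queueStatus(latestTaskId);
  if (status.state === 'running') return; // busy workers are never idle
  if (status.resolved === undefined) return;
  // Idle time = how long ago the latest task resolved.
  const idleMs = Date.now() - new Date(status.resolved).getTime();
  if (idleMs > queueInactivityTimeoutMs) {
    await terminate(worker.workerId);
  }
}
```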

**3. Termination Decision Factors**

Worker-manager terminates workers when:
- Idle timeout exceeded (`queueInactivityTimeout`)
- Capacity exceeds `maxCapacity`
- Capacity exceeds `minCapacity` (terminate oldest first)
- Launch configuration changed/archived
- Worker is unhealthy (provider-specific check)
All terminations check for running tasks before proceeding.
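
Taken together, these factors amount to a single decision function. A hedged sketch, with all field names illustrative rather than worker-manager's actual code:

```typescript
interface TerminationContext {
  hasRunningTask: boolean;            // always checked first, per the sentence above
  idleMs: number;
  queueInactivityTimeoutMs: number;
  runningCapacity: number;
  minCapacity: number;
  maxCapacity: number;
  isOldestIdleWorker: boolean;        // oldest-first rule for excess capacity
  launchConfigArchived: boolean;
  unhealthy: boolean;                 // provider-specific health check
}

function shouldTerminate(ctx: TerminationContext): boolean {
  // Per this RFC, a worker with a running task is never terminated.
  if (ctx.hasRunningTask) return false;
  if (ctx.idleMs > ctx.queueInactivityTimeoutMs) return true;
  if (ctx.runningCapacity > ctx.maxCapacity) return true;
  if (ctx.runningCapacity > ctx.minCapacity && ctx.isOldestIdleWorker) return true;
  if (ctx.launchConfigArchived) return true;
  if (ctx.unhealthy) return true;
  return false;
}
```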

**Migration:**

Deploy as a breaking change requiring simultaneous worker-manager and generic-worker updates.

# Implementation

<Once the RFC is decided, these links will provide readers a way to track the
implementation through to completion, and to know if they are running a new
enough version to take advantage of this change. It's fine to update this
section using short PRs or pushing directly to master after the RFC is
decided>

* <link to tracker bug, issue, etc.>
* <...>
* Implemented in Taskcluster version ...

rfcs/README.md

Lines changed: 1 addition & 0 deletions
@@ -57,3 +57,4 @@
  | RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](0182-taskcluster-yml-remote-references.md) |
  | RFC#189 | [Batch APIs for task definition, status and index path](0189-batch-task-apis.md) |
  | RFC#191 | [Worker Manager launch configurations](0191-worker-manager-launch-configs.md) |
+ | RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
