Commit a80f99d

RFC#0192 - ensures workers do not get unnecessarily killed

1 parent 7cf7165
3 files changed: +201 −0

README.md

Lines changed: 1 addition & 0 deletions
@@ -69,3 +69,4 @@ See [mechanics](mechanics.md) for more detail.
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
+ | RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

Lines changed: 199 additions & 0 deletions

@@ -0,0 +1,199 @@
# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed

* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @johanlorenzo

# Summary

Optimize worker pools with `minCapacity >= 1` by implementing minimum capacity workers that avoid unnecessary shutdown/restart cycles, preserving caches and reducing task wait times.

## Motivation

Currently, workers in pools with `minCapacity >= 1` exhibit wasteful behavior:

1. **Cache Loss**: Workers shut down after their idle timeout (600 seconds for decision pools), losing valuable caches:
   - VCS repositories and history
   - Package manager caches (npm, pip, cargo, etc.)
   - Container images and layers

2. **Provisioning Delays**: Provisioning a new worker [takes ~75 seconds on average for decision pools](https://taskcluster.github.io/mozilla-history/worker-metrics), during which tasks must wait

3. **Resource Waste**: The current cycle of shutdown → detection → spawn → provision → register wastes compute resources and increases task latency

4. **Violation of `minCapacity` Intent**: `minCapacity >= 1` suggests these pools should always have capacity available, but the current implementation allows temporary capacity gaps

# Details

## Current Behavior Analysis

**Affected Worker Pools:**
- Direct `minCapacity: 1`: [`infra/build-decision`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L220), [`code-review/bot-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L3320)
- Keyed `minCapacity: 1`: [`gecko-1/decision-gcp`, `gecko-3/decision-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L976), and [pools matching `(app-services|glean|mozillavpn|mobile|mozilla|translations)-1`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L1088)

**Current Implementation Issues:**
- Worker-manager enforces `minCapacity` by spawning new workers when capacity drops below the threshold
- Generic-worker shuts down after `idleTimeoutSecs` regardless of `minCapacity` requirements
- A gap exists between worker shutdown and replacement detection/provisioning

## Proposed Solution

### Core Concept: `minCapacity` Workers

Workers fulfilling `minCapacity >= 1` requirements should receive significantly longer idle timeouts to preserve caches.

### Implementation Approach

#### 1. Worker Identification and Tagging

**Worker Provisioning Logic:**
- Worker-manager determines a worker's idle timeout configuration at spawn time, based on current pool capacity
- **No Special Replacement Logic**: Failed workers are handled through normal provisioning cycles, not immediate replacement

#### 2. Lifecycle-Based Worker Configuration

Workers are identified with a boolean flag at launch time indicating whether they fulfill minCapacity requirements. This approach works within Taskcluster's architectural constraints, where worker-manager cannot communicate configuration changes to running workers.

**Launch-Time Configuration:**
Worker-manager decides at spawn time, based on current pool capacity needs, whether a worker fulfills `minCapacity`. Workers fulfilling `minCapacity` requirements are marked with a boolean flag and receive an indefinite idle timeout (0 seconds). Additional workers beyond `minCapacity` receive standard idle timeouts (600 seconds by default).
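
To make the launch-time decision concrete, here is a minimal sketch in JavaScript (worker-manager's implementation language). The helper name `chooseWorkerConfig` and the input/output shapes are assumptions for illustration, not actual worker-manager internals:

```js
// Hypothetical spawn-time decision (names and shapes are illustrative).
// Workers that fill the minCapacity floor get an indefinite idle timeout (0);
// workers beyond the floor keep the standard default (600 seconds).
function chooseWorkerConfig({ poolConfig, runningCapacity }) {
  const isMinCapacityWorker =
    poolConfig.enableMinCapacityWorkers &&
    runningCapacity < poolConfig.minCapacity;
  return {
    isMinCapacityWorker,                             // boolean flag, fixed at launch
    idleTimeoutSecs: isMinCapacityWorker ? 0 : 600,  // never changed at runtime
  };
}
```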

**Immutable Worker Behavior:**
- Workers receive their complete configuration at startup
- No runtime configuration changes
- A worker's idle timeout never changes during its lifetime
- `minCapacity` requirements are fulfilled through worker replacement, not reconfiguration

#### 3. Pool Configuration

**New Configuration Options:**
Pools gain a new boolean flag, `enableMinCapacityWorkers`, that enables minCapacity worker behavior. When enabled, workers fulfilling minCapacity requirements run indefinitely (idle timeout set to 0), providing maximum cache preservation.
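
For illustration, the flag would sit alongside the existing capacity settings in a pool definition. A sketch of the intended shape, not a final schema:

```js
// Sketch of a worker pool config with the proposed flag (shape is illustrative).
const workerPoolConfig = {
  minCapacity: 1,                  // existing setting: capacity floor
  maxCapacity: 10,                 // existing setting: capacity ceiling
  enableMinCapacityWorkers: true,  // proposed flag; omitted/false = current behavior
  // ...launch configurations, lifecycle settings, etc. remain unchanged
};
```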

**Validation:**
- Fail pool provisioning entirely on invalid configuration (e.g., `minCapacity > maxCapacity`); see the sketch below
- The boolean flag is optional and defaults to `false` for backward compatibility
- When `enableMinCapacityWorkers` is `true`, minCapacity workers receive indefinite runtime (idle timeout = 0)
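
A sketch of the validation rule, assuming a hypothetical helper in the pool-provisioning path (`validatePoolConfig` is illustrative, not actual schema code):

```js
// Hypothetical validation at pool provisioning time: reject invalid
// configurations outright rather than provisioning a broken pool.
function validatePoolConfig({ minCapacity, maxCapacity, enableMinCapacityWorkers = false }) {
  if (minCapacity > maxCapacity) {
    throw new Error(
      `invalid pool config: minCapacity (${minCapacity}) exceeds maxCapacity (${maxCapacity})`);
  }
  return { minCapacity, maxCapacity, enableMinCapacityWorkers };
}
```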

#### 4. Enhanced Provisioning Logic

**Provisioning Strategy:**
The enhanced estimator reuses the existing `minCapacity` enforcement logic but provisions workers with different idle timeout configurations based on pool needs. When total running capacity falls below `minCapacity` and `enableMinCapacityWorkers` is enabled, new workers are spawned with indefinite runtime (idle timeout = 0) to serve as long-lived minCapacity workers. Additional workers beyond `minCapacity` are spawned with standard idle timeouts. This demand-based approach avoids over-provisioning and works within the existing estimator logic.

**Current Estimator Implementation:** The [Estimator.simple() method](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L7-L84) enforces `minCapacity` as a floor at [lines 35-39](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L35-L39)
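
A simplified sketch of how the floor logic might be extended. The real `Estimator.simple()` takes more inputs (scaling ratio, stopping capacity, etc.); the function below only illustrates the split between long-lived and standard workers, and all names are assumptions:

```js
// Simplified estimator sketch (capping at maxCapacity omitted for brevity).
// minCapacity stays a floor on total capacity, as in the current estimator;
// only the idle timeout assigned to each spawned worker changes.
function estimateSpawns({ minCapacity, desiredCapacity, runningCapacity, enableMinCapacityWorkers }) {
  const spawns = [];
  let capacity = runningCapacity;
  // Fill the minCapacity floor first.
  while (capacity < minCapacity) {
    // With the flag on, floor-filling workers never time out while idle.
    spawns.push({ idleTimeoutSecs: enableMinCapacityWorkers ? 0 : 600 });
    capacity += 1;
  }
  // Demand beyond the floor gets standard, self-terminating workers.
  while (capacity < desiredCapacity) {
    spawns.push({ idleTimeoutSecs: 600 });
    capacity += 1;
  }
  return spawns;
}
```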

**Capacity Management:**
- **minCapacity increases**: Spawn additional minCapacity workers with indefinite runtime when `enableMinCapacityWorkers` is enabled
- **minCapacity decreases**: Use quarantine-based two-phase removal to safely terminate excess minCapacity workers (see the Worker Removal section below)
- **Worker failures**: Failed workers are handled through normal provisioning cycles when total capacity falls below minCapacity
- **Pending tasks check**: Workers should not be terminated while there are pending tasks they could execute

**Capacity Reduction Strategy:**
When `minCapacity` decreases, excess minCapacity workers are removed using a two-phase quarantine approach to prevent task claim expiration. Workers with indefinite idle timeouts cannot have their configuration changed at runtime, so they must be terminated and replaced with standard workers when capacity needs to decrease.

#### 5. Worker Removal Using Quarantine Mechanism

**Existing Quarantine System:**
Taskcluster's Queue service provides a [quarantine mechanism](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/queue/src/api.js#L2492-L2545) that prevents workers from claiming new tasks while keeping them alive. When a worker is quarantined:
- The worker cannot claim new tasks from the queue
- The worker's capacity is not counted toward pool capacity
- The worker remains alive until the quarantine period expires
- This prevents task claim expiration errors when removing workers
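
From worker-manager's side, quarantining an excess worker might look roughly like this, using the `taskcluster-client` Queue client. This is a sketch: error handling is omitted, and the exact payload fields should be checked against the deployed Queue version:

```js
const taskcluster = require('taskcluster-client');

const queue = new taskcluster.Queue({ rootUrl: process.env.TASKCLUSTER_ROOT_URL });

// Sketch: quarantine one excess minCapacity worker so it stops claiming
// tasks but stays alive until a later scanner run terminates it.
async function quarantineExcessWorker({ provisionerId, workerType, workerGroup, workerId }) {
  await queue.quarantineWorker(provisionerId, workerType, workerGroup, workerId, {
    quarantineUntil: taskcluster.fromNow('1 year'),  // effectively "until terminated"
    quarantineInfo: 'minCapacity reduction',         // reason checked in Phase 2
  });
}
```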

**Two-Phase Removal Process:**

When minCapacity workers need to be removed (e.g., when `minCapacity` decreases or a launch configuration changes), worker-manager uses a two-phase, quarantine-based approach:

**Phase 1 - Quarantine (First Scanner Run):**
1. Worker-manager counts all minCapacity workers in the pool
2. When the count exceeds the pool's `minCapacity` setting, identify the excess workers
3. Call the Queue service's [`quarantineWorker` API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/queue/src/api.js#L2519-L2525) for each excess worker
4. Set `quarantineInfo` to 'minCapacity reduction' to document the reason
5. Quarantined workers immediately stop claiming new tasks

**Phase 2 - Termination (Next Scanner Run):**
1. Worker-manager checks each quarantined worker for three conditions:
   - The worker is marked as a minCapacity worker (boolean flag is true)
   - The worker was quarantined with reason 'minCapacity reduction'
   - The worker is not currently running any tasks
2. If all conditions are met, forcefully terminate the worker via the cloud provider API
3. If any condition is not met, wait for the next scanner run
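
The Phase 2 guard is then a straightforward check on the next scanner pass. In this sketch, `worker.isMinCapacityWorker`, `worker.runningTasks`, and the quarantine lookup stand in for the real data model; `provider.removeWorker({worker, reason})` mirrors the provider methods linked under Technical Capability below:

```js
// Sketch of the Phase 2 termination guard (field names are illustrative).
// A worker is only terminated when all three RFC conditions hold.
async function maybeTerminateQuarantined({ worker, quarantine, provider }) {
  const safeToRemove =
    worker.isMinCapacityWorker === true &&                    // flagged at launch
    quarantine.quarantineInfo === 'minCapacity reduction' &&  // quarantined in Phase 1
    worker.runningTasks === 0;                                // not mid-task

  if (safeToRemove) {
    await provider.removeWorker({ worker, reason: 'minCapacity reduction' });
  }
  // Otherwise leave the worker quarantined and re-check on the next scanner run.
}
```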

**Alternative Approach:**
Immediate removal on the first scan would be simpler to implement but risks task claim expiration if a worker picks up a task at the moment it is terminated. The two-phase approach is preferred because it guarantees safe removal.

**Technical Capability:**
Worker-manager has direct cloud provider API access for forceful termination:
- [Worker.states.STOPPING definition](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/data.js#L858-L863)
- [AWS removeWorker](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L382-L418)
- [Google removeWorker](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L168-L192)
- [Azure removeWorker](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L1221-L1233)

#### 6. Monitoring

Existing worker-manager metrics are sufficient for monitoring this feature. The existing `*_queue_running_workers` metrics already track active workers by pool. Implementation-specific monitoring details will be determined during development.

## Rollout Strategy

**Phase 1**: Single Pool Testing
- Enable on one stable test pool (e.g., `gecko-1/decision-gcp` for the try branch)
- Monitor success metrics, cache preservation, and termination frequency
- Validate that the pool's benefits outweigh the costs of occasional forceful termination

**Phase 2**: Gradual Expansion
- Roll out to the remaining decision pools with stable `minCapacity` requirements
- Prioritize pools where `minCapacity` changes are infrequent
- Monitor the trade-off between cache benefits and capacity management responsiveness

**Phase 3**: Full Deployment
- Enable for all eligible pools after validating a net positive impact
- Continue monitoring optimization effectiveness across different usage patterns
- Consider disabling for pools where `minCapacity` changes too frequently

## Error Handling and Edge Cases

**Worker Lifecycle Management:**
- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration
- **Graceful transitions**: When possible, only terminate idle workers to preserve active caches
- **Resource allocation**: minCapacity workers are mixed with other workers on the same infrastructure

**Launch Configuration Changes:**
When a launch configuration is changed, removed, or archived, all workers created from the old configuration must be terminated and replaced:
- If a launch configuration is archived (not present in the new configuration), identify all long-running workers created from it
- Use the two-phase quarantine process to safely terminate these workers
- Worker-manager spawns new workers using the updated launch configuration
- This ensures workers always run with the current configuration and prevents indefinite use of outdated configurations

**Capacity Reduction Strategy:**
When `minCapacity` decreases, excess minCapacity workers are removed using the two-phase quarantine process to prevent task claim expiration. Idle workers are prioritized for termination to preserve active caches when possible.

## Compatibility Considerations

**Backward Compatibility:**
- Opt-in via the `enableMinCapacityWorkers` boolean flag (defaults to `false`)
- Existing pools continue their current behavior unless explicitly enabled
- No changes to generic-worker's core idle timeout mechanism

**Future Direction:**
Once this behavior is proven stable and beneficial, it may become the default for all pools with `minCapacity >= 1`. The boolean flag provides a transition period to validate the approach before making it the standard.

**API Changes:**
- Enhanced worker pool configuration schema
- Existing cloud provider termination APIs used for capacity management

**Security Implications:**
- No security changes: the same authentication and authorization flows apply
- Longer-lived workers maintain the same credential rotation schedule

**Performance Implications:**
- **Positive**: Reduced task wait times, preserved caches, fewer API calls
- **Negative**: Slightly higher resource usage for idle `minCapacity` workers
- **Net**: Expected improvement in overall system efficiency
# Implementation

<Once the RFC is decided, these links will provide readers a way to track the
implementation through to completion, and to know if they are running a new
enough version to take advantage of this change. It's fine to update this
section using short PRs or pushing directly to master after the RFC is
decided>

* <link to tracker bug, issue, etc.>
* <...>
* Implemented in Taskcluster version ...
rfcs/README.md

Lines changed: 1 addition & 0 deletions

@@ -57,3 +57,4 @@
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](0191-worker-manager-launch-configs.md) |
+ | RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
