Commit 4f9302d

RFC#0192 - ensures workers do not get unnecessarily killed

1 parent 7cf7165 commit 4f9302d

3 files changed: +259 −0 lines changed
README.md

Lines changed: 1 addition & 0 deletions

@@ -69,3 +69,4 @@ See [mechanics](mechanics.md) for more detail.
 | RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
 | RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
 | RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
+| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

Lines changed: 257 additions & 0 deletions

@@ -0,0 +1,257 @@
# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @johanlorenzo

# Summary

Optimize worker pools with `minCapacity >= 1` by implementing minimum capacity workers that avoid unnecessary shutdown/restart cycles, preserving caches and reducing task wait times.
## Motivation

Currently, workers in pools with `minCapacity >= 1` exhibit wasteful behavior:

1. **Cache Loss**: Workers shut down after the idle timeout (600 seconds for decision pools), losing valuable caches:
   - VCS repositories and history
   - Package manager caches (npm, pip, cargo, etc.)
   - Container images and layers

2. **Provisioning Delays**: Provisioning a new worker takes ~73-74 seconds on average for decision pools, during which tasks must wait

3. **Resource Waste**: The current cycle of shutdown → detection → spawn → provision → register wastes compute resources and increases task latency

4. **Violation of MinCapacity Intent**: `minCapacity >= 1` suggests these pools should always have capacity available, but the current implementation allows temporary capacity gaps
# Details

## Current Behavior Analysis

**Affected Worker Pools:**
- Direct `minCapacity: 1`: [`infra/build-decision`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L220), [`code-review/bot-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L3320)
- Keyed `minCapacity: 1`: [`gecko-1/decision-gcp`, `gecko-3/decision-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L976), and [pools matching `(app-services|glean|mozillavpn|mobile|mozilla|translations)-1`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L1088)

**Current Implementation Issues:**
- Worker-manager enforces minCapacity by spawning new workers when capacity drops below the threshold
- Generic-worker shuts down after `afterIdleSeconds` regardless of minCapacity requirements
- A gap exists between worker shutdown and replacement detection/provisioning
## Proposed Solution

### Core Concept: MinCapacity Workers

Workers fulfilling `minCapacity >= 1` requirements should receive significantly longer idle timeouts to preserve caches.

### Implementation Approach

#### 1. Worker Identification and Tagging

**Worker Provisioning Logic:**
- Worker-manager determines a worker's idle timeout configuration at spawn time, based on current pool capacity
- Workers are launched with a predetermined `intended_role`, used for tracking purposes only (see the sketch below)
- Uses the existing [`launch_config_id` system](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/data.js#L313-L362) to ensure per-variant capacity tracking
- **No Special Replacement Logic**: Failed workers are handled through normal provisioning cycles, not immediate replacement
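A minimal sketch of what worker-manager could record at spawn time; the `intendedRole` field is the proposal's addition, and every other name here is an illustrative assumption rather than the current worker-manager schema:

```js
// Hypothetical worker record written at spawn time (all field names illustrative):
const instanceId = 'i-0123456789abcdef0';   // cloud-assigned instance ID
const launchConfigId = 'lc-example';        // existing per-variant capacity tracking
const workerRecord = {
  workerPoolId: 'gecko-1/decision-gcp',
  workerId: instanceId,
  launchConfigId,
  intendedRole: 'min-capacity',             // proposed: recorded for tracking only
  providerData: { afterIdleSeconds: 0 },    // 0 = run indefinitely
};
```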
#### 2. Lifecycle-Based Worker Configuration

Workers are assigned minCapacity or overflow roles at launch time and never change configuration during their lifetime. This approach works within Taskcluster's architectural constraints: worker-manager cannot communicate configuration changes to running workers.

**Launch-Time Configuration:**
Worker-manager determines a worker's role at spawn time based on current pool capacity needs. Workers fulfilling minCapacity requirements receive the pool's configured `minCapacityIdleTimeoutSecs` (0 for indefinite runtime) and are tagged with the 'min-capacity' role. Additional workers beyond minCapacity receive standard idle timeouts (default 600 seconds) and are tagged as 'overflow' workers. The sketch after the following list illustrates the two resulting configurations.

**Immutable Worker Behavior:**
- Workers receive their complete configuration at startup
- No runtime configuration changes
- Worker role and idle timeout never change during a worker's lifetime
- MinCapacity requirements are fulfilled through worker replacement, not reconfiguration
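A minimal sketch of the two immutable configurations baked in at launch, assuming the idle timeout is surfaced to the worker through its launch configuration (field names are illustrative):

```js
// Two immutable configurations, fixed for the worker's entire lifetime:
const minCapacityWorkerConfig = {
  intendedRole: 'min-capacity',
  afterIdleSeconds: 0,    // 0 = never idle out; caches survive indefinitely
};
const overflowWorkerConfig = {
  intendedRole: 'overflow',
  afterIdleSeconds: 600,  // standard idle timeout
};
```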
#### 3. Pool Configuration

**New Configuration Options:**
Pools will gain a new `minCapacityIdleTimeoutSecs` parameter that enables minCapacity worker behavior. Setting it to 0 means minCapacity workers run indefinitely (no idle timeout), providing maximum cache preservation (see the sketch after the validation rules below).

**Validation:**
- Fail pool provisioning entirely if the configuration is invalid (e.g., `minCapacity > maxCapacity`)
- Require `minCapacityIdleTimeoutSecs >= 0` if minCapacity workers are desired
- Setting `minCapacityIdleTimeoutSecs = 0` enables indefinite runtime for maximum cache preservation
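A hedged sketch of what an opted-in pool definition and its validation could look like; the placement of `minCapacityIdleTimeoutSecs` within the pool config is an assumption:

```js
// Hypothetical pool definition with the new parameter:
const poolConfig = {
  minCapacity: 1,
  maxCapacity: 10,
  minCapacityIdleTimeoutSecs: 0, // 0 = minCapacity workers run indefinitely
};

// Validation per the rules above; provisioning fails outright on bad config:
function validatePoolConfig({ minCapacity, maxCapacity, minCapacityIdleTimeoutSecs }) {
  if (minCapacity > maxCapacity) {
    throw new Error(`minCapacity (${minCapacity}) exceeds maxCapacity (${maxCapacity})`);
  }
  if (minCapacityIdleTimeoutSecs !== undefined &&
      (!Number.isInteger(minCapacityIdleTimeoutSecs) || minCapacityIdleTimeoutSecs < 0)) {
    throw new Error('minCapacityIdleTimeoutSecs must be an integer >= 0');
  }
}
```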
#### 4. Enhanced Provisioning Logic

**Provisioning Strategy:**
The enhanced estimator reuses the existing minCapacity enforcement logic but provisions workers with different idle timeout configurations based on pool needs. When total running capacity falls below minCapacity, new workers are spawned with indefinite runtime (`minCapacityIdleTimeoutSecs = 0`) to serve as long-lived minCapacity workers. Additional workers beyond minCapacity are spawned with standard idle timeouts. This demand-based approach avoids over-provisioning and works within the existing estimator logic, as sketched below.
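A minimal sketch of this role-assignment step, layered onto the existing capacity estimate (the function and field names are illustrative, not the estimator's actual API):

```js
// Decide each new worker's role and idle timeout during a provisioning cycle:
function planSpawns({ toSpawn, runningCapacity, minCapacity, minCapacityIdleTimeoutSecs }) {
  const plans = [];
  let projected = runningCapacity;
  for (let i = 0; i < toSpawn; i++) {
    if (projected < minCapacity) {
      // Fill the minCapacity floor with long-lived workers.
      plans.push({ intendedRole: 'min-capacity', afterIdleSeconds: minCapacityIdleTimeoutSecs });
    } else {
      // Everything beyond the floor is a standard overflow worker.
      plans.push({ intendedRole: 'overflow', afterIdleSeconds: 600 });
    }
    projected += 1;
  }
  return plans;
}
```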
**Current Estimator Implementation:** the [Estimator.simple() method](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L7-L84) enforces minCapacity as a floor at [lines 35-39](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L35-L39)

**Capacity Management:**
- **minCapacity increases**: Spawn additional minCapacity workers with indefinite runtime
- **minCapacity decreases**: Forcefully terminate excess minCapacity workers (see Capacity Reduction Trade-offs)
- **Worker failures**: Failed workers are handled through normal provisioning cycles when total capacity falls below minCapacity
- **Multi-region**: Pool-wide capacity management (any region can fulfill minCapacity)

**Capacity Reduction Trade-offs:**
When minCapacity decreases, excess minCapacity workers must be forcefully terminated, since:
- Worker-manager cannot communicate with running workers to reduce their idle timeouts
- Waiting for the natural idle timeout could take hours (defeating responsive capacity management)
- Forceful termination provides immediate capacity adjustment but loses cache benefits
#### 5. Forceful Termination for Capacity Management

**Technical Capability:**
Worker-manager has direct cloud provider API access and can forcefully terminate instances:
- **GCP**: [`workerId = instanceId` mapping](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L341), uses the [`compute.instances.delete()` API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L181-L185)
- **AWS**: [`workerId = instanceId` mapping](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L243), uses the [`TerminateInstancesCommand` API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L393-L395)
- **Azure**: [Worker ID maps to VM name](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L302), uses the [VM deletion API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L1249-L1254)

**Implementation:**
Worker-manager can already forcefully terminate workers: it marks them as STOPPING in the database, then makes direct cloud provider API calls to terminate the instances. This works across all supported cloud providers (GCP, AWS, Azure) using their respective termination APIs, roughly as sketched below.
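A rough sketch of that flow, modeled on the linked `removeWorker` implementations; the orchestrating function and the exact call signatures are assumptions:

```js
// Forcefully terminate one worker for capacity reduction:
async function forceTerminate(db, provider, worker) {
  // 1. Mark the worker STOPPING in the database so the estimator stops counting it.
  await worker.update(db, w => { w.state = Worker.states.STOPPING; });
  // 2. Ask the cloud provider to delete the instance (GCP/AWS/Azure all support this).
  await provider.removeWorker({ worker, reason: 'minCapacity reduced' });
}
```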
**Implementation References:**
- [Worker.states.STOPPING definition](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/data.js#L858-L863)
- [AWS removeWorker implementation](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L382-L418)
- [Google removeWorker implementation](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L168-L192)
- [Azure removeWorker implementation](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L1221-L1233)

**Trade-offs:**
- ✅ **Immediate capacity response**: minCapacity changes take effect in ~30 seconds
- ❌ **Cache loss**: Defeats the primary optimization goal when terminating active workers
- ⚠️ **Selective termination**: Prioritize terminating idle workers when possible
#### 6. Enhanced Health Monitoring

**MinCapacity Worker Health Checks:**
- More aggressive health monitoring for minCapacity workers
- Faster detection of unresponsive minCapacity workers
- Automatic detection and fallback if minCapacity workers cause issues

#### 7. Monitoring and Alerting

**Implementation Location:**
New metrics will be added to [services/worker-manager/src/monitor.js](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/monitor.js#L377) following the existing pattern established in [lines 246-377](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/monitor.js#L246-L377). These metrics will be displayed in the Taskcluster UI Status Dashboard for human operators and exposed via the Prometheus `/metrics` endpoint, so external monitoring systems can build custom dashboards and alerting.
**New Prometheus Metrics:**

1. **MinCapacity Worker Count** - Gauge tracking the number of workers fulfilling minCapacity requirements, labeled by worker state (running/stopping/requested)

2. **Overflow Worker Count** - Gauge tracking the number of workers beyond minCapacity requirements, labeled by worker state

3. **MinCapacity Worker Terminations** - Counter tracking total minCapacity workers terminated, labeled by termination reason (capacity-reduction/health-check/idle-timeout)

4. **MinCapacity Deficit Duration** - Histogram measuring time spent below the minCapacity threshold, with buckets covering 1 second to 10 minutes
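For illustration, the four metrics above expressed as `prom-client`-style definitions; the metric names are assumptions, and the real implementation would follow the MonitorManager pattern in monitor.js:

```js
const { Gauge, Counter, Histogram } = require('prom-client');

const minCapacityWorkers = new Gauge({
  name: 'min_capacity_workers',
  help: 'Workers fulfilling minCapacity requirements',
  labelNames: ['worker_pool', 'state'], // running/stopping/requested
});
const overflowWorkers = new Gauge({
  name: 'overflow_workers',
  help: 'Workers beyond minCapacity requirements',
  labelNames: ['worker_pool', 'state'],
});
const minCapacityTerminations = new Counter({
  name: 'min_capacity_worker_terminations_total',
  help: 'MinCapacity workers terminated',
  labelNames: ['worker_pool', 'reason'], // capacity-reduction/health-check/idle-timeout
});
const minCapacityDeficitSeconds = new Histogram({
  name: 'min_capacity_deficit_seconds',
  help: 'Time spent below the minCapacity threshold',
  buckets: [1, 5, 15, 30, 60, 120, 300, 600], // 1s .. 10min
});
```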
**Metrics Integration:**
Metrics will be exposed in [estimator.js](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L89-L92) using the existing `exposeMetrics()` pattern to report worker counts by pool and role during each provisioning cycle.

**Alerting Rules:**
Two alerting rules will monitor minCapacity worker health: a MinCapacityDeficit alert, triggered when a worker pool stays below minCapacity for more than 30 seconds, and a HighMinCapacityTerminations alert, triggered when the termination rate exceeds 0.1 workers per 5-minute window for more than 2 minutes. A possible encoding is sketched below.
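One possible Prometheus encoding of those two rules, assuming the metric names from the sketch above and that the configured floor is also exported as a metric (`worker_pool_min_capacity` is an assumption):

```yaml
groups:
  - name: worker-manager-min-capacity
    rules:
      - alert: MinCapacityDeficit
        expr: sum by (worker_pool) (min_capacity_workers{state="running"}) < worker_pool_min_capacity
        for: 30s
      - alert: HighMinCapacityTerminations
        expr: increase(min_capacity_worker_terminations_total[5m]) > 0.1
        for: 2m
```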
## Success Measurement

**Primary Metrics:**

1. **Task Wait Time Reduction** - Measure the 95th-percentile pending-to-running duration per worker pool, before and after implementation

2. **Worker Spawn Frequency Reduction** - Track the rate of new worker requests relative to total capacity changes to measure churn reduction

3. **MinCapacity Worker Stability** - Calculate the percentage of time pools maintain minCapacity without deficits

**Secondary Metrics:**

4. **Cache Preservation Effectiveness** - Compare 90th-percentile worker uptime between minCapacity and overflow workers to measure cache retention

5. **Termination Frequency Impact** - Monitor the rate of minCapacity worker terminations per hour

6. **Capacity Response Time** - Average time to fulfill minCapacity requirements after a capacity deficit occurs

**Success Criteria:**
- **Task wait time**: 20% reduction in P95 pending→running time
- **Worker churn**: 50% reduction in worker spawn frequency for minCapacity pools
- **Cache effectiveness**: 80% of minCapacity workers have >1 hour uptime
- **Deficit resolution**: <60 seconds average time to resolve minCapacity deficits
- **Termination cost**: <5% of minCapacity workers terminated for capacity reduction per day

**Trade-off Evaluation:**
A cost-benefit analysis will compare the benefit of reduced task wait times (time saved multiplied by task throughput) against the cost of cache loss from forced terminations (termination rate multiplied by average cache warmup time), as in the worked example below.
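A worked example with made-up placeholder numbers, just to show the shape of the comparison:

```js
// All inputs below are illustrative placeholders, not measurements:
const waitSavedSecsPerTask = 73;   // avoided provisioning delay (~73s average)
const tasksPerDay = 500;
const benefitSecs = waitSavedSecsPerTask * tasksPerDay;      // 36,500s saved/day

const forcedTerminationsPerDay = 2;
const cacheWarmupSecs = 300;       // time to re-clone repos, re-prime caches
const costSecs = forcedTerminationsPerDay * cacheWarmupSecs; // 600s lost/day

console.log(`net: ${benefitSecs - costSecs}s/day`); // positive ⇒ optimization pays off
```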
## Rollout Strategy

**Phase 1**: Single Pool Testing
- Enable on one stable test pool (e.g., `gecko-1/decision-gcp` for the try branch)
- Monitor success metrics, cache preservation, and termination frequency
- Validate that the pool's benefits outweigh the costs of occasional forceful termination

**Phase 2**: Gradual Expansion
- Roll out to the remaining decision pools with stable minCapacity requirements
- Prioritize pools where minCapacity changes are infrequent
- Monitor the trade-off between cache benefits and capacity management responsiveness

**Phase 3**: Full Deployment
- Enable for all eligible pools after validating a net positive impact
- Continue monitoring optimization effectiveness across different usage patterns
- Consider disabling for pools where minCapacity changes too frequently
## Error Handling and Edge Cases

**Worker Lifecycle Management:**
- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration
- **Graceful transitions**: When possible, terminate only idle workers to preserve active caches
- **Resource allocation**: MinCapacity workers are mixed with overflow workers on the same infrastructure

**Capacity Reduction Strategy:**
When minCapacity decreases, a hybrid approach minimizes cache loss: idle excess workers are terminated first, preserving active caches. If further capacity reduction is needed, busy workers are terminated immediately, oldest first, to maximize cache preservation for recently started workers. The selection order is sketched below.
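A minimal sketch of that selection order; the `idle` and `started` fields are illustrative stand-ins for however worker state is actually tracked:

```js
// Pick which minCapacity workers to terminate when the floor shrinks:
function pickWorkersToTerminate(excess, minCapacityWorkers) {
  const idle = minCapacityWorkers.filter(w => w.idle);
  const busyOldestFirst = minCapacityWorkers
    .filter(w => !w.idle)
    .sort((a, b) => a.started - b.started); // oldest busy workers go first
  return [...idle, ...busyOldestFirst].slice(0, excess);
}
```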
## Compatibility Considerations

**Backward Compatibility:**
- Opt-in via the `minCapacityIdleTimeoutSecs` configuration
- Existing pools keep their current behavior unless explicitly enabled
- No changes to generic-worker's core idle timeout mechanism

**API Changes:**
- New worker database field for role tracking
- Enhanced worker pool configuration schema
- Existing cloud provider termination APIs used for capacity management

**Security Implications:**
- No security changes - same authentication and authorization flows
- Longer-lived workers maintain the same credential rotation schedule

**Performance Implications:**
- **Positive**: Reduced task wait times, preserved caches, fewer API calls
- **Negative**: Slightly higher resource usage for idle minCapacity workers
- **Net**: Expected improvement in overall system efficiency

## Documentation Updates Required

**Configuration Schema:**
- [`/services/worker-manager/schemas/constants.yml`](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/schemas/constants.yml) - Add minCapacityIdleTimeoutSecs schema definitions

**Operational Documentation:**
- [`/services/worker-manager/README.md`](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/README.md) - Update provisioning loop diagrams for minCapacity worker behavior
- [`/services/worker-manager/docs/providers.md`](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/docs/providers.md) - Add minCapacity worker provisioning behavior

**Worker Documentation:**
- [`/workers/generic-worker/README.md`](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/workers/generic-worker/README.md) - Document launch-time idle timeout configuration

**Monitoring Documentation:**
- [`/libraries/monitor/README.md`](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/libraries/monitor/README.md) - Document new minCapacity worker metrics and alerting
## Migration Path

**For Existing Deployments:**
1. Deploy the enhanced worker-manager with the feature disabled by default
2. Update worker pool configurations to enable minCapacity workers
3. Monitor success metrics and validate the cache preservation vs. termination trade-off
4. Gradually expand to additional pools based on observed benefits

**Database Migration:**
The migration adds a new `intended_role` text column to the workers table for tracking worker roles. Additionally, an index on the worker_pool_id and intended_role columns, restricted to min-capacity workers, enables efficient capacity queries during provisioning decisions, as sketched below.
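Expressed as DDL, under the assumption that a partial index is what keeps the index small; exact naming and framing would follow worker-manager's versioned migrations:

```sql
ALTER TABLE workers ADD COLUMN intended_role TEXT;

-- Partial index covering only min-capacity workers, for fast capacity queries:
CREATE INDEX workers_min_capacity_idx
  ON workers (worker_pool_id, intended_role)
  WHERE intended_role = 'min-capacity';
```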
# Implementation

<Once the RFC is decided, these links will provide readers a way to track the
implementation through to completion, and to know if they are running a new
enough version to take advantage of this change. It's fine to update this
section using short PRs or pushing directly to master after the RFC is
decided>

* <link to tracker bug, issue, etc.>
* <...>
* Implemented in Taskcluster version ...

rfcs/README.md

Lines changed: 1 addition & 0 deletions

@@ -57,3 +57,4 @@
 | RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](0182-taskcluster-yml-remote-references.md) |
 | RFC#189 | [Batch APIs for task definition, status and index path](0189-batch-task-apis.md) |
 | RFC#191 | [Worker Manager launch configurations](0191-worker-manager-launch-configs.md) |
+| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
