# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @johanlorenzo

# Summary

Optimize worker pools with `minCapacity >= 1` by implementing minimum capacity workers that avoid unnecessary shutdown/restart cycles, preserving caches and reducing task wait times.

## Motivation

Currently, workers in pools with `minCapacity >= 1` exhibit wasteful behavior:

1. **Cache Loss**: Workers shut down after the idle timeout (600 seconds for decision pools), losing valuable caches:
   - VCS repositories and history
   - Package manager caches (npm, pip, cargo, etc.)
   - Container images and layers

2. **Provisioning Delays**: New worker provisioning [takes ~75 seconds on average for decision pools](https://taskcluster.github.io/mozilla-history/worker-metrics), during which tasks must wait

3. **Resource Waste**: The current cycle of shutdown → detection → spawn → provision → register wastes compute resources and increases task latency

4. **Violation of `minCapacity` Intent**: `minCapacity >= 1` suggests these pools should always have capacity available, but the current implementation allows temporary capacity gaps
| 23 | + |
| 24 | +# Details |
| 25 | + |
| 26 | +## Current Behavior Analysis |
| 27 | + |
| 28 | +**Affected Worker Pools:** |
| 29 | +- Direct `minCapacity: 1`: [`infra/build-decision`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L220), [`code-review/bot-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L3320) |
| 30 | +- Keyed `minCapacity: 1`: [`gecko-1/decision-gcp`, `gecko-3/decision-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L976), and [pools matching `(app-services|glean|mozillavpn|mobile|mozilla|translations)-1`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L1088) |
| 31 | + |
| 32 | +**Current Implementation Issues:** |
| 33 | +- Worker-manager enforces `minCapacity` by spawning new workers when capacity drops below threshold |
| 34 | +- Generic-worker shuts down after `afterIdleSeconds` regardless of `minCapacity` requirements |
| 35 | +- Gap exists between worker shutdown and replacement detection/provisioning |
| 36 | + |
| 37 | +## Proposed Solution |
| 38 | + |
| 39 | +### Core Concept: `minCapacity` Workers |
| 40 | + |
| 41 | +Workers fulfilling `minCapacity >= 1` requirements should receive significantly longer idle timeouts to preserve caches. |

### Implementation Approach

#### 1. Worker Identification and Tagging

**Worker Provisioning Logic:**
- Worker-manager determines worker idle timeout configuration at spawn time based on current pool capacity
- Uses the existing [`launch_config_id` system](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/data.js#L313-L362) to ensure per-variant capacity tracking
- **No Special Replacement Logic**: Failed workers are handled through normal provisioning cycles, not immediate replacement

#### 2. Lifecycle-Based Worker Configuration

Workers are assigned `minCapacity` or overflow roles at launch time and never change configuration during their lifetime. This approach works within Taskcluster's architectural constraints: worker-manager cannot communicate configuration changes to running workers.

**Launch-Time Configuration:**
Worker-manager determines a worker's role at spawn time based on current pool capacity needs. Workers fulfilling `minCapacity` requirements receive the pool's configured `minCapacityIdleTimeoutSecs` (0 for indefinite runtime) and are tagged with the `min-capacity` role. Additional workers beyond `minCapacity` receive standard idle timeouts (default 600 seconds) and are tagged as `overflow` workers.
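
The sketch below illustrates this launch-time decision. The function and field names (`workerLaunchConfig`, `role`, `idleTimeoutSecs`) are hypothetical and only meant to make the split concrete; they are not existing worker-manager APIs.

```js
// Illustrative only: decide, at spawn time, which role and idle timeout a new
// worker receives. Names are hypothetical, not existing worker-manager APIs.
function workerLaunchConfig({ poolConfig, runningMinCapacityWorkers }) {
  const needsMinCapacityWorker =
    poolConfig.minCapacityIdleTimeoutSecs !== undefined &&
    runningMinCapacityWorkers < poolConfig.minCapacity;

  if (needsMinCapacityWorker) {
    return {
      role: 'min-capacity',
      // 0 means "never shut down while idle"
      idleTimeoutSecs: poolConfig.minCapacityIdleTimeoutSecs,
    };
  }
  return {
    role: 'overflow',
    idleTimeoutSecs: 600, // existing afterIdleSeconds default
  };
}
```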

**Immutable Worker Behavior:**
- Workers receive their complete configuration at startup
- No runtime configuration changes
- Worker role and idle timeout never change during worker lifetime
- `minCapacity` requirements fulfilled through worker replacement, not reconfiguration

#### 3. Pool Configuration

**New Configuration Options:**
Pools will have a new `minCapacityIdleTimeoutSecs` parameter that enables `minCapacity` worker behavior. Setting this to 0 means `minCapacity` workers will run indefinitely (no idle timeout), providing maximum cache preservation.
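
As an illustration, a pool opting in might look like the following. The exact placement of `minCapacityIdleTimeoutSecs` in the configuration schema is still an assumption; the surrounding fields simply mirror existing pool settings.

```js
// Hypothetical pool configuration; only minCapacityIdleTimeoutSecs is new,
// and its final position in the schema is to be decided.
const examplePoolConfig = {
  minCapacity: 1,
  maxCapacity: 50,
  // Proposed: idle timeout applied only to workers spawned to satisfy
  // minCapacity. 0 = run indefinitely for maximum cache preservation.
  minCapacityIdleTimeoutSecs: 0,
};
```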

**Validation** (see the sketch below):
- Fail pool provisioning entirely if the configuration is invalid (e.g., `minCapacity > maxCapacity`)
- Require `minCapacityIdleTimeoutSecs >= 0` if `minCapacity` workers are desired
- Setting `minCapacityIdleTimeoutSecs = 0` enables indefinite runtime for maximum cache preservation
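
A minimal sketch of these rules, written as a standalone check rather than worker-manager's actual schema validation:

```js
// Sketch only: not the actual worker-manager schema-validation code.
function validatePoolConfig({ minCapacity, maxCapacity, minCapacityIdleTimeoutSecs }) {
  if (minCapacity > maxCapacity) {
    throw new Error('minCapacity must not exceed maxCapacity');
  }
  if (minCapacityIdleTimeoutSecs !== undefined &&
      (!Number.isInteger(minCapacityIdleTimeoutSecs) || minCapacityIdleTimeoutSecs < 0)) {
    throw new Error('minCapacityIdleTimeoutSecs must be an integer >= 0');
  }
}
```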

#### 4. Enhanced Provisioning Logic

**Provisioning Strategy:**
The enhanced estimator uses the existing `minCapacity` enforcement logic but provisions workers with different idle timeout configurations based on pool needs. When total running capacity falls below `minCapacity`, new workers are spawned with the pool's `minCapacityIdleTimeoutSecs` (0 for indefinite runtime) to serve as long-lived `minCapacity` workers. Additional workers beyond `minCapacity` are spawned with standard idle timeouts. This demand-based approach avoids over-provisioning and works within the existing estimator logic.
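
A sketch of how that split could sit on top of the existing estimator logic. Only `minCapacity`, `desiredCapacity`, and `existingCapacity` correspond to quantities `Estimator.simple()` already works with; the function name and return shape are illustrative.

```js
// Sketch only: splits the spawn decision into min-capacity and overflow workers.
function planSpawns({ pool, existingCapacity, desiredCapacity }) {
  const { minCapacity, minCapacityIdleTimeoutSecs } = pool;

  // Long-lived workers are spawned only when total running capacity has
  // fallen below the minCapacity floor.
  const minCapacityDeficit = Math.max(0, minCapacity - existingCapacity);

  // Total demand, with minCapacity as a floor (the existing estimator behavior).
  const totalToSpawn = Math.max(0, Math.max(desiredCapacity, minCapacity) - existingCapacity);

  return {
    minCapacityWorkers: {
      count: minCapacityDeficit,
      idleTimeoutSecs: minCapacityIdleTimeoutSecs, // e.g. 0 = run indefinitely
    },
    overflowWorkers: {
      count: totalToSpawn - minCapacityDeficit,
      idleTimeoutSecs: 600, // standard default
    },
  };
}
```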

**Current Estimator Implementation:** The [Estimator.simple() method](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L7-L84) enforces `minCapacity` as a floor at [lines 35-39](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L35-L39)

**Capacity Management:**
- **minCapacity increases**: Spawn additional `minCapacity` workers with indefinite runtime
- **minCapacity decreases**: Forcefully terminate excess `minCapacity` workers (see Capacity Reduction Trade-offs)
- **Worker failures**: Failed workers are handled through normal provisioning cycles when total capacity falls below `minCapacity`
- **Multi-region**: Pool-wide capacity management (any region can fulfill `minCapacity`)

**Capacity Reduction Trade-offs:**
When `minCapacity` decreases, excess `minCapacity` workers must be forcefully terminated since:
- Worker-manager cannot communicate with running workers to reduce their idle timeouts
- Waiting for the natural idle timeout could take hours (defeating responsive capacity management)
- Forceful termination provides immediate capacity adjustment but loses cache benefits

#### 5. Forceful Termination for Capacity Management

**Technical Capability:**
Worker-manager has direct cloud provider API access and can forcefully terminate instances:
- **GCP**: [`workerId = instanceId` mapping](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L341), uses the [`compute.instances.delete()` API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L181-L185)
- **AWS**: [`workerId = instanceId` mapping](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L243), uses the [`TerminateInstancesCommand` API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L393-L395)
- **Azure**: [Worker ID maps to VM name](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L302), uses the [VM deletion API](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L1249-L1254)

**Implementation:**
Worker-manager already has the capability to forcefully terminate workers: it marks the worker as `STOPPING` in the database and then calls the cloud provider's termination API directly. This works across all supported cloud providers (GCP, AWS, Azure).
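
Roughly, the termination path looks like the sketch below. `worker.update()` and `provider.removeWorker()` stand in for the existing data-layer and provider methods linked under Implementation References; their exact signatures may differ.

```js
// Sketch of the forced-termination path; method signatures are approximate.
async function terminateExcessMinCapacityWorker(worker, provider) {
  // Mark the worker as STOPPING so it no longer counts toward running capacity.
  await worker.update({ state: 'stopping' }); // Worker.states.STOPPING

  // Delete the instance through the cloud provider's own API
  // (compute.instances.delete on GCP, TerminateInstancesCommand on AWS,
  //  VM deletion on Azure).
  await provider.removeWorker({ worker, reason: 'minCapacity reduced' });
}
```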

**Implementation References:**
- [Worker.states.STOPPING definition](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/data.js#L858-L863)
- [AWS removeWorker implementation](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/aws.js#L382-L418)
- [Google removeWorker implementation](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/google.js#L168-L192)
- [Azure removeWorker implementation](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/providers/azure/index.js#L1221-L1233)

**Trade-offs:**
- ✅ **Immediate capacity response**: `minCapacity` changes take effect in ~30 seconds
- ❌ **Cache loss**: Defeats the primary optimization goal when terminating active workers
- ⚠️ **Selective termination**: Prioritize terminating idle workers when possible

#### 6. Enhanced Health Monitoring

**MinCapacity Worker Health Checks:**
- More aggressive health monitoring for `minCapacity` workers
- Faster detection of unresponsive `minCapacity` workers
- Automatic detection and fallback if `minCapacity` workers cause issues

#### 7. Monitoring and Alerting

**Implementation Location:**
New metrics will be added to [services/worker-manager/src/monitor.js](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/monitor.js#L377) following the existing pattern established in [lines 246-377](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/monitor.js#L246-L377). These metrics will be displayed in the Taskcluster UI Status Dashboard for human operators and exposed via the Prometheus `/metrics` endpoint so that external monitoring systems can build custom dashboards and alerting.

**New Prometheus Metrics** (a registration sketch follows the list):

1. **MinCapacity Worker Count** - Gauge tracking the number of workers fulfilling `minCapacity` requirements, labeled by worker state (running/stopping/requested)

2. **Overflow Worker Count** - Gauge tracking the number of workers beyond `minCapacity` requirements, labeled by worker state

3. **MinCapacity Worker Terminations** - Counter tracking the total number of `minCapacity` workers terminated, labeled by termination reason (capacity-reduction/health-check/idle-timeout)

4. **MinCapacity Deficit Duration** - Histogram measuring time spent below the `minCapacity` threshold, with buckets covering 1 second to 10 minutes
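
The following sketch shows what these four metrics could look like using the plain `prom-client` library; the real implementation would register them through the existing monitor.js metric pattern, and the metric names and label sets shown here are assumptions.

```js
// Illustrative metric definitions only; names, labels, and the use of
// prom-client directly (rather than the monitor.js helpers) are assumptions.
const client = require('prom-client');

const minCapacityWorkers = new client.Gauge({
  name: 'worker_manager_min_capacity_workers',
  help: 'Workers currently fulfilling minCapacity, by state',
  labelNames: ['pool', 'state'], // running / stopping / requested
});

const overflowWorkers = new client.Gauge({
  name: 'worker_manager_overflow_workers',
  help: 'Workers beyond minCapacity, by state',
  labelNames: ['pool', 'state'],
});

const minCapacityTerminations = new client.Counter({
  name: 'worker_manager_min_capacity_terminations_total',
  help: 'minCapacity workers terminated, by reason',
  labelNames: ['pool', 'reason'], // capacity-reduction / health-check / idle-timeout
});

const minCapacityDeficitSeconds = new client.Histogram({
  name: 'worker_manager_min_capacity_deficit_seconds',
  help: 'Time spent below the minCapacity threshold',
  labelNames: ['pool'],
  buckets: [1, 5, 15, 30, 60, 120, 300, 600], // 1 second to 10 minutes
});
```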

**Metrics Integration:**
Metrics will be exposed in [estimator.js](https://github.com/taskcluster/taskcluster/blob/d7fbf0ee9e0d93e079cc7ff069eaceecfd7d29ec/services/worker-manager/src/estimator.js#L89-L92) using the existing `exposeMetrics()` pattern to report worker counts by pool and role during each provisioning cycle.

**Alerting Rules:**
Two alerting rules will monitor `minCapacity` worker health: a `MinCapacityDeficit` alert, triggered when a worker pool stays below `minCapacity` for more than 30 seconds, and a `HighMinCapacityTerminations` alert, triggered when the termination rate exceeds 0.1 workers per 5-minute window for more than 2 minutes.

## Success Measurement

**Primary Metrics:**

1. **Task Wait Time Reduction** - Measure 95th percentile task pending-to-running duration before and after implementation, by worker pool

2. **Worker Spawn Frequency Reduction** - Track the rate of new worker requests relative to total capacity changes to measure churn reduction

3. **MinCapacity Worker Stability** - Calculate the percentage of time pools maintain `minCapacity` without deficits

**Secondary Metrics:**

4. **Cache Preservation Effectiveness** - Compare 90th percentile worker uptime between `minCapacity` and overflow workers to measure cache retention

5. **Termination Frequency Impact** - Monitor the rate of `minCapacity` worker terminations per hour

6. **Capacity Response Time** - Average time to fulfill `minCapacity` requirements after capacity deficits occur

**Success Criteria:**
- **Task wait time**: 20% reduction in P95 pending→running time
- **Worker churn**: 50% reduction in worker spawn frequency for `minCapacity` pools
- **Cache effectiveness**: 80% of `minCapacity` workers have >1 hour uptime
- **Deficit resolution**: <60 seconds average time to resolve `minCapacity` deficits
- **Termination cost**: <5% of `minCapacity` workers terminated for capacity reduction per day

**Trade-off Evaluation:**
Cost-benefit analysis will compare the benefits of reduced task wait times (measured by time savings multiplied by task throughput) against the costs of cache loss from forced terminations (measured by termination rate multiplied by average cache warmup time).
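
In back-of-the-envelope form (all inputs are measured quantities; the function name and units are only illustrative):

```js
// Net benefit in worker-seconds per day; positive = the optimization pays off.
function netBenefitSeconds({ waitTimeSavedSecs, tasksPerDay, terminationsPerDay, cacheWarmupSecs }) {
  const benefit = waitTimeSavedSecs * tasksPerDay;   // time saved across tasks
  const cost = terminationsPerDay * cacheWarmupSecs; // cache re-warm time lost
  return benefit - cost;
}
```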

## Rollout Strategy

**Phase 1**: Single Pool Testing
- Enable on one stable test pool (e.g., `gecko-1/decision-gcp`, which serves the try branch)
- Monitor success metrics, cache preservation, and termination frequency
- Validate that pool benefits outweigh the costs of occasional forceful termination

**Phase 2**: Gradual Expansion
- Roll out to remaining decision pools with stable `minCapacity` requirements
- Prioritize pools where `minCapacity` changes are infrequent
- Monitor the trade-off between cache benefits and capacity management responsiveness

**Phase 3**: Full Deployment
- Enable for all eligible pools after validating net positive impact
- Continue monitoring optimization effectiveness across different usage patterns
- Consider disabling for pools where `minCapacity` changes are too frequent

## Error Handling and Edge Cases

**Worker Lifecycle Management:**
- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration
- **Graceful transitions**: When possible, terminate only idle workers to preserve active caches
- **Resource allocation**: `minCapacity` workers are mixed with overflow workers on the same infrastructure

**Capacity Reduction Strategy:**
When `minCapacity` decreases, a hybrid approach minimizes cache loss: idle excess workers are terminated first, preserving active caches; if further reduction is still needed, busy workers are terminated immediately, oldest first, so that recently started workers keep their caches.
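
A sketch of that selection order; `worker.idle` and `worker.started` are illustrative fields, not the actual worker data model.

```js
// Sketch of the hybrid termination order: idle workers first, then busy
// workers oldest-first.
function selectWorkersToTerminate(minCapacityWorkers, excessCount) {
  const idle = minCapacityWorkers.filter(w => w.idle);
  const busy = minCapacityWorkers
    .filter(w => !w.idle)
    .sort((a, b) => a.started - b.started); // oldest busy workers first

  // Prefer idle workers; only reach into busy ones if still over capacity.
  return [...idle, ...busy].slice(0, excessCount);
}
```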

## Compatibility Considerations

**Backward Compatibility:**
- Opt-in via `minCapacityIdleTimeoutSecs` configuration
- Existing pools continue current behavior unless explicitly enabled
- No changes to generic-worker's core idle timeout mechanism

**API Changes:**
- Enhanced worker pool configuration schema
- Existing cloud provider termination APIs used for capacity management

**Security Implications:**
- No security changes - same authentication and authorization flows
- Longer-lived workers maintain the same credential rotation schedule

**Performance Implications:**
- **Positive**: Reduced task wait times, preserved caches, fewer API calls
- **Negative**: Slightly higher resource usage for idle `minCapacity` workers
- **Net**: Expected improvement in overall system efficiency

# Implementation

<Once the RFC is decided, these links will provide readers a way to track the
implementation through to completion, and to know if they are running a new
enough version to take advantage of this change. It's fine to update this
section using short PRs or pushing directly to master after the RFC is
decided>

* <link to tracker bug, issue, etc.>
* <...>
* Implemented in Taskcluster version ...