|
| 1 | +# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed |
| 2 | +* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192) |
| 3 | +* Proposed by: [@JohanLorenzo](https://github.com/JohanLorenzo) |
| 4 | + |
| 5 | +# Summary |
| 6 | + |
| 7 | +Optimize worker pools with `minCapacity >= 1` by implementing minimum capacity workers that avoid unnecessary shutdown/restart cycles, preserving caches and reducing task wait times. |
| 8 | + |
| 9 | +## Motivation |
| 10 | + |
| 11 | +Currently, workers in pools with `minCapacity >= 1` exhibit wasteful behavior: |
| 12 | + |
| 13 | +1. **Cache Loss**: Workers shut down after idle timeout (600 seconds for decision pools), losing valuable caches: |
| 14 | + - VCS repositories and history |
| 15 | + - Package manager caches (npm, pip, cargo, etc.) |
| 16 | + - Container images and layers |
| 17 | + |
| 18 | +2. **Provisioning Delays**: Provisioning a new worker [takes ~75 seconds on average for decision pools](https://taskcluster.github.io/mozilla-history/worker-metrics), during which tasks must wait |
| 19 | + |
| 20 | +3. **Resource Waste**: The current cycle of shutdown → detection → spawn → provision → register wastes compute resources and increases task latency |
| 21 | + |
| 22 | +4. **Violation of `minCapacity` Intent**: `minCapacity >= 1` suggests these pools should always have capacity available, but the current implementation allows temporary capacity gaps |
| 23 | + |
| 24 | +# Details |
| 25 | + |
| 26 | +## Current Behavior Analysis |
| 27 | + |
| 28 | +**Affected Worker Pools:** |
| 29 | +- Direct `minCapacity: 1`: [`infra/build-decision`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L220), [`code-review/bot-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L3320) |
| 30 | +- Keyed `minCapacity: 1`: [`gecko-1/decision-gcp`, `gecko-3/decision-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L976), and [pools matching `(app-services|glean|mozillavpn|mobile|mozilla|translations)-1`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L1088) |
| 31 | + |
| 32 | +**Current Implementation Issues:** |
| 33 | +- Worker-manager enforces `minCapacity` by spawning new workers when capacity drops below the threshold |
| 34 | +- Generic-worker shuts down after `idleTimeoutSecs` regardless of `minCapacity` requirements |
| 35 | +- A gap exists between a worker shutting down and its replacement being detected as needed and provisioned |
| 36 | + |
| 37 | +## Proposed Solution |
| 38 | + |
| 39 | +### Core Concept: `minCapacity` Workers |
| 40 | + |
| 41 | +Workers fulfilling `minCapacity >= 1` requirements should never self-terminate. This is achieved through a two-phase implementation: Phase 1 prevents minCapacity workers from self-terminating, and Phase 2 makes worker-manager the central authority for all worker termination decisions. |
| 42 | + |
| 43 | +### Phase 1: Prevent MinCapacity Worker Self-Termination |
| 44 | + |
| 45 | +#### 1. Automatic Activation |
| 46 | + |
| 47 | +**Trigger:** Automatically enabled when `minCapacity >= 1` (no configuration flag needed) |
| 48 | + |
| 49 | +Workers spawned when `runningCapacity < minCapacity` receive `idleTimeoutSecs=0` and never self-terminate. |
| 50 | + |
| 51 | +#### 2. Worker Config Injection |
| 52 | + |
| 53 | +Worker-manager sets `idleTimeoutSecs=0` when spawning minCapacity workers. This makes [generic-worker never terminate by itself](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/workers/generic-worker/main.go#L536-L542). |
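
The injection step can be sketched as follows. This is a minimal illustration in Python, not worker-manager code (worker-manager is JavaScript). The helper name `build_worker_config` and the flat config shape are assumptions made for the example; only `idleTimeoutSecs`, `minCapacity`, and the `runningCapacity < minCapacity` condition come from this RFC.

```python
from typing import Any


def build_worker_config(
    base_config: dict[str, Any],
    running_capacity: int,
    min_capacity: int,
) -> dict[str, Any]:
    """Return the config for a worker that is about to be spawned.

    Hypothetical helper: a worker spawned while the pool is below
    minCapacity gets idleTimeoutSecs=0 so that it never self-terminates.
    """
    config = dict(base_config)  # shallow copy; the pool's config stays untouched
    if running_capacity < min_capacity:
        config["idleTimeoutSecs"] = 0
    return config


pool_config = {"idleTimeoutSecs": 600}
min_capacity_worker = build_worker_config(pool_config, running_capacity=0, min_capacity=1)
regular_worker = build_worker_config(pool_config, running_capacity=1, min_capacity=1)
print(min_capacity_worker, regular_worker)
```

Copying the config rather than mutating it keeps the pool-level `idleTimeoutSecs` available for workers spawned above `minCapacity`.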
| 54 | + |
| 55 | +#### 3. Capacity Management |
| 56 | + |
| 57 | +**Removing Excess Capacity:** |
| 58 | + |
| 59 | +When `runningCapacity > minCapacity`, the [worker-manager scanner identifies](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) and terminates excess workers. |
| 60 | + |
| 61 | +**Termination logic:** |
| 62 | +- Query the [Queue API client](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/main.js#L176-L178) to check whether the worker's latest task is still running |
| 63 | +- Select the oldest idle workers first (by `worker.created` timestamp) |
| 64 | +- Use the existing `removeWorker()` methods to terminate each selected worker ([Google `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/google.js#L168-L192), [AWS `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/aws.js#L382-L418), [Azure `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/azure/index.js#L1221-L1233)) |
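
The selection part of the termination logic above could look roughly like this. It is a Python sketch under stated assumptions: the `Worker` class, its `is_running_task` flag (standing in for the Queue API check), and `select_workers_to_remove` are hypothetical names, and the actual `removeWorker()` call is left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Worker:
    worker_id: str
    created: datetime      # mirrors the `worker.created` timestamp
    capacity: int
    is_running_task: bool  # assumed result of the Queue API check


def select_workers_to_remove(workers: list[Worker], min_capacity: int) -> list[Worker]:
    """Pick idle workers, oldest first, without dropping below min_capacity."""
    excess = sum(w.capacity for w in workers) - min_capacity
    idle_oldest_first = sorted(
        (w for w in workers if not w.is_running_task),
        key=lambda w: w.created,
    )
    selected: list[Worker] = []
    for worker in idle_oldest_first:
        if worker.capacity > excess:
            continue  # removing this worker would leave less than minCapacity
        selected.append(worker)
        excess -= worker.capacity
    return selected


pool = [
    Worker("a", datetime(2025, 1, 1), capacity=1, is_running_task=False),
    Worker("b", datetime(2025, 1, 2), capacity=1, is_running_task=True),
    Worker("c", datetime(2025, 1, 3), capacity=1, is_running_task=False),
]
to_remove = [w.worker_id for w in select_workers_to_remove(pool, min_capacity=1)]
print(to_remove)
```

In this example the busy worker `b` is never selected, so it ends up being the one that satisfies `minCapacity`.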
| 65 | + |
| 66 | + |
| 67 | +## Error Handling and Edge Cases |
| 68 | + |
| 69 | +**Worker Lifecycle Management:** |
| 70 | +- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration |
| 71 | +- **Graceful transitions**: When possible, only terminate idle workers to preserve active caches |
| 72 | +- **Resource allocation**: minCapacity workers run alongside other workers on the same infrastructure |
| 73 | + |
| 74 | +**Launch Configuration Changes:** |
| 75 | +When a launch configuration is changed, removed, or archived, all workers created from the old configuration must be terminated and replaced: |
| 76 | +- If a launch configuration is archived (not present in the new configuration), identify all long-running workers created from it |
| 77 | +- Terminate these workers via cloud provider APIs after checking for running tasks |
| 78 | +- Worker-manager will spawn new workers using the updated launch configuration |
| 79 | +- This ensures workers always run with current configuration and prevents indefinite use of outdated configurations |
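
One way to picture the steps above, as a Python sketch only: the `launch_config_id` field, the `is_running_task` flag, and `split_outdated_workers` are hypothetical names. Deferring busy workers to a later pass is also an assumption; the RFC only states that running tasks are checked before termination.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    launch_config_id: str  # assumed link from a worker to its launch configuration
    is_running_task: bool  # assumed result of the running-task check


def split_outdated_workers(
    workers: list[Worker],
    active_launch_config_ids: set[str],
) -> tuple[list[Worker], list[Worker]]:
    """Return (terminate_now, retry_later) for workers on archived configs."""
    terminate_now: list[Worker] = []
    retry_later: list[Worker] = []
    for worker in workers:
        if worker.launch_config_id in active_launch_config_ids:
            continue  # still on a current launch configuration
        if worker.is_running_task:
            retry_later.append(worker)  # assumption: wait for the task to finish
        else:
            terminate_now.append(worker)
    return terminate_now, retry_later


workers = [
    Worker("w1", "lc-old", is_running_task=False),
    Worker("w2", "lc-old", is_running_task=True),
    Worker("w3", "lc-new", is_running_task=False),
]
now_ids, later_ids = (
    [w.worker_id for w in group]
    for group in split_outdated_workers(workers, {"lc-new"})
)
print(now_ids, later_ids)
```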
| 80 | + |
| 81 | +## Compatibility Considerations |
| 82 | + |
| 83 | +- Automatic activation when `minCapacity >= 1` (no opt-in flag needed) |
| 84 | +- Pools without `minCapacity >= 1` keep their current behavior; in pools with `minCapacity >= 1`, minCapacity workers automatically stop self-terminating |
| 85 | +- No changes to generic-worker's idle timeout mechanism in Phase 1 |
| 86 | + |
| 87 | +### Phase 2: Centralized Termination Authority |
| 88 | + |
| 89 | +**Goal:** Make worker-manager the sole authority for all worker termination decisions by removing worker self-termination entirely. |
| 90 | + |
| 91 | +#### Implementation Changes |
| 92 | + |
| 93 | +**1. Remove Worker Idle Timeout Code** |
| 94 | + |
| 95 | +Remove the [idle timeout mechanism](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/workers/generic-worker/main.go#L536-L542) from generic-worker. |
| 96 | + |
| 97 | +Workers then run indefinitely until worker-manager terminates them. This means worker-manager stops sending `idleTimeoutSecs` to workers at spawn time. |
| 98 | + |
| 99 | +**2. Centralized Idle Enforcement** |
| 100 | + |
| 101 | +Worker-manager enforces the idle timeout using the existing [`queueInactivityTimeout` from lifecycle configuration](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/schemas/v1/worker-lifecycle.yml#L33-L50) and through idleTimeout (as specified in Phase 1). |
| 102 | + |
| 103 | +The [scanner polls](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) the Queue API to track worker idle time: |
| 104 | +- Get latest task from `worker.recentTasks` |
| 105 | +- Call `queue.status(taskId)` to check if task is running |
| 106 | +- Calculate idle time from task's `resolved` timestamp |
| 107 | +- Terminate when idle time exceeds `queueInactivityTimeout` |
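
The idle-time check in the steps above can be sketched as below. This is an illustrative Python sketch, not the scanner's code: the function name and its parameters are hypothetical, the task state and `resolved` timestamp are passed in directly instead of being fetched from the Queue API, and `queueInactivityTimeout` is assumed to be expressed in seconds. A worker that has never resolved a task is not covered by the steps above; this sketch simply keeps it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_terminate_for_idleness(
    latest_task_state: str,
    latest_task_resolved: Optional[datetime],
    now: datetime,
    queue_inactivity_timeout_secs: int,
) -> bool:
    """Return True when the worker has been idle longer than the timeout."""
    if latest_task_state == "running":
        return False  # never terminate a worker that is running a task
    if latest_task_resolved is None:
        return False  # assumption: no resolved task yet, so keep the worker
    idle_time = now - latest_task_resolved
    return idle_time > timedelta(seconds=queue_inactivity_timeout_secs)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
long_idle = should_terminate_for_idleness(
    "completed", now - timedelta(hours=3), now, queue_inactivity_timeout_secs=7200
)
short_idle = should_terminate_for_idleness(
    "completed", now - timedelta(minutes=5), now, queue_inactivity_timeout_secs=7200
)
print(long_idle, short_idle)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.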
| 108 | + |
| 109 | +**3. Termination Decision Factors** |
| 110 | + |
| 111 | +Worker-manager terminates workers when: |
| 112 | +- Idle timeout exceeded (`queueInactivityTimeout`) |
| 113 | +- Capacity exceeds `maxCapacity` |
| 114 | +- Capacity exceeds `minCapacity` (terminate oldest first) |
| 115 | +- Launch configuration changed/archived |
| 116 | +- Worker is unhealthy (provider-specific check) |
| 117 | + |
| 118 | +All terminations check for running tasks before proceeding. |
| 119 | + |
| 120 | +**Migration:** |
| 121 | +Deploy as a breaking change that requires simultaneous worker-manager and generic-worker updates. |
| 122 | + |
| 123 | +# Implementation |
| 124 | + |
| 125 | +<Once the RFC is decided, these links will provide readers a way to track the |
| 126 | +implementation through to completion, and to know if they are running a new |
| 127 | +enough version to take advantage of this change. It's fine to update this |
| 128 | +section using short PRs or pushing directly to master after the RFC is |
| 129 | +decided> |
| 130 | + |
| 131 | +* <link to tracker bug, issue, etc.> |
| 132 | +* <...> |
| 133 | +* Implemented in Taskcluster version ... |