# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @JohanLorenzo

# Summary

`worker-manager` allows `minCapacity` to be set, ensuring a certain number of workers
are available at any given time. Unlike what currently happens, these workers
shouldn't be killed unless `minCapacity` is exceeded.

## Motivation - why now?

As far as I can remember, the current behavior has always existed. This year, the
Engineering Effectiveness org is optimizing the cost of the Firefox CI instance.
[Bug 1899511](https://bugzilla.mozilla.org/show_bug.cgi?id=1899511) made a change that
uncovered the problem with the current behavior: workers get killed after 2
minutes and a new one gets spawned.


# Details

In the current implementation, workers are in charge of knowing when they have to shut
down. Given that `docker-worker` is officially not supported anymore and we can't
cut a new release and use it, let's change the config `worker-manager` gives to all
workers, `docker-worker` included.
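
To make this concrete, here is a minimal sketch (in TypeScript, with illustrative
type and field names; only `afterIdleSeconds` comes from this RFC, real worker
configs differ per worker implementation) of the piece of config `worker-manager`
would control:

```typescript
// Illustrative shape only: not the actual schema of any worker implementation.
interface ShutdownConfig {
  enabled: boolean;
  // Seconds a worker may sit idle (no claimed task) before shutting itself down.
  afterIdleSeconds: number;
}

// worker-manager embeds this in the config handed to a newly provisioned worker:
// the shutdown *decision* stays on the worker, but the *policy* becomes
// server-side and can vary per provisioning run.
function buildWorkerConfig(afterIdleSeconds: number): { shutdown: ShutdownConfig } {
  return {
    shutdown: {
      enabled: true,
      afterIdleSeconds,
    },
  };
}
```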

## When `minCapacity` is exceeded

In this case, nothing should change. `worker-manager` sends the same config to workers
as it always did.

## When `minCapacity` is not yet met

Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.:
24 hours). This way, workers remain online long enough and we don't kill them too often.
In case one of these long-lived workers gets killed by an external factor (say: the
cloud provider reclaims the spot instance), then `minCapacity` won't be met and a new
long-lived one will be created.
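
A sketch of how the provisioning-side decision could look (hypothetical function
and helper names; the 24-hour value is just the example above):

```typescript
const ONE_DAY_SECONDS = 24 * 60 * 60;

interface PoolConfig {
  minCapacity: number;
  afterIdleSeconds: number; // the normal, short idle timeout
}

// Called whenever worker-manager is about to spawn a worker. `currentCapacity`
// is assumed to be the pool's currently running capacity.
function pickIdleTimeout(pool: PoolConfig, currentCapacity: number): number {
  if (currentCapacity < pool.minCapacity) {
    // Below minCapacity: hand out a long idle timeout so this worker sticks
    // around instead of being killed and respawned every few minutes.
    return ONE_DAY_SECONDS;
  }
  // At or above minCapacity: keep the usual short timeout so surplus workers
  // shut down once they go idle.
  return pool.afterIdleSeconds;
}
```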

### What if we deploy new worker images?

Long-lived workers will have to be killed if there's a change in their config, including
their image.
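
One possible way to detect such a change (a sketch, not worker-manager's actual
mechanism): record a hash of the launch config each worker was created from, and
retire long-lived workers whose hash no longer matches the pool's current config:

```typescript
import { createHash } from 'crypto';

// Note: JSON.stringify is key-order-sensitive; a real implementation would
// canonicalize the config before hashing.
function launchConfigHash(launchConfig: object): string {
  return createHash('sha256').update(JSON.stringify(launchConfig)).digest('hex');
}

// True when the worker was created from an outdated config (e.g. an old image).
function shouldRetire(workerConfigHash: string, currentLaunchConfig: object): boolean {
  return workerConfigHash !== launchConfigHash(currentLaunchConfig);
}
```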

### What if short-lived workers are taken into account in `minCapacity`?

When this happens, the short-lived worker will eventually get killed, dropping the number
of workers below `minCapacity`. Then, `worker-manager` will spawn a new long-lived one.

## How to ensure these behaviors are correctly implemented?

We should leverage telemetry to know how long workers live and what config they got
from `worker-manager`. This will help us find any gaps in this plan.
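
For instance (a sketch assuming a structured logger; all field names are
illustrative), worker-manager could emit a record whenever a worker terminates:

```typescript
interface Logger {
  info(message: string, fields: Record<string, unknown>): void;
}

interface TerminatedWorker {
  workerId: string;
  workerPoolId: string;
  createdAt: Date;
  afterIdleSeconds: number; // the idle timeout this worker was given
}

// Emitting lifetime alongside the configured idle timeout lets us verify that
// workers spawned to satisfy minCapacity actually live long, and spot any that
// get killed early.
function reportWorkerTerminated(logger: Logger, worker: TerminatedWorker): void {
  const lifetimeSeconds = (Date.now() - worker.createdAt.getTime()) / 1000;
  logger.info('worker-terminated', {
    workerId: worker.workerId,
    workerPoolId: worker.workerPoolId,
    afterIdleSeconds: worker.afterIdleSeconds,
    lifetimeSeconds,
  });
}
```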


# Implementation

TODO