Commit 21f1df3

RFC#0192 - ensures workers do not get unnecessarily killed
1 parent 802091b commit 21f1df3

3 files changed

+60
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -69,3 +69,4 @@ See [mechanics](mechanics.md) for more detail.
6969
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
7070
| RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
7171
| RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
72+
| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
1+
# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
2+
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
3+
* Proposed by: @JohanLorenzo
4+
5+
# Summary
6+
7+
`worker-manager` allows `minCapacity` to be set, ensuring a certain number of workers
8+
are available at any given time. Unlike what currently happens, these workers
9+
shouldn't be killed unless `minCapacity` is exceeded.
10+
11+
## Motivation - why now?
12+
13+
As far as I can remember, the current behavior has always existed. This year, the
14+
Engineering Effectiveness org is optimizing the cost of the Firefox CI instance.
15+
[Bug 1899511](https://bugzilla.mozilla.org/show_bug.cgi?id=1899511) made a change that
16+
actually uncovered the problem with the current behavior: workers get killed after 2
17+
minutes and new ones get spawned.
18+
19+
20+
# Details
21+
22+
In the current implementation, workers are in charge of knowing when they have to shut
23+
down. Given that `docker-worker` is officially no longer supported and we can't
24+
cut a new release and use it, let's change what config `worker-manager` gives to all
25+
workers, `docker-worker` included.
26+
27+
## When `minCapacity` is exceeded
28+
29+
In this case, nothing should change. `worker-manager` sends the same config to workers
30+
as it always did.
31+
32+
## When `minCapacity` is not yet met
33+
34+
Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.,
35+
24 hours). This way, workers remain online long enough and we don't kill them too often.
36+
In case one of these long-lived workers gets killed by an external factor (say: the
37+
cloud provider reclaims the spot instance), then `minCapacity` won't be met and a new
38+
long-lived one will be created.
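
The rule in this section can be sketched as a small decision function. This is only an illustration in Python, not `worker-manager` code: the function name, the 120-second default (loosely based on the 2 minutes mentioned in the motivation), and the choice to treat a pool sitting exactly at `minCapacity` as "already met" are all assumptions.

```python
# Hypothetical sketch: pick the afterIdleSeconds value handed to a new worker.
# Only `minCapacity` and `afterIdleSeconds` come from the RFC; everything else
# (names, default values, boundary handling) is assumed for illustration.

DEFAULT_AFTER_IDLE_SECONDS = 120              # assumed short-lived idle timeout
LONG_LIVED_AFTER_IDLE_SECONDS = 24 * 60 * 60  # "e.g. 24 hours" from the RFC


def after_idle_seconds_for_new_worker(
    min_capacity: int,
    current_capacity: int,
    default: int = DEFAULT_AFTER_IDLE_SECONDS,
) -> int:
    """Return the idle timeout for a worker about to be spawned.

    `current_capacity` is the capacity that exists before the new worker starts.
    """
    if current_capacity < min_capacity:
        # minCapacity is not yet met: this worker should be long-lived.
        return LONG_LIVED_AFTER_IDLE_SECONDS
    # minCapacity is met or exceeded: keep the config workers always received.
    return default
```

With `min_capacity=2`, a worker spawned into an empty pool would get the 24-hour timeout, while a worker spawned once two or more units of capacity exist would keep the default.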
39+
40+
### What if we deploy new worker images?
41+
42+
Long-lived workers will have to be killed if there's a change in their config, including
43+
their image.
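
One possible way to detect such a change is to fingerprint the config a worker was launched with and compare it to the pool's current config. The sketch below is an assumption, not an existing `worker-manager` mechanism: the function names and the flat dictionary shape are hypothetical, and `afterIdleSeconds` is excluded from the comparison because long-lived workers deliberately receive a different value for it.

```python
import hashlib
import json

# Hypothetical sketch: decide whether a long-lived worker runs an outdated config.
# The config layout (a flat dict with an "image" key) is assumed for illustration.


def config_fingerprint(config: dict, ignored_keys=("afterIdleSeconds",)) -> str:
    """Hash the config in a key-order-independent way, skipping ignored keys."""
    relevant = {k: v for k, v in config.items() if k not in ignored_keys}
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stale(worker_config: dict, current_config: dict) -> bool:
    """True when the worker's launch config differs from the current one."""
    return config_fingerprint(worker_config) != config_fingerprint(current_config)
```

Under this sketch, a worker launched with `{"image": "img-v1", "afterIdleSeconds": 86400}` is not stale against `{"image": "img-v1", "afterIdleSeconds": 120}`, but becomes stale as soon as the image changes.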
44+
45+
### What if short-lived workers are taken into account in `minCapacity`?
46+
47+
When this happens, the short-lived worker will eventually get killed, bringing the number
48+
of workers below `minCapacity`. Then, `worker-manager` will spawn a new long-lived one.
49+
50+
## How to ensure these behaviors are correctly implemented?
51+
52+
We should leverage telemetry to know how long workers live and what config they got
53+
from `worker-manager`. This will help us find any gaps in this plan.
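
As a rough idea of what such an analysis could look like, the sketch below groups worker lifetimes by the `afterIdleSeconds` value each worker received. The record format (`startedAt` and `stoppedAt` as epoch seconds) and the function name are assumptions; the RFC does not define a telemetry schema.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sketch: summarize how long workers lived per idle timeout received.
# The record fields are assumed; real telemetry would likely look different.


def median_lifetime_by_idle_timeout(records):
    """Map each afterIdleSeconds value to the median worker lifetime in seconds."""
    lifetimes = defaultdict(list)
    for record in records:
        lifetime = record["stoppedAt"] - record["startedAt"]
        lifetimes[record["afterIdleSeconds"]].append(lifetime)
    return {timeout: median(values) for timeout, values in lifetimes.items()}
```

If workers that received the long timeout still show short median lifetimes, that would point to a gap in the plan, such as workers being killed for reasons other than idleness.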
54+
55+
56+
# Implementation
57+
58+
TODO

rfcs/README.md

Lines changed: 1 addition & 0 deletions
@@ -57,3 +57,4 @@
5757
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](0182-taskcluster-yml-remote-references.md) |
5858
| RFC#189 | [Batch APIs for task definition, status and index path](0189-batch-task-apis.md) |
5959
| RFC#191 | [Worker Manager launch configurations](0191-worker-manager-launch-configs.md) |
60+
| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
