
Commit 071d107

Convert eks/cluster to aws-teams and aws-sso (cloudposse/terraform-aws-components#645)
1 parent 314c1f0 commit 071d107

File tree

1 file changed: +30 −28 lines changed


src/README.md

Lines changed: 30 additions & 28 deletions
@@ -264,35 +264,29 @@ HRA being successfully notified about it, so as a safety measure, the `capacityReservation`
 configurable amount of time, at which point it will be deleted without regard to the job being finished. This
 ensures that eventually an idle runner pool will scale down to `minReplicas`.

-However, there are some problems with this scheme. In theory, `webhook_startup_timeout` should only need to be long
-enough to cover the delay between the time the HRA starts a scale up request and the time the runner actually starts,
-is allocated to the runner pool, and picks up a job to run. But there are edge cases that seem not to be covered
-properly (see [actions-runner-controller issue #2466](https://github.com/actions/actions-runner-controller/issues/2466)). As a result, we recommend setting `webhook_startup_timeout` to
-a period long enough to cover the full time a job may have to wait between the time it is queued and the time it
-actually starts. Consider this scenario:
-- You set `maxReplicas = 5`
-- Some trigger starts 20 jobs, each of which take 5 minutes to run
-- The replica pool scales up to 5, and the first 5 jobs run
-- 5 minutes later, the next 5 jobs run, and so on
-- The last set of 5 jobs will have to wait 15 minutes to start because of the previous jobs
-
-The HRA is designed to handle this situation by updating the expiration time of the `capacityReservation` of any
-job stuck waiting because the pool has scaled up to `maxReplicas`, but as discussed in issue #2466 linked above,
-that does not seem to be working correctly as of version 0.27.2.
-
-For now, our recommendation is to set `webhook_startup_timeout` to a duration long enough to cover the time the job
-may have to wait in the queue for a runner to become available due to there being more jobs than `maxReplicas`.
-Alternatively, you could set `maxReplicas` to a big enough number that there will always be a runner for every
-queued job, in which case the duration only needs to be long enough to allow for all the scale-up activities (such
-as launching new EKS nodes as well as starting new pods) to finish. Remember, when everything works properly, the
-HRA will scale down the pool as jobs finish, so there is little cost to setting a long duration.
+If it happens that the capacity reservation expires before the job is finished, the Horizontal Runner Autoscaler (HRA) will scale down the pool
+by 2 instead of 1: once because the capacity reservation expired, and once because the job finished. This will
+also cause starvation of waiting jobs, because the next in line will have its timeout timer started but will not
+actually start running because no runner is available. And if `minReplicas` is set to zero, the pool will scale down
+to zero before finishing all the jobs, leaving some waiting indefinitely. This is why it is important to set the
+`webhook_startup_timeout` to a time long enough to cover the full time a job may have to wait between the time it is
+queued and the time it finishes, assuming that the HRA scales up the pool by 1 and runs the job on the new runner.
+
+:::info
+If there are more jobs queued than there are runners allowed by `maxReplicas`, the timeout timer does not start on the
+capacity reservation until enough reservations ahead of it are removed for it to be considered as representing
+an active job. Although there are some edge cases regarding `webhook_startup_timeout` that seem not to be covered
+properly (see [actions-runner-controller issue #2466](https://github.com/actions/actions-runner-controller/issues/2466)),
+they only merit adding a few extra minutes to the timeout.
+:::
+

 ### Recommended `webhook_startup_timeout` Duration

 #### Consequences of Too Short of a `webhook_startup_timeout` Duration

 If you set `webhook_startup_timeout` to too short a duration, the Horizontal Runner Autoscaler will cancel capacity
-reservations for jobs that have not yet run, and the pool will be too small. This will be most serious if you have
+reservations for jobs that have not yet finished, and the pool will become too small. This will be most serious if you have
 set `minReplicas = 0` because in this case, jobs will be left in the queue indefinitely. With a higher value of
 `minReplicas`, the pool will eventually make it through all the queued jobs, but not as quickly as intended due to
 the incorrectly reduced capacity.
@@ -305,11 +299,19 @@ problem with this is the added expense of leaving the idle runner running.

 #### Recommendation

-Therefore we recommend that for lightly used runner pools, set `webhook_startup_timeout` to `"30m"`. For heavily
-used pools, find the typical or maximum length of a job, multiply by the number of jobs likely to be queued in an
-hour, and divide by `maxReplicas`, then round up. As a rule of thumb, we recommend setting `maxReplicas` high enough
-that jobs never wait on the queue more than an hour and setting `webhook_startup_timeout` to `"2h30m"`. Monitor your
-usage and adjust accordingly.
+As a result, we recommend setting `webhook_startup_timeout` to a period long enough to cover:
+- The time it takes for the HRA to scale up the pool and make a new runner available
+- The time it takes for the runner to pick up the job from GitHub
+- The time it takes for the job to start running on the new runner
+- The maximum time a job might take
+
+Because the consequences of expiring a capacity reservation before the job is finished are so severe, we recommend
+setting `webhook_startup_timeout` to a period at least 30 minutes longer than you expect the longest job to take.
+Remember, when everything works properly, the HRA will scale down the pool as jobs finish, so there is little cost
+to setting a long duration, and the cost looks even smaller by comparison to the cost of having too short a duration.
+
+For lightly used runner pools expecting only short jobs, you can set `webhook_startup_timeout` to `"30m"`.
+As a rule of thumb, we recommend setting `maxReplicas` high enough that jobs never wait on the queue more than an hour.

 ### Updating CRDs
