src/README.md: 30 additions, 28 deletions

@@ -264,35 +264,29 @@ HRA being successfully notified about it, so as a safety measure, the `capacityR
 configurable amount of time, at which point it will be deleted without regard to the job being finished. This
 ensures that eventually an idle runner pool will scale down to `minReplicas`.

-However, there are some problems with this scheme. In theory, `webhook_startup_timeout` should only need to be long
-enough to cover the delay between the time the HRA starts a scale up request and the time the runner actually starts,
-is allocated to the runner pool, and picks up a job to run. But there are edge cases that seem not to be covered
-properly (see [actions-runner-controller issue #2466](https://github.com/actions/actions-runner-controller/issues/2466)). As a result, we recommend setting `webhook_startup_timeout` to
-a period long enough to cover the full time a job may have to wait between the time it is queued and the time it
-actually starts. Consider this scenario:
-- You set `maxReplicas = 5`
-- Some trigger starts 20 jobs, each of which take 5 minutes to run
-- The replica pool scales up to 5, and the first 5 jobs run
-- 5 minutes later, the next 5 jobs run, and so on
-- The last set of 5 jobs will have to wait 15 minutes to start because of the previous jobs
-
-The HRA is designed to handle this situation by updating the expiration time of the `capacityReservation` of any
-job stuck waiting because the pool has scaled up to `maxReplicas`, but as discussed in issue #2466 linked above,
-that does not seem to be working correctly as of version 0.27.2.
-
-For now, our recommendation is to set `webhook_startup_timeout` to a duration long enough to cover the time the job
-may have to wait in the queue for a runner to become available due to there being more jobs than `maxReplicas`.
-Alternatively, you could set `maxReplicas` to a big enough number that there will always be a runner for every
-queued job, in which case the duration only needs to be long enough to allow for all the scale-up activities (such
-as launching new EKS nodes as well as starting new pods) to finish. Remember, when everything works properly, the
-HRA will scale down the pool as jobs finish, so there is little cost to setting a long duration.
+If it happens that the capacity reservation expires before the job is finished, the Horizontal Runner Autoscaler (HRA) will scale down the pool
+by 2 instead of 1: once because the capacity reservation expired, and once because the job finished. This will
+also cause starvation of waiting jobs, because the next in line will have its timeout timer started but will not
+actually start running because no runner is available. And if `minReplicas` is set to zero, the pool will scale down
+to zero before finishing all the jobs, leaving some waiting indefinitely. This is why it is important to set the
+`webhook_startup_timeout` to a time long enough to cover the full time a job may have to wait between the time it is
+queued and the time it finishes, assuming that the HRA scales up the pool by 1 and runs the job on the new runner.
+
+:::info
+If there are more jobs queued than there are runners allowed by `maxReplicas`, the timeout timer does not start on the
+capacity reservation until enough reservations ahead of it are removed for it to be considered as representing
+an active job. Although there are some edge cases regarding `webhook_startup_timeout` that seem not to be covered
+properly (see [actions-runner-controller issue #2466](https://github.com/actions/actions-runner-controller/issues/2466)),
+they only merit adding a few extra minutes to the timeout.
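
To make the settings discussed in this hunk concrete, below is a minimal sketch of an upstream actions-runner-controller `HorizontalRunnerAutoscaler` configured for webhook-driven scaling. The resource names and values are hypothetical, and the assumption that the `webhook_startup_timeout` setting discussed above corresponds to the trigger's `duration` (the lifetime of a capacity reservation) is inferred from the behavior described in the text, not stated in this diff.

```yaml
# Hypothetical HorizontalRunnerAutoscaler for webhook-driven scaling.
# Names and values are illustrative; only the field layout follows the
# upstream actions-runner-controller CRD.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-autoscaler        # hypothetical name
spec:
  scaleTargetRef:
    name: example-runner-deployment      # hypothetical RunnerDeployment
  minReplicas: 0                         # pool may scale to zero when idle
  maxReplicas: 5                         # more queued jobs than this must wait their turn
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}                  # scale up on workflow_job webhook events
      # Capacity reservation lifetime (assumed here to be what the text above
      # calls `webhook_startup_timeout`): per the guidance in this section, it
      # should cover the full time from a job being queued until it finishes.
      duration: "30m"
```

As the removed paragraph notes, a generous duration costs little when everything works correctly, because the HRA still scales the pool down as jobs finish.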