- A PAT with the scope outlined in [this document](https://github.com/actions-runner-controller/actions-runner-controller#deploying-using-pat-authentication).
Save this to the SSM parameter at the path specified by `ssm_github_token_path`, using the following command, adjusting
`AWS_PROFILE` to refer to the `admin` role in the account to which you are deploying the runner controller:
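
For example (a sketch using the AWS CLI, not necessarily the exact command for your setup; the profile name and
parameter path below are placeholders for your own values):

```bash
# Store the GitHub token in SSM as a SecureString.
# Replace the profile with your admin role profile, and the --name value
# with the path you configured in `ssm_github_token_path`.
AWS_PROFILE=acme-gbl-auto-admin aws ssm put-parameter \
  --name "/github-runners/github_token" \
  --value "$GITHUB_TOKEN" \
  --type SecureString \
  --overwrite
```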

GitHub supports grouping runners into distinct [Runner Groups](https://docs.github.com/en/actions/hosting-your-own-runners/managing-access-to-self-hosted-runners-using-groups), which allow you to have different access controls
for different runners. Read the linked documentation about creating and configuring Runner Groups, which you must do
through the GitHub Web UI. If you choose to create Runner Groups, you can assign one or more Runner pools (from the
`runners` map) to groups (only one group per runner pool) by including `group: <Runner Group Name>` in the runner
configuration. We recommend including it immediately after `scope`.
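
For example (a minimal sketch; the runner name, scope, and group are illustrative, and the other required fields
from the `runners` schema are omitted for brevity):

```hcl
runners = {
  infra-runner = {
    type  = "organization"
    scope = "ACME"            # GitHub organization name
    group = "core-automation" # Optional: assign this runner pool to a Runner Group
    # ... remaining runner settings ...
  }
}
```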
### Using Webhook Driven Autoscaling (recommended)

We recommend using Webhook Driven Autoscaling until GitHub releases their own autoscaling solution (said to be "in the works" as of April 2023).

To use the Webhook Driven Autoscaling, in addition to setting `webhook_driven_scaling_enabled` to `true`, you must
also install the GitHub organization-level webhook after deploying the component (specifically, the webhook server).
The URL for the webhook is determined by the `webhook.hostname_template` and where
it is deployed. Recommended URL is `https://gha-webhook.[environment].[stage].[tenant].[service-discovery-domain]`.
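
For example (a sketch; `example.com` stands in for your service discovery domain. The template is rendered with
`format(var.hostname_template, var.tenant, var.stage, var.environment)`, so `%[3]v` is the environment, `%[2]v` the
stage, and `%[1]v` the tenant):

```hcl
webhook = {
  enabled           = true
  hostname_template = "gha-webhook.%[3]v.%[2]v.%[1]v.example.com"
}
```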
As a GitHub organization admin, go to `https://github.com/organizations/[organization]/settings/hooks`, and then:
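
1. Click "Add webhook"
2. For "Payload URL", enter the webhook URL determined above
3. For "Content type", select `application/json`
4. For "Secret", enter the secret token stored at the `ssm_github_webhook_secret_token_path` SSM path
5. Under "Which events would you like to trigger this webhook?", select "Let me select individual events" and check "Workflow jobs"
6. Click "Add webhook" to save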

After the webhook is created, select "edit" for the webhook and go to the "Recent Deliveries" tab and verify that there is a delivery
(of a "ping" event) with a green check mark. If not, verify all the settings and consult
the logs of the `actions-runner-controller-github-webhook-server` pod.
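
For example (assuming the component is installed in the `actions-runner-system` namespace; adjust the namespace to
match your deployment):

```bash
kubectl logs --namespace actions-runner-system \
  deployment/actions-runner-controller-github-webhook-server --tail=100
```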
### Configuring Webhook Driven Autoscaling

The `HorizontalRunnerAutoscaler scaleUpTriggers.duration` (see [Webhook Driven Scaling documentation](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#webhook-driven-scaling)) is
controlled by the `webhook_startup_timeout` setting for each Runner. The purpose of this timeout is to ensure, in
case a job cancellation or termination event gets missed, that the resulting idle runner eventually gets terminated.
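
For example (a sketch; the timeout value is illustrative, not a recommendation for every workload):

```hcl
runners = {
  infra-runner = {
    # ... other runner settings ...
    webhook_driven_scaling_enabled = true
    webhook_startup_timeout        = "30m" # expire unclaimed capacity after 30 minutes
  }
}
```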
#### How the Autoscaler Determines the Desired Runner Pool Size

When a job is queued, a `capacityReservation` is created for it. The HRA (Horizontal Runner Autoscaler) sums up all
the capacity reservations to calculate the desired size of the runner pool, subject to the limits of `minReplicas`
and `maxReplicas`. The idea is that a `capacityReservation` is deleted when a job is completed or canceled, and the
pool size will be equal to `jobsStarted - jobsFinished`. However, it can happen that a job will finish without the
HRA being successfully notified about it, so as a safety measure, the `capacityReservation` will expire after a
configurable amount of time, at which point it will be deleted without regard to whether the job has finished. This
ensures that an idle runner pool will eventually scale down to `minReplicas`.
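
In other words, a simplified sketch of the calculation is:

```
desiredReplicas = min(maxReplicas, max(minReplicas, sum(unexpired capacityReservations)))
```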

However, there are some problems with this scheme. In theory, `webhook_startup_timeout` should only need to be long
enough to cover the delay between the time the HRA starts a scale-up request and the time the runner actually starts,
is allocated to the runner pool, and picks up a job to run. But there are edge cases that seem not to be covered
properly (see [actions-runner-controller issue #2466](https://github.com/actions/actions-runner-controller/issues/2466)). As a result, we recommend setting `webhook_startup_timeout` to
a period long enough to cover the full time a job may have to wait between the time it is queued and the time it
actually starts. Consider this scenario:

- You set `maxReplicas = 5`
- Some trigger starts 20 jobs, each of which takes 5 minutes to run
- The replica pool scales up to 5, and the first 5 jobs run
- 5 minutes later, the next 5 jobs run, and so on
- The last set of 5 jobs will have to wait 15 minutes to start because of the previous jobs
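
In this scenario, the last jobs can wait 15 minutes through no fault of the scaling machinery, so `webhook_startup_timeout` would need to be something like `20m` (the 15-minute queue wait plus headroom for scale-up) to avoid expiring the reservation of a job that is still legitimately waiting.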

The HRA is designed to handle this situation by updating the expiration time of the `capacityReservation` of any
job stuck waiting because the pool has scaled up to `maxReplicas`, but as discussed in issue #2466 linked above,
that does not seem to be working correctly as of version 0.27.2.

For now, our recommendation is to set `webhook_startup_timeout` to a duration long enough to cover the time a job
may have to wait in the queue for a runner to become available due to there being more jobs than `maxReplicas`.
Alternatively, you could set `maxReplicas` to a big enough number that there will always be a runner for every
queued job, in which case the duration only needs to be long enough to allow for all the scale-up activities (such
as launching new EKS nodes as well as starting new pods) to finish. Remember, when everything works properly, the
HRA will scale down the pool as jobs finish, so there is little cost to setting a long duration.

| <aname="input_regex_replace_chars"></a> [regex\_replace\_chars](#input\_regex\_replace\_chars)| Terraform regular expression (regex) string.<br>Characters matching the regex will be removed from the ID elements.<br>If not set, `"/[^a-zA-Z0-9-]/"` is used to remove all characters other than hyphens, letters and digits. |`string`|`null`| no |
| <aname="input_resources"></a> [resources](#input\_resources)| The cpu and memory of the deployment's limits and requests. | <pre>object({<br> limits = object({<br> cpu = string<br> memory = string<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })</pre> | n/a | yes |
| <a name="input_runners"></a> [runners](#input\_runners) | Map of Action Runner configurations, with the key being the name of the runner. Please note that the name must be in<br>kebab-case.<br><br>For example:<pre>hcl<br>organization_runner = {<br> type = "organization" # can be either 'organization' or 'repository'<br> dind_enabled: false # A Docker sidecar container will be deployed<br> scope = "ACME" # org name for Organization runners, repo name for Repository runners<br> group = "core-automation" # Optional. Assigns the runners to a runner group, for access control.<br> image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'<br> node_selector = {} # optional Kubernetes node selector map for the runner pods<br> tolerations = [] # optional Kubernetes tolerations list for the runner pods<br> scale_down_delay_seconds = 300<br> min_replicas = 1<br> max_replicas = 5<br> webhook_driven_scaling_enabled = bool # Recommended to be true to enable event-based scaling of runner pool<br> webhook_startup_timeout = optional(string, null) # Duration after which capacity for a queued job will be discarded<br><br> labels = [<br> "Ubuntu",<br> "core-automation",<br> ]<br>}</pre> | <pre>map(object({<br> type = string<br> scope = string<br> group = optional(string, null)<br> image = optional(string, "")<br> dind_enabled = bool<br> node_selector = optional(map(string), {})<br> tolerations = optional(list(object({<br> key = string<br> operator = string<br> value = optional(string, null)<br> effect = string<br> })), [])<br> scale_down_delay_seconds = number<br> min_replicas = number<br> max_replicas = number<br> busy_metrics = optional(object({<br> scale_up_threshold = string<br> scale_down_threshold = string<br> scale_up_adjustment = optional(string)<br> scale_down_adjustment = optional(string)<br> scale_up_factor = optional(string)<br> scale_down_factor = optional(string)<br> }))<br> webhook_driven_scaling_enabled = bool<br> webhook_startup_timeout = optional(string, null)<br> pull_driven_scaling_enabled = bool<br> labels = list(string)<br> storage = optional(string, null)<br> pvc_enabled = optional(string, false)<br> resources = object({<br> limits = object({<br> cpu = string<br> memory = string<br> ephemeral_storage = optional(string, null)<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })<br> }))</pre> | n/a | yes |
| <aname="input_s3_bucket_arns"></a> [s3\_bucket\_arns](#input\_s3\_bucket\_arns)| List of ARNs of S3 Buckets to which the runners will have read-write access to. |`list(string)`|`[]`| no |
| <aname="input_ssm_github_secret_path"></a> [ssm\_github\_secret\_path](#input\_ssm\_github\_secret\_path)| The path in SSM to the GitHub app private key file contents or GitHub PAT token. |`string`|`""`| no |
| <aname="input_ssm_github_webhook_secret_token_path"></a> [ssm\_github\_webhook\_secret\_token\_path](#input\_ssm\_github\_webhook\_secret\_token\_path)| The path in SSM to the GitHub Webhook Secret token. |`string`|`""`| no |
| <aname="input_tenant"></a> [tenant](#input\_tenant)| ID element \_(Rarely used, not included by default)\_. A customer identifier, indicating who this instance of a resource is for |`string`|`null`| no |
| <aname="input_timeout"></a> [timeout](#input\_timeout)| Time in seconds to wait for any individual kubernetes operation (like Jobs for hooks). Defaults to `300` seconds |`number`|`null`| no |
| <aname="input_wait"></a> [wait](#input\_wait)| Will wait until all resources are in a ready state before marking the release as successful. It will wait for as long as `timeout`. Defaults to `true`. |`bool`|`null`| no |
| <aname="input_webhook"></a> [webhook](#input\_webhook)| Configuration for the GitHub Webhook Server.<br>`hostname_template` is the `format()` string to use to generate the hostname via `format(var.hostname_template, var.tenant, var.stage, var.environment)`"<br>Typically something like `"echo.%[3]v.%[2]v.example.com"`.<br>`queue_limit` is the maximum number of webhook events that can be queued up processing by the autoscaler.<br>When the queue gets full, webhook events will be dropped (status 500). | <pre>object({<br> enabled = bool<br> hostname_template = string<br> queue_limit = optional(number, 100)<br> })</pre> | <pre>{<br> "enabled": false,<br> "hostname_template": null,<br> "queue_limit": 100<br>}</pre> | no |