Commit 314c1f0

[eks/actions-runner-controller]: support Runner Group, webhook queue size (cloudposse/terraform-aws-components#621)
1 parent 3b4eb92 commit 314c1f0

File tree: 6 files changed (+118, -22 lines)

src/README.md

Lines changed: 82 additions & 8 deletions
@@ -74,7 +74,7 @@ components:
 cpu: 100m
 memory: 128Mi
 webhook_driven_scaling_enabled: true
-webhook_startup_timeout: "2m"
+webhook_startup_timeout: "30m"
 pull_driven_scaling_enabled: false
 # Labels are not case-sensitive to GitHub, but *are* case-sensitive
 # to the webhook based autoscaler, which requires exact matches
@@ -111,6 +111,7 @@ components:
 # image: summerwind/actions-runner-dind
 # # `scope` is org name for Organization runners, repo name for Repository runners
 # scope: "org/infra"
+# group: "ArmRunners"
 # min_replicas: 1
 # max_replicas: 20
 # scale_down_delay_seconds: 100
@@ -122,7 +123,7 @@ components:
 # cpu: 100m
 # memory: 128Mi
 # webhook_driven_scaling_enabled: true
-# webhook_startup_timeout: "2m"
+# webhook_startup_timeout: "30m"
 # pull_driven_scaling_enabled: false
 # # Labels are not case-sensitive to GitHub, but *are* case-sensitive
 # # to the webhook based autoscaler, which requires exact matches
@@ -196,7 +197,7 @@ github_app_installation_id: "12345"
 OR (obsolete)
 - A PAT with the scope outlined in [this document](https://github.com/actions-runner-controller/actions-runner-controller#deploying-using-pat-authentication).
 Save this to the value specified by `ssm_github_token_path` using the following command, adjusting the
-AWS_PROFILE to refer to the `admin` role in the account to which you are deploying the runner controller:
+AWS\_PROFILE to refer to the `admin` role in the account to which you are deploying the runner controller:
 
 ```
 AWS_PROFILE=acme-mgmt-use2-auto-admin chamber write github_runners controller_github_app_secret -- "<PAT>"
@@ -214,10 +215,21 @@ Store this key in AWS SSM under the same path specified by `ssm_github_webhook_s
 ssm_github_webhook_secret_token_path: "/github_runners/github_webhook_secret"
 ```
 
-### Using Webhook Driven Autoscaling
+### Using Runner Groups
 
-To use the Webhook Driven autoscaling, you must also install the GitHub organization-level webhook after deploying the component
-(specifically, the webhook server). The URL for the webhook is determined by the `webhook.hostname_template` and where
+GitHub supports grouping runners into distinct [Runner Groups](https://docs.github.com/en/actions/hosting-your-own-runners/managing-access-to-self-hosted-runners-using-groups), which allow you to have different access controls
+for different runners. Read the linked documentation about creating and configuring Runner Groups, which you must do
+through the GitHub Web UI. If you choose to create Runner Groups, you can assign one or more Runner pools (from the
+`runners` map) to groups (only one group per runner pool) by including `group: <Runner Group Name>` in the runner
+configuration. We recommend including it immediately after `scope`.
+
+### Using Webhook Driven Autoscaling (recommended)
+
+We recommend using Webhook Driven Autoscaling until GitHub releases their own autoscaling solution (said to be "in the works" as of April 2023).
+
+To use the Webhook Driven Autoscaling, in addition to setting `webhook_driven_scaling_enabled` to `true`, you must
+also install the GitHub organization-level webhook after deploying the component (specifically, the webhook server).
+The URL for the webhook is determined by the `webhook.hostname_template` and where
 it is deployed. Recommended URL is `https://gha-webhook.[environment].[stage].[tenant].[service-discovery-domain]`.
 
 As a GitHub organization admin, go to `https://github.com/organizations/[organization]/settings/hooks`, and then:
@@ -236,6 +248,68 @@ After the webhook is created, select "edit" for the webhook and go to the "Recen
 (of a "ping" event) with a green check mark. If not, verify all the settings and consult
 the logs of the `actions-runner-controller-github-webhook-server` pod.
 
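As background for debugging deliveries: GitHub authenticates each webhook delivery by signing the request body with HMAC-SHA256 using the shared webhook secret and sending the digest in the `X-Hub-Signature-256` header, which the webhook server checks before acting on the event. A minimal sketch of that verification (the secret and payload values here are hypothetical):

```python
import hashlib
import hmac

def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Return True if signature_header matches the HMAC-SHA256 of body.

    GitHub sends the signature as 'X-Hub-Signature-256: sha256=<hexdigest>'.
    """
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information via timing
    return hmac.compare_digest(expected, signature_header)

# A delivery signed with the shared secret verifies; a tampered body does not.
secret = "random-string-from-ssm"   # hypothetical webhook secret value
body = b'{"action":"queued"}'
sig = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(verify_github_signature(secret, body, sig))        # True
print(verify_github_signature(secret, b'{"x":1}', sig))  # False
```

A delivery that fails this check never reaches the autoscaler, which is why a wrong or stale secret shows up as events silently having no effect.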
+### Configuring Webhook Driven Autoscaling
+
+The `HorizontalRunnerAutoscaler scaleUpTriggers.duration` (see [Webhook Driven Scaling documentation](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#webhook-driven-scaling)) is
+controlled by the `webhook_startup_timeout` setting for each Runner. The purpose of this timeout is to ensure, in
+case a job cancellation or termination event gets missed, that the resulting idle runner eventually gets terminated.
+
+#### How the Autoscaler Determines the Desired Runner Pool Size
+
+When a job is queued, a `capacityReservation` is created for it. The HRA (Horizontal Runner Autoscaler) sums up all
+the capacity reservations to calculate the desired size of the runner pool, subject to the limits of `minReplicas`
+and `maxReplicas`. The idea is that a `capacityReservation` is deleted when a job is completed or canceled, and the
+pool size will be equal to `jobsStarted - jobsFinished`. However, it can happen that a job will finish without the
+HRA being successfully notified about it, so as a safety measure, the `capacityReservation` will expire after a
+configurable amount of time, at which point it will be deleted without regard to the job being finished. This
+ensures that eventually an idle runner pool will scale down to `minReplicas`.
+
+However, there are some problems with this scheme. In theory, `webhook_startup_timeout` should only need to be long
+enough to cover the delay between the time the HRA starts a scale up request and the time the runner actually starts,
+is allocated to the runner pool, and picks up a job to run. But there are edge cases that seem not to be covered
+properly (see [actions-runner-controller issue #2466](https://github.com/actions/actions-runner-controller/issues/2466)). As a result, we recommend setting `webhook_startup_timeout` to
+a period long enough to cover the full time a job may have to wait between the time it is queued and the time it
+actually starts. Consider this scenario:
+- You set `maxReplicas = 5`
+- Some trigger starts 20 jobs, each of which takes 5 minutes to run
+- The replica pool scales up to 5, and the first 5 jobs run
+- 5 minutes later, the next 5 jobs run, and so on
+- The last set of 5 jobs will have to wait 15 minutes to start because of the previous jobs
+
+The HRA is designed to handle this situation by updating the expiration time of the `capacityReservation` of any
+job stuck waiting because the pool has scaled up to `maxReplicas`, but as discussed in issue #2466 linked above,
+that does not seem to be working correctly as of version 0.27.2.
+
+For now, our recommendation is to set `webhook_startup_timeout` to a duration long enough to cover the time the job
+may have to wait in the queue for a runner to become available due to there being more jobs than `maxReplicas`.
+Alternatively, you could set `maxReplicas` to a big enough number that there will always be a runner for every
+queued job, in which case the duration only needs to be long enough to allow for all the scale-up activities (such
+as launching new EKS nodes as well as starting new pods) to finish. Remember, when everything works properly, the
+HRA will scale down the pool as jobs finish, so there is little cost to setting a long duration.
+
+### Recommended `webhook_startup_timeout` Duration
+
+#### Consequences of Too Short of a `webhook_startup_timeout` Duration
+
+If you set `webhook_startup_timeout` to too short a duration, the Horizontal Runner Autoscaler will cancel capacity
+reservations for jobs that have not yet run, and the pool will be too small. This will be most serious if you have
+set `minReplicas = 0` because in this case, jobs will be left in the queue indefinitely. With a higher value of
+`minReplicas`, the pool will eventually make it through all the queued jobs, but not as quickly as intended due to
+the incorrectly reduced capacity.
+
+#### Consequences of Too Long of a `webhook_startup_timeout` Duration
+
+If the Horizontal Runner Autoscaler misses a scale-down event (which can happen because events do not have delivery
+guarantees), a runner may be left running idly for as long as the `webhook_startup_timeout` duration. The only
+problem with this is the added expense of leaving the idle runner running.
+
+#### Recommendation
+
+Therefore, for lightly used runner pools, we recommend setting `webhook_startup_timeout` to `"30m"`. For heavily
+used pools, find the typical or maximum length of a job, multiply by the number of jobs likely to be queued in an
+hour, and divide by `maxReplicas`, then round up. As a rule of thumb, we recommend setting `maxReplicas` high enough
+that jobs never wait on the queue more than an hour and setting `webhook_startup_timeout` to `"2h30m"`. Monitor your
+usage and adjust accordingly.
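The queue-wait arithmetic behind this recommendation can be sketched directly. This is an illustration of the reasoning above, not part of the component; the workload numbers in the second example are hypothetical:

```python
import math

def max_queue_wait_minutes(num_jobs: int, max_replicas: int, job_minutes: float) -> float:
    """Worst-case queue wait when num_jobs arrive at once: jobs run in waves
    of max_replicas, and the last wave waits for all earlier waves to finish."""
    waves = math.ceil(num_jobs / max_replicas)
    return (waves - 1) * job_minutes

def recommended_timeout_minutes(job_minutes: float, jobs_per_hour: int, max_replicas: int) -> int:
    """Rule of thumb from above: job length times jobs queued per hour,
    divided by maxReplicas, rounded up."""
    return math.ceil(job_minutes * jobs_per_hour / max_replicas)

# The scenario above: maxReplicas = 5, 20 jobs of 5 minutes each
print(max_queue_wait_minutes(20, 5, 5))         # 15 -> the last wave waits 15 minutes
# Hypothetical heavy pool: 5-minute jobs, 120 queued per hour, maxReplicas = 20
print(recommended_timeout_minutes(5, 120, 20))  # 30 -> webhook_startup_timeout of at least "30m"
```

The first function reproduces the 15-minute wait in the bulleted scenario; the second turns the rule of thumb into a number you can compare against your chosen `webhook_startup_timeout`.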
 
 ### Updating CRDs
 
@@ -337,7 +411,7 @@ Consult [actions-runner-controller](https://github.com/actions-runner-controller
 | <a name="input_regex_replace_chars"></a> [regex\_replace\_chars](#input\_regex\_replace\_chars) | Terraform regular expression (regex) string.<br>Characters matching the regex will be removed from the ID elements.<br>If not set, `"/[^a-zA-Z0-9-]/"` is used to remove all characters other than hyphens, letters and digits. | `string` | `null` | no |
 | <a name="input_region"></a> [region](#input\_region) | AWS Region. | `string` | n/a | yes |
 | <a name="input_resources"></a> [resources](#input\_resources) | The cpu and memory of the deployment's limits and requests. | <pre>object({<br> limits = object({<br> cpu = string<br> memory = string<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })</pre> | n/a | yes |
-| <a name="input_runners"></a> [runners](#input\_runners) | Map of Action Runner configurations, with the key being the name of the runner. Please note that the name must be in<br>kebab-case.<br><br>For example:<pre>hcl<br>organization_runner = {<br> type = "organization" # can be either 'organization' or 'repository'<br> dind_enabled: false # A Docker sidecar container will be deployed<br> image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'<br> scope = "ACME" # org name for Organization runners, repo name for Repository runners<br> scale_down_delay_seconds = 300<br> min_replicas = 1<br> max_replicas = 5<br> busy_metrics = {<br> scale_up_threshold = 0.75<br> scale_down_threshold = 0.25<br> scale_up_factor = 2<br> scale_down_factor = 0.5<br> }<br> labels = [<br> "Ubuntu",<br> "core-automation",<br> ]<br>}</pre> | <pre>map(object({<br> type = string<br> scope = string<br> image = optional(string, "")<br> dind_enabled = bool<br> node_selector = optional(map(string), {})<br> tolerations = optional(list(object({<br> key = string<br> operator = string<br> value = optional(string, null)<br> effect = string<br> })), [])<br> scale_down_delay_seconds = number<br> min_replicas = number<br> max_replicas = number<br> busy_metrics = optional(object({<br> scale_up_threshold = string<br> scale_down_threshold = string<br> scale_up_adjustment = optional(string)<br> scale_down_adjustment = optional(string)<br> scale_up_factor = optional(string)<br> scale_down_factor = optional(string)<br> }))<br> webhook_driven_scaling_enabled = bool<br> webhook_startup_timeout = optional(string, null)<br> pull_driven_scaling_enabled = bool<br> labels = list(string)<br> storage = optional(string, null)<br> pvc_enabled = optional(string, false)<br> resources = object({<br> limits = object({<br> cpu = string<br> memory = string<br> ephemeral_storage = optional(string, null)<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })<br> }))</pre> | n/a | yes |
+| <a name="input_runners"></a> [runners](#input\_runners) | Map of Action Runner configurations, with the key being the name of the runner. Please note that the name must be in<br>kebab-case.<br><br>For example:<pre>hcl<br>organization_runner = {<br> type = "organization" # can be either 'organization' or 'repository'<br> dind_enabled: false # A Docker sidecar container will be deployed<br> scope = "ACME" # org name for Organization runners, repo name for Repository runners<br> group = "core-automation" # Optional. Assigns the runners to a runner group, for access control.<br> image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'<br> node_selector = {} # optional Kubernetes node selector map for the runner pods<br> tolerations = [] # optional Kubernetes tolerations list for the runner pods<br> scale_down_delay_seconds = 300<br> min_replicas = 1<br> max_replicas = 5<br> webhook_driven_scaling_enabled = bool # Recommended to be true to enable event-based scaling of runner pool<br> webhook_startup_timeout = optional(string, null) # Duration after which capacity for a queued job will be discarded<br><br> labels = [<br> "Ubuntu",<br> "core-automation",<br> ]<br>}</pre> | <pre>map(object({<br> type = string<br> scope = string<br> group = optional(string, null)<br> image = optional(string, "")<br> dind_enabled = bool<br> node_selector = optional(map(string), {})<br> tolerations = optional(list(object({<br> key = string<br> operator = string<br> value = optional(string, null)<br> effect = string<br> })), [])<br> scale_down_delay_seconds = number<br> min_replicas = number<br> max_replicas = number<br> busy_metrics = optional(object({<br> scale_up_threshold = string<br> scale_down_threshold = string<br> scale_up_adjustment = optional(string)<br> scale_down_adjustment = optional(string)<br> scale_up_factor = optional(string)<br> scale_down_factor = optional(string)<br> }))<br> webhook_driven_scaling_enabled = bool<br> webhook_startup_timeout = optional(string, null)<br> pull_driven_scaling_enabled = bool<br> labels = list(string)<br> storage = optional(string, null)<br> pvc_enabled = optional(string, false)<br> resources = object({<br> limits = object({<br> cpu = string<br> memory = string<br> ephemeral_storage = optional(string, null)<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })<br> }))</pre> | n/a | yes |
@@ -346,7 +420,7 @@ Consult [actions-runner-controller](https://github.com/actions-runner-controller
 | <a name="input_tenant"></a> [tenant](#input\_tenant) | ID element \_(Rarely used, not included by default)\_. A customer identifier, indicating who this instance of a resource is for | `string` | `null` | no |
 | <a name="input_timeout"></a> [timeout](#input\_timeout) | Time in seconds to wait for any individual kubernetes operation (like Jobs for hooks). Defaults to `300` seconds | `number` | `null` | no |
 | <a name="input_wait"></a> [wait](#input\_wait) | Will wait until all resources are in a ready state before marking the release as successful. It will wait for as long as `timeout`. Defaults to `true`. | `bool` | `null` | no |
-| <a name="input_webhook"></a> [webhook](#input\_webhook) | Configuration for the GitHub Webhook Server.<br>`hostname_template` is the `format()` string to use to generate the hostname via `format(var.hostname_template, var.tenant, var.stage, var.environment)`"<br>Typically something like `"echo.%[3]v.%[2]v.example.com"`. | <pre>object({<br> enabled = bool<br> hostname_template = string<br> })</pre> | <pre>{<br> "enabled": false,<br> "hostname_template": null<br>}</pre> | no |
+| <a name="input_webhook"></a> [webhook](#input\_webhook) | Configuration for the GitHub Webhook Server.<br>`hostname_template` is the `format()` string to use to generate the hostname via `format(var.hostname_template, var.tenant, var.stage, var.environment)`"<br>Typically something like `"echo.%[3]v.%[2]v.example.com"`.<br>`queue_limit` is the maximum number of webhook events that can be queued up for processing by the autoscaler.<br>When the queue gets full, webhook events will be dropped (status 500). | <pre>object({<br> enabled = bool<br> hostname_template = string<br> queue_limit = optional(number, 100)<br> })</pre> | <pre>{<br> "enabled": false,<br> "hostname_template": null,<br> "queue_limit": 100<br>}</pre> | no |
 
 ## Outputs
 
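For context on the new `webhook.queue_limit` input in the table above: it bounds how many webhook events can wait for processing, and an event arriving while the queue is full is dropped with status 500. A simplified model of that behavior (an illustration only, not the controller's actual implementation):

```python
from collections import deque

class BoundedEventQueue:
    """Model of a bounded webhook event queue: enqueue is refused when full."""

    def __init__(self, queue_limit: int):
        self.queue_limit = queue_limit
        self.events = deque()

    def enqueue(self, event: str) -> int:
        """Return an HTTP-like status: 200 if the event is accepted, 500 if dropped."""
        if len(self.events) >= self.queue_limit:
            return 500  # queue full: the event is dropped
        self.events.append(event)
        return 200

# With queue_limit = 2, the third event in a burst is dropped
q = BoundedEventQueue(queue_limit=2)
print([q.enqueue(e) for e in ["job1", "job2", "job3"]])  # [200, 200, 500]
```

Dropped events are exactly the missed scale-up/scale-down notifications discussed earlier, so a busy organization may want to raise `queue_limit` above the default of 100.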
src/charts/actions-runner/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.1
+version: 0.1.2
 
 # This chart only deploys Resources for actions-runner-controller, so app version does not really apply.
 # We use Resource API version instead.

src/charts/actions-runner/templates/runnerdeployment.yaml

Lines changed: 16 additions & 2 deletions
@@ -29,6 +29,17 @@ spec:
   # replicas: 1
   template:
     spec:
+      # As of 2023-03-31
+      # Recommended by https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md
+      terminationGracePeriodSeconds: 100
+      env:
+        # RUNNER_GRACEFUL_STOP_TIMEOUT is the time the runner will give itself to try to finish
+        # a job before it gracefully cancels itself in response to a pod termination signal.
+        # It should be less than the terminationGracePeriodSeconds above so that it has time
+        # to report its status and deregister itself from the runner pool.
+        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
+          value: "90"
+
       # You could reserve nodes for runners by labeling and tainting nodes with
       # node-role.kubernetes.io/actions-runner
       # and then adding the following to this RunnerDeployment
@@ -43,10 +54,13 @@ spec:
 
       {{ if eq .Values.type "organization" }}
      organization: {{ .Values.scope }}
-      {{ end }}
+      {{- end }}
       {{ if eq .Values.type "repository" }}
      repository: {{ .Values.scope }}
-      {{ end }}
+      {{- end }}
+      {{ if index .Values "group" }}
+      group: {{ .Values.group }}
+      {{- end }}
       # You can use labels to create subsets of runners.
       # See https://github.com/summerwind/actions-runner-controller#runner-labels
       # and https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow
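The relationship between the two timeouts added in this template matters: the runner's graceful stop window must leave headroom inside the pod's termination grace period for the runner to report status and deregister. A small sanity check one could run against chosen values (the 10-second headroom default here is an assumption for illustration, not a documented requirement):

```python
def graceful_stop_is_safe(runner_graceful_stop_timeout: int,
                          termination_grace_period_seconds: int,
                          headroom_seconds: int = 10) -> bool:
    """True if the runner keeps at least headroom_seconds after its graceful
    stop window to deregister before Kubernetes force-kills the pod."""
    return runner_graceful_stop_timeout + headroom_seconds <= termination_grace_period_seconds

# Values from the template above: a 90s graceful stop inside a 100s grace period
print(graceful_stop_is_safe(90, 100))   # True
# Equal values leave no time to deregister
print(graceful_stop_is_safe(100, 100))  # False
```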
