Commit de0aade

ARC enhancement, aws-config bugfix, DNS documentation (cloudposse/terraform-aws-components#655)
1 parent 071d107 commit de0aade

6 files changed

+90
-27
lines changed

src/README.md

Lines changed: 50 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -63,6 +63,12 @@ components:
6363
image: summerwind/actions-runner-dind
6464
# `scope` is org name for Organization runners, repo name for Repository runners
6565
scope: "org/infra"
66+
# We can trade the fast-start behavior of min_replicas > 0 for the better guarantee
67+
# that Karpenter will not terminate the runner while it is running a job.
68+
# # Tell Karpenter not to evict this pod. This is only safe when min_replicas is 0.
69+
# # If we do not set this, Karpenter will feel free to terminate the runner while it is running a job.
70+
# pod_annotations:
71+
# karpenter.sh/do-not-evict: "true"
6672
min_replicas: 1
6773
max_replicas: 20
6874
scale_down_delay_seconds: 100
@@ -112,7 +118,11 @@ components:
112118
# # `scope` is org name for Organization runners, repo name for Repository runners
113119
# scope: "org/infra"
114120
# group: "ArmRunners"
115-
# min_replicas: 1
121+
# # Tell Karpenter not to evict this pod. This is only safe when min_replicas is 0.
122+
# # If we do not set this, Karpenter will feel free to terminate the runner while it is running a job.
123+
# pod_annotations:
124+
# karpenter.sh/do-not-evict: "true"
125+
# min_replicas: 0
116126
# max_replicas: 20
117127
# scale_down_delay_seconds: 100
118128
# resources:
@@ -313,6 +323,44 @@ to setting a long duration, and the cost looks even smaller by comparison to the
313323
For lightly used runner pools expecting only short jobs, you can set `webhook_startup_timeout` to `"30m"`.
314324
As a rule of thumb, we recommend setting `maxReplicas` high enough that jobs never wait on the queue more than an hour.
315325

326+
### Interaction with Karpenter or other EKS autoscaling solutions
327+
328+
Kubernetes cluster autoscaling solutions generally expect that a Pod runs a service that can be terminated on one
329+
Node and restarted on another with only a short duration needed to finish processing any in-flight requests. When
330+
the cluster is resized, the cluster autoscaler will do just that. However, GitHub Action Runner Jobs do not fit this
331+
model. If a Pod is terminated in the middle of a job, the job is lost. The likelihood of this happening is increased
332+
by the fact that the Action Runner Controller Autoscaler is expanding and contracting the size of the Runner Pool on
333+
a regular basis, causing the cluster autoscaler to more frequently want to scale up or scale down the EKS cluster,
334+
and, consequently, to move Pods around.
335+
336+
To handle these kinds of situations, Karpenter respects an annotation on the Pod:
337+
338+
```yaml
339+
spec:
340+
template:
341+
metadata:
342+
annotations:
343+
karpenter.sh/do-not-evict: "true"
344+
```
345+
346+
When you set this annotation on the Pod, Karpenter will not evict it. The Pod stays on its current Node, and
347+
Karpenter will not consider that Node for eviction. This is good because the Pod will not be terminated in the
348+
middle of a job. The trade-off is that Karpenter will also not consider the Node for termination,
349+
so the Node will not be removed from the cluster, and the cluster will
350+
not shrink in size when you would like it to.
351+
352+
Since the Runner Pods terminate at the end of the job, this is not a problem for the Pods actually running jobs.
353+
However, if you have set `minReplicas > 0`, then you have some Pods that are just idling, waiting for jobs to be
354+
assigned to them. These Pods are exactly the kind of Pods you want terminated and moved when the cluster is underutilized.
355+
Therefore, when you set `minReplicas > 0`, you should **NOT** set `karpenter.sh/do-not-evict: "true"` on the Pod.
356+
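As an illustration of the rule above, the two safe combinations might look like this in the component's stack configuration. This is only a sketch: the pool names `infra-runner` and `infra-runner-burst` are hypothetical, and only keys that already appear in this README are shown, so other required runner settings are omitted.

```yaml
runners:
  # Warm pool (hypothetical name): idle Pods wait for jobs, so leave them evictable.
  infra-runner:
    scope: "org/infra"
    min_replicas: 1
    max_replicas: 20
    # No pod_annotations here: Karpenter stays free to move idle runners.

  # Scale-from-zero pool (hypothetical name): no Pods are kept idle,
  # so it is safe to tell Karpenter not to evict them.
  infra-runner-burst:
    scope: "org/infra"
    min_replicas: 0
    max_replicas: 20
    pod_annotations:
      karpenter.sh/do-not-evict: "true"
```

The first pool keeps the fast-start behavior at the risk of a runner being terminated mid-job. The second gives up fast start in exchange for protection against eviction.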
357+
We have [requested a feature](https://github.com/actions/actions-runner-controller/issues/2562)
358+
that will allow you to set `karpenter.sh/do-not-evict: "true"` and `minReplicas > 0` at the same time by only
359+
annotating Pods running jobs. Meanwhile, another option is to set `minReplicas = 0` on a schedule using an ARC
360+
Autoscaler [scheduled override](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides).
361+
At present, this component does not support that option, but it could be added in the future if our preferred
362+
solution is not implemented.
363+
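For reference, a scheduled override is configured on the ARC `HorizontalRunnerAutoscaler` resource itself, as described in the linked ARC documentation. The manifest below is a minimal sketch of that approach and is not something this component generates: the resource names and the start and end times are hypothetical, and it assumes the `actions.summerwind.dev/v1alpha1` API used by the summerwind-based controller.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: infra-runner-autoscaler      # hypothetical name
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: infra-runner               # hypothetical name
  minReplicas: 1
  maxReplicas: 20
  scheduledOverrides:
    # Keep no idle runners overnight, repeating every day.
    - startTime: "2023-05-01T20:00:00+00:00"   # hypothetical window
      endTime: "2023-05-02T06:00:00+00:00"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0
```

With a schedule like this, idle runners carrying the `karpenter.sh/do-not-evict` annotation would block Node removal only outside the override window.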
316364
### Updating CRDs
317365

318366
When updating the chart or application version of `actions-runner-controller`, it is possible you will need to install
@@ -413,7 +461,7 @@ Consult [actions-runner-controller](https://github.com/actions-runner-controller
413461
| <a name="input_regex_replace_chars"></a> [regex\_replace\_chars](#input\_regex\_replace\_chars) | Terraform regular expression (regex) string.<br>Characters matching the regex will be removed from the ID elements.<br>If not set, `"/[^a-zA-Z0-9-]/"` is used to remove all characters other than hyphens, letters and digits. | `string` | `null` | no |
414462
| <a name="input_region"></a> [region](#input\_region) | AWS Region. | `string` | n/a | yes |
415463
| <a name="input_resources"></a> [resources](#input\_resources) | The cpu and memory of the deployment's limits and requests. | <pre>object({<br> limits = object({<br> cpu = string<br> memory = string<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })</pre> | n/a | yes |
416-
| <a name="input_runners"></a> [runners](#input\_runners) | Map of Action Runner configurations, with the key being the name of the runner. Please note that the name must be in<br>kebab-case.<br><br>For example:<pre>hcl<br>organization_runner = {<br> type = "organization" # can be either 'organization' or 'repository'<br> dind_enabled: false # A Docker sidecar container will be deployed<br> scope = "ACME" # org name for Organization runners, repo name for Repository runners<br> group = "core-automation" # Optional. Assigns the runners to a runner group, for access control.<br> image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'<br> node_selector = {} # optional Kubernetes node selector map for the runner pods<br> tolerations = [] # optional Kubernetes tolerations list for the runner pods<br> scale_down_delay_seconds = 300<br> min_replicas = 1<br> max_replicas = 5<br> webhook_driven_scaling_enabled = bool # Recommended to be true to enable event-based scaling of runner pool<br> webhook_startup_timeout = optional(string, null) # Duration after which capacity for a queued job will be discarded<br><br> labels = [<br> "Ubuntu",<br> "core-automation",<br> ]<br>}</pre> | <pre>map(object({<br> type = string<br> scope = string<br> group = optional(string, null)<br> image = optional(string, "")<br> dind_enabled = bool<br> node_selector = optional(map(string), {})<br> tolerations = optional(list(object({<br> key = string<br> operator = string<br> value = optional(string, null)<br> effect = string<br> })), [])<br> scale_down_delay_seconds = number<br> min_replicas = number<br> max_replicas = number<br> busy_metrics = optional(object({<br> scale_up_threshold = string<br> scale_down_threshold = string<br> scale_up_adjustment = optional(string)<br> scale_down_adjustment = optional(string)<br> scale_up_factor = optional(string)<br> scale_down_factor = optional(string)<br> }))<br> webhook_driven_scaling_enabled = bool<br> 
webhook_startup_timeout = optional(string, null)<br> pull_driven_scaling_enabled = bool<br> labels = list(string)<br> storage = optional(string, null)<br> pvc_enabled = optional(string, false)<br> resources = object({<br> limits = object({<br> cpu = string<br> memory = string<br> ephemeral_storage = optional(string, null)<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })<br> }))</pre> | n/a | yes |
464+
| <a name="input_runners"></a> [runners](#input\_runners) | Map of Action Runner configurations, with the key being the name of the runner. Please note that the name must be in<br>kebab-case.<br><br>For example:<pre>hcl<br>organization_runner = {<br> type = "organization" # can be either 'organization' or 'repository'<br> dind_enabled: false # A Docker sidecar container will be deployed<br> image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'<br> scope = "ACME" # org name for Organization runners, repo name for Repository runners<br> group = "core-automation" # Optional. Assigns the runners to a runner group, for access control.<br> scale_down_delay_seconds = 300<br> min_replicas = 1<br> max_replicas = 5<br> busy_metrics = {<br> scale_up_threshold = 0.75<br> scale_down_threshold = 0.25<br> scale_up_factor = 2<br> scale_down_factor = 0.5<br> }<br> labels = [<br> "Ubuntu",<br> "core-automation",<br> ]<br>}</pre> | <pre>map(object({<br> type = string<br> scope = string<br> group = optional(string, null)<br> image = optional(string, "")<br> dind_enabled = bool<br> node_selector = optional(map(string), {})<br> pod_annotations = optional(map(string), {})<br> tolerations = optional(list(object({<br> key = string<br> operator = string<br> value = optional(string, null)<br> effect = string<br> })), [])<br> scale_down_delay_seconds = number<br> min_replicas = number<br> max_replicas = number<br> busy_metrics = optional(object({<br> scale_up_threshold = string<br> scale_down_threshold = string<br> scale_up_adjustment = optional(string)<br> scale_down_adjustment = optional(string)<br> scale_up_factor = optional(string)<br> scale_down_factor = optional(string)<br> }))<br> webhook_driven_scaling_enabled = bool<br> webhook_startup_timeout = optional(string, null)<br> pull_driven_scaling_enabled = bool<br> labels = list(string)<br> storage = optional(string, null)<br> pvc_enabled = optional(string, false)<br> resources = 
object({<br> limits = object({<br> cpu = string<br> memory = string<br> ephemeral_storage = optional(string, null)<br> })<br> requests = object({<br> cpu = string<br> memory = string<br> })<br> })<br> }))</pre> | n/a | yes |
417465
| <a name="input_s3_bucket_arns"></a> [s3\_bucket\_arns](#input\_s3\_bucket\_arns) | List of ARNs of S3 Buckets to which the runners will have read-write access to. | `list(string)` | `[]` | no |
418466
| <a name="input_ssm_github_secret_path"></a> [ssm\_github\_secret\_path](#input\_ssm\_github\_secret\_path) | The path in SSM to the GitHub app private key file contents or GitHub PAT token. | `string` | `""` | no |
419467
| <a name="input_ssm_github_webhook_secret_token_path"></a> [ssm\_github\_webhook\_secret\_token\_path](#input\_ssm\_github\_webhook\_secret\_token\_path) | The path in SSM to the GitHub Webhook Secret token. | `string` | `""` | no |

src/charts/actions-runner/templates/horizontalrunnerautoscaler.yaml

File mode changed: 100644 → 100755
Lines changed: 2 additions & 0 deletions
3131
- githubEvent:
3232
workflowJob: {}
3333
amount: 1
34+
{{- if .Values.webhook_startup_timeout }}
3435
duration: "{{ .Values.webhook_startup_timeout }}"
36+
{{- end }}
3537
{{- end }}

src/charts/actions-runner/templates/runnerdeployment.yaml

File mode changed: 100644 → 100755
Lines changed: 9 additions & 1 deletion
2828
# See https://github.com/actions-runner-controller/actions-runner-controller/issues/206#issuecomment-748601907
2929
# replicas: 1
3030
template:
31+
{{- with index .Values "pod_annotations" }}
32+
metadata:
33+
annotations:
34+
{{- toYaml . | nindent 8 }}
35+
{{- end }}
3136
spec:
3237
# As of 2023-03-31
3338
# Recommended by https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md
@@ -89,12 +94,15 @@ spec:
8994
limits:
9095
cpu: {{ .Values.resources.limits.cpu }}
9196
memory: {{ .Values.resources.limits.memory }}
92-
{{- if and .Values.dind_enabled .Values.resources.limits.ephemeral_storage }}
97+
{{- if index .Values.resources.limits "ephemeral_storage" }}
9398
ephemeral-storage: {{ .Values.resources.limits.ephemeral_storage }}
9499
{{- end }}
95100
requests:
96101
cpu: {{ .Values.resources.requests.cpu }}
97102
memory: {{ .Values.resources.requests.memory }}
103+
{{- if index .Values.resources.requests "ephemeral_storage" }}
104+
ephemeral-storage: {{ .Values.resources.requests.ephemeral_storage }}
105+
{{- end }}
98106
{{- if and .Values.dind_enabled .Values.storage }}
99107
dockerVolumeMounts:
100108
- mountPath: /var/lib/docker

src/charts/actions-runner/values.yaml

File mode changed: 100644 → 100755
Lines changed: 13 additions & 11 deletions
55
node_selector:
66
kubernetes.io/os: "linux"
77
kubernetes.io/arch: "amd64"
8-
scope: "example/app"
8+
#scope: "example/app"
99
scale_down_delay_seconds: 300
1010
min_replicas: 1
1111
max_replicas: 2
12-
busy_metrics:
13-
scale_up_threshold: 0.75
14-
scale_down_threshold: 0.25
15-
scale_up_factor: 2
16-
scale_down_factor: 0.5
12+
#busy_metrics:
13+
# scale_up_threshold: 0.75
14+
# scale_down_threshold: 0.25
15+
# scale_up_factor: 2
16+
# scale_down_factor: 0.5
1717
resources:
1818
limits:
1919
cpu: 1.5
2020
memory: 4Gi
21-
ephemeral_storage: "10Gi"
21+
# ephemeral_storage: "10Gi"
2222
requests:
2323
cpu: 0.5
2424
memory: 1Gi
25+
# ephemeral_storage: "10Gi"
26+
2527
storage: "10Gi"
2628
pvc_enabled: false
27-
webhook_driven_scaling_enabled: false
29+
webhook_driven_scaling_enabled: true
2830
webhook_startup_timeout: "30m"
2931
pull_driven_scaling_enabled: false
30-
labels:
31-
- "Ubuntu"
32-
- "core-example"
32+
#labels:
33+
# - "Ubuntu"
34+
# - "core-example"

src/main.tf

File mode changed: 100644 → 100755
Lines changed: 2 additions & 1 deletion
206206
values = compact([
207207
yamlencode({
208208
release_name = each.key
209+
pod_annotations = lookup(each.value, "pod_annotations", "")
209210
service_account_name = module.actions_runner_controller.service_account_name
210211
type = each.value.type
211212
scope = each.value.scope
@@ -219,7 +220,7 @@ module "actions_runner" {
219220
min_replicas = each.value.min_replicas
220221
max_replicas = each.value.max_replicas
221222
webhook_driven_scaling_enabled = each.value.webhook_driven_scaling_enabled
222-
webhook_startup_timeout = coalesce(each.value.webhook_startup_timeout, "${each.value.scale_down_delay_seconds}s") # if webhook_startup_timeout isnt defined, use scale_down_delay_seconds
223+
webhook_startup_timeout = lookup(each.value, "webhook_startup_timeout", "")
223224
pull_driven_scaling_enabled = each.value.pull_driven_scaling_enabled
224225
pvc_enabled = each.value.pvc_enabled
225226
node_selector = each.value.node_selector

src/variables.tf

File mode changed: 100644 → 100755
Lines changed: 14 additions & 12 deletions
142142
organization_runner = {
143143
type = "organization" # can be either 'organization' or 'repository'
144144
dind_enabled: false # A Docker sidecar container will be deployed
145+
image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'
145146
scope = "ACME" # org name for Organization runners, repo name for Repository runners
146147
group = "core-automation" # Optional. Assigns the runners to a runner group, for access control.
147-
image: summerwind/actions-runner # If dind_enabled=true, set this to 'summerwind/actions-runner-dind'
148-
node_selector = {} # optional Kubernetes node selector map for the runner pods
149-
tolerations = [] # optional Kubernetes tolerations list for the runner pods
150148
scale_down_delay_seconds = 300
151149
min_replicas = 1
152150
max_replicas = 5
153-
webhook_driven_scaling_enabled = bool # Recommended to be true to enable event-based scaling of runner pool
154-
webhook_startup_timeout = optional(string, null) # Duration after which capacity for a queued job will be discarded
155-
151+
busy_metrics = {
152+
scale_up_threshold = 0.75
153+
scale_down_threshold = 0.25
154+
scale_up_factor = 2
155+
scale_down_factor = 0.5
156+
}
156157
labels = [
157158
"Ubuntu",
158159
"core-automation",
@@ -162,12 +163,13 @@ variable "runners" {
162163
EOT
163164

164165
type = map(object({
165-
type = string
166-
scope = string
167-
group = optional(string, null)
168-
image = optional(string, "")
169-
dind_enabled = bool
170-
node_selector = optional(map(string), {})
166+
type = string
167+
scope = string
168+
group = optional(string, null)
169+
image = optional(string, "")
170+
dind_enabled = bool
171+
node_selector = optional(map(string), {})
172+
pod_annotations = optional(map(string), {})
171173
tolerations = optional(list(object({
172174
key = string
173175
operator = string
