Commit 0c1f214

Initial commit
1 parent 3dfb8e3 commit 0c1f214

File tree

16 files changed: +2083 -60 lines changed


.github/settings.yml

Lines changed: 2 additions & 6 deletions
```diff
@@ -1,11 +1,7 @@
 # Upstream changes from _extends are only recognized when modifications are made to this file in the default branch.
 _extends: .github
 repository:
-  name: template
-  description: Template for Terraform Components
+  name: aws-eks-actions-runner-controller
+  description: This component creates a Helm release for [actions-runner-controller](https://github
   homepage: https://cloudposse.com/accelerate
   topics: terraform, terraform-component
-
-
-
-
```

CHANGELOG.md

Lines changed: 137 additions & 0 deletions
## Release 1.470.1

Components PR [#1077](https://github.com/cloudposse/terraform-aws-components/pull/1077)

Bugfix:

- Fix templating of document separators in the Helm chart template. Affects users who are not using
  `running_pod_annotations`.

## Release 1.470.0

Components PR [#1075](https://github.com/cloudposse/terraform-aws-components/pull/1075)

New Features:

- Add support for
  [scheduled overrides](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides)
  of Runner Autoscaler min and max replicas.
- Add option `tmpfs_enabled` to have runners use RAM-backed ephemeral storage (`tmpfs`, `emptyDir.medium: Memory`)
  instead of disk-backed storage.
- Add `wait_for_docker_seconds` to allow configuration of the time to wait for the Docker daemon to be ready before
  starting the runner.
- Add the ability to have the runner Pods add annotations to themselves once they start running a job. (Actually
  released in release 1.454.0, but not documented until now.)

Changes:

- Previously, `syncPeriod`, which sets the period in which the controller reconciles the desired runners count, was set
  to 120 seconds in `resources/values.yaml`. This setting has been removed, reverting to the default value of 1 minute.
  You can still set this value via `syncPeriod` in the `values.yaml` file or via `syncPeriod` in `var.chart_values`, as
  shown below.
- Previously, `RUNNER_GRACEFUL_STOP_TIMEOUT` was hardcoded to 90 seconds. That has been reduced to 80 seconds to expand
  the buffer between that and forceful termination from 10 seconds to 20 seconds, increasing the chances the runner will
  successfully deregister itself.
- The inaccurately named `webhook_startup_timeout` has been replaced with `max_duration`. `webhook_startup_timeout` is
  still supported for backward compatibility, but is deprecated.
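As a minimal sketch of restoring the previous 120-second reconcile period through `chart_values`: the stack layout and
the component name `eks/actions-runner-controller` below are illustrative assumptions, and this assumes the chart
accepts a Go-style duration string for `syncPeriod`.

```yaml
components:
  terraform:
    eks/actions-runner-controller:   # hypothetical stack entry for this component
      vars:
        chart_values:
          # Reconcile the desired runner count every 2 minutes (the previously hardcoded behavior).
          syncPeriod: 2m
```
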
Bugfixes:

- Create and deploy the webhook secret when an existing secret is not supplied.
- Restore proper order of operations in creating resources (broken in release 1.454.0, PR #1055).
- If `docker_storage` is set and `dockerdWithinRunnerContainer` is `true` (which is hardcoded to be the case), properly
  mount the docker storage volume into the runner container rather than the (non-existent) docker sidecar container.

### Discussion

#### Scheduled overrides

Scheduled overrides allow you to set different min and max replica values for the runner autoscaler at different times.
This can be useful if you have predictable patterns of load on your runners. For example, you might want to scale down
to zero at night and scale up during the day. This feature is implemented by adding a `scheduled_overrides` field to the
`var.runners` map.

See the
[Actions Runner Controller documentation](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides)
for details on how they work and how to set them up.

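As a hedged sketch, a runner entry that scales to zero on weekends might look like the following. The runner name
`infra-runner` is hypothetical, and the field names mirror the upstream ARC `scheduledOverrides` schema rather than
being confirmed names from this component's `var.runners` definition; check the component's variables for the exact
shape.

```yaml
runners:
  infra-runner:
    min_replicas: 1
    max_replicas: 20
    scheduled_overrides:
      # Illustrative: scale to zero from Saturday 00:00 until Monday 00:00, recurring weekly.
      - start_time: "2024-07-06T00:00:00-05:00"
        end_time: "2024-07-08T00:00:00-05:00"
        recurrence_rule:
          frequency: Weekly
        min_replicas: 0
```

Per the upstream documentation, when overrides overlap, the one listed first takes precedence.
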
#### Use RAM instead of Disk via `tmpfs_enabled`

The standard `gp3` EBS volume used for an EC2 instance's disk storage is limited (unless you pay extra) to 3000 IOPS and
125 MB/s throughput. This is fine for average workloads, but it does not scale with instance size. A `.48xlarge`
instance could host 90 Pods, but all 90 would still be sharing the same single 3000 IOPS and 125 MB/s throughput EBS
volume attached to the host. This can lead to severe performance issues, as the whole Node gets locked up waiting for
disk I/O.

To mitigate this issue, we have added the `tmpfs_enabled` option to the `runners` map. When set to `true`, the runner
Pods will use RAM-backed ephemeral storage (`tmpfs`, `emptyDir.medium: Memory`) instead of disk-backed storage. This
means the Pod's impact on the Node's disk I/O is limited to the overhead required to launch and manage the Pod (e.g.
downloading the container image and writing logs to the disk). This can be a significant performance improvement,
allowing you to run more Pods on a single Node without running into disk I/O bottlenecks. Without this feature enabled,
you may be limited to running something like 14 Runners on an instance, regardless of instance size, due to disk I/O
limits. With this feature enabled, you may be able to run 50-100 Runners on a single instance.

The trade-off is that the Pod's data is stored in RAM, which increases its memory usage. Be sure to increase the amount
of memory allocated to the runner Pod to account for this. This is generally not a problem, as Runners typically use a
small enough amount of disk space that it can reasonably be stored in the RAM allocated to a single CPU in an EC2
instance, so it is the CPU that remains the limiting factor in how many Runners can be run on an instance.

> [!WARNING]
>
> #### You must configure a memory request for the runner Pod
>
> When using `tmpfs_enabled`, you must configure a memory request for the runner Pod. If you do not, a single Pod would
> be allowed to consume half the Node's memory just for its disk storage.

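A hedged sketch of enabling `tmpfs_enabled` together with an explicit memory request follows. The runner name and the
exact shape of the `resources` block are assumptions; the intent is that the memory request covers both the job's
working files (which now live in RAM) and the runner's normal memory use.

```yaml
runners:
  infra-runner:
    # The job workspace uses RAM-backed ephemeral storage instead of the Node's shared EBS volume.
    tmpfs_enabled: true
    resources:
      requests:
        cpu: "1"
        memory: 6Gi   # sized to include the RAM-backed ephemeral storage
      limits:
        memory: 6Gi
```
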
#### Configure startup timeout via `wait_for_docker_seconds`

When the runner starts and Docker-in-Docker is enabled, the runner waits for the Docker daemon to be ready before
registering itself as ready to run jobs. This is done by polling the Docker daemon every second until it is ready.
The default timeout for this is 120 seconds. If the Docker daemon is not ready within that time, the runner will exit
with an error. You can configure this timeout by setting `wait_for_docker_seconds` in the `runners` map.

As a general rule, the Docker daemon should be ready within a few seconds of the runner starting. However, particularly
when there are disk I/O issues (see the `tmpfs_enabled` feature above), the Docker daemon may take longer to respond.

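A minimal sketch, again using the hypothetical runner entry from above; the value is a number of seconds.

```yaml
runners:
  infra-runner:
    # Give dockerd up to 5 minutes to become ready on I/O-constrained Nodes (default: 120).
    wait_for_docker_seconds: 300
```
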
#### Add annotations to runner Pods once they start running a job

You can now configure the runner Pods to add annotations to themselves once they start running a job. The idea is to
allow idle Pods to be interrupted, but have them mark themselves as uninterruptible once they start running a job. This
is done by setting the `running_pod_annotations` field in the `runners` map. For example:

```yaml
running_pod_annotations:
  # Prevent Karpenter from evicting or disrupting the worker pods while they are running jobs.
  # As of Karpenter 0.37.0, this is not 100% effective due to race conditions.
  "karpenter.sh/do-not-disrupt": "true"
```

As noted in the comments above, this was intended to prevent Karpenter from evicting or disrupting the worker pods while
they are running jobs, while leaving Karpenter free to interrupt idle Runners. However, as of Karpenter 0.37.0, this is
not 100% effective due to race conditions: Karpenter may decide to terminate the Node the Pod is running on but not
signal the Pod before it accepts a job and starts running it. Without the availability of transactions or atomic
operations, this is a difficult problem to solve, and will probably require a more complex solution than just adding
annotations to the Pods. Nevertheless, this feature remains available for use in other contexts, as well as in the hope
that it will eventually work with Karpenter.

#### Bugfix: Deploy webhook secret when existing secret is not supplied

Because deploying secrets with Terraform causes the secrets to be stored unencrypted in the Terraform state file, we
give users the option of creating the configuration secret externally (e.g. via
[SOPS](https://github.com/getsops/sops)). Unfortunately, at some distant time in the past, when we enabled this option,
we broke this component insofar as the webhook secret was no longer being deployed when the user did not supply an
existing secret. This PR fixes that.

The consequence of this bug was that, since the webhook secret was not being deployed, the webhook did not reject
unauthorized requests. This could have allowed an attacker to trigger the webhook and perform a DoS attack by killing
jobs as soon as they were accepted from the queue. A more practical and unintentional consequence was that, if a repo
webhook was installed alongside an org webhook, the component would not guard against the webhook receiving the same
payload twice when one of the webhooks was missing the secret or had the wrong secret.

#### Bugfix: Restore proper order of operations in creating resources

In release 1.454.0 (PR [#1055](https://github.com/cloudposse/terraform-aws-components/pull/1055)), we reorganized the
RunnerDeployment template in the Helm chart to put the RunnerDeployment resource first, since it is the most important
resource, merely to improve readability. Unfortunately, the order of operations in creating resources is important, and
this change broke the deployment by deploying the RunnerDeployment before creating the resources it depends on. This PR
restores the proper order of operations.
