Commit da161e7: [eks/actions-runner-controller] Multiple bug fixes and enhancements (cloudposse/terraform-aws-components#1075)

1 parent 246cb7f. 10 files changed: +410 / -146 lines.

src/CHANGELOG.md (126 additions, 0 deletions)
## PR [#1075](https://github.com/cloudposse/terraform-aws-components/pull/1075)

New Features:

- Add support for
  [scheduled overrides](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides)
  of Runner Autoscaler min and max replicas.
- Add option `tmpfs_enabled` to have runners use RAM-backed ephemeral storage (`tmpfs`, `emptyDir.medium: Memory`)
  instead of disk-backed storage.
- Add `wait_for_docker_seconds` to allow configuration of the time to wait for the Docker daemon to be ready before
  starting the runner.
- Add the ability to have the runner Pods add annotations to themselves once they start running a job. (Actually
  released in release 1.454.0, but not documented until now.)
Changes:

- Previously, `syncPeriod`, which sets the period in which the controller reconciles the desired runners count, was set
  to 120 seconds in `resources/values.yaml`. This setting has been removed, reverting to the default value of 1 minute.
  You can still set this value by setting `syncPeriod` in the `values.yaml` file or by setting `syncPeriod` in
  `var.chart_values`.
- Previously, `RUNNER_GRACEFUL_STOP_TIMEOUT` was hardcoded to 90 seconds. That has been reduced to 80 seconds to expand
  the buffer between that and forceful termination from 10 seconds to 20 seconds, increasing the chances the runner will
  successfully deregister itself.
- The inaccurately named `webhook_startup_timeout` has been replaced with `max_duration`. `webhook_startup_timeout` is
  still supported for backward compatibility, but is deprecated.
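As an illustration of the last point about `syncPeriod`, a stack configuration restoring the previous 120-second setting might look like the sketch below. The stack layout and component name are hypothetical; the component merges `chart_values` into the Helm release values.

```yaml
# Hypothetical stack configuration; `chart_values` is merged into the
# Helm release values for actions-runner-controller.
components:
  terraform:
    eks/actions-runner-controller:
      vars:
        chart_values:
          # Go-style duration string; "2m" restores the previous 120 seconds
          syncPeriod: 2m
```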
Bugfixes:

- Create and deploy the webhook secret when an existing secret is not supplied.
- Restore proper order of operations in creating resources (broken in release 1.454.0 (PR #1055)).
- If `docker_storage` is set and `dockerdWithinRunnerContainer` is `true` (which is hardcoded to be the case), properly
  mount the docker storage volume into the runner container rather than the (non-existent) docker sidecar container.
### Discussion

#### Scheduled overrides

Scheduled overrides allow you to set different min and max replica values for the runner autoscaler at different times.
This can be useful if you have predictable patterns of load on your runners. For example, you might want to scale down
to zero at night and scale up during the day. This feature is implemented by adding a `scheduled_overrides` field to the
`var.runners` map.

See the
[Actions Runner Controller documentation](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides)
for details on how they work and how to set them up.
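As a rough illustration only, a `scheduled_overrides` entry might look like the following. The key names here are hypothetical, loosely mirroring the upstream `HorizontalRunnerAutoscaler` `scheduledOverrides` spec; consult the component's variable definitions for the exact shape accepted by `var.runners`.

```yaml
# Hypothetical sketch: scale to zero over the weekend, every week.
# Field names are illustrative, not confirmed against the component.
scheduled_overrides:
  - start_time: "2024-06-01T00:00:00+00:00" # a Saturday
    end_time: "2024-06-03T00:00:00+00:00"   # the following Monday
    recurrence_rule:
      frequency: Weekly
    min_replicas: 0
    max_replicas: 0
```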
#### Use RAM instead of Disk via `tmpfs_enabled`

The standard `gp3` EBS volume used for an EC2 instance's disk storage is limited (unless you pay extra) to 3000 IOPS and
125 MB/s throughput. This is fine for average workloads, but it does not scale with instance size. A `.48xlarge`
instance could host 90 Pods, but all 90 would still be sharing the same single 3000 IOPS and 125 MB/s throughput EBS
volume attached to the host. This can lead to severe performance issues, as the whole Node gets locked up waiting for
disk I/O.

To mitigate this issue, we have added the `tmpfs_enabled` option to the `runners` map. When set to `true`, the runner
Pods will use RAM-backed ephemeral storage (`tmpfs`, `emptyDir.medium: Memory`) instead of disk-backed storage. This
means the Pod's impact on the Node's disk I/O is limited to the overhead required to launch and manage the Pod (e.g.
downloading the container image and writing logs to the disk). This can be a significant performance improvement,
allowing you to run more Pods on a single Node without running into disk I/O bottlenecks. Without this feature enabled,
you may be limited to running something like 14 Runners on an instance, regardless of instance size, due to disk I/O
limits. With this feature enabled, you may be able to run 50-100 Runners on a single instance.

The trade-off is that the Pod's data is stored in RAM, which increases its memory usage. Be sure to increase the amount
of memory allocated to the runner Pod to account for this. This is generally not a problem, as Runners typically use a
small enough amount of disk space that it can reasonably be stored in the RAM allocated to a single CPU in an EC2
instance, so it is the CPU that remains the limiting factor in how many Runners can be run on an instance.
:::warning You must configure a memory request for the runner Pod

When using `tmpfs_enabled`, you must configure a memory request for the runner Pod. If you do not, a single Pod would be
allowed to consume half the Node's memory just for its disk storage.

:::
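A sketch of a `runners` map entry combining `tmpfs_enabled` with an explicit memory request might look like this. The `resources` shape shown is an assumption for illustration; check the component's variable definitions for the exact keys it accepts.

```yaml
# Illustrative only: enable RAM-backed storage and size memory to cover
# both the job's working set and the runner's now-RAM-backed "disk".
runners:
  default:
    tmpfs_enabled: true
    resources:
      requests:
        memory: 2Gi
      limits:
        memory: 4Gi
```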
#### Configure startup timeout via `wait_for_docker_seconds`

When the runner starts and Docker-in-Docker is enabled, the runner waits for the Docker daemon to be ready before
marking itself ready to run jobs. It does this by polling the Docker daemon every second until it is ready.
The default timeout is 120 seconds. If the Docker daemon is not ready within that time, the runner will exit
with an error. You can configure this timeout by setting `wait_for_docker_seconds` in the `runners` map.

As a general rule, the Docker daemon should be ready within a few seconds of the runner starting. However, particularly
when there are disk I/O issues (see the `tmpfs_enabled` feature above), the Docker daemon may take longer to respond.
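For example, giving dockerd extra time to come up on an I/O-constrained Node might look like the following sketch (the surrounding `runners` map shape is illustrative):

```yaml
# Illustrative only: raise the Docker-readiness timeout from the
# 120-second default to 5 minutes for this runner pool.
runners:
  default:
    wait_for_docker_seconds: 300
```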
#### Add annotations to runner Pods once they start running a job

You can now configure the runner Pods to add annotations to themselves once they start running a job. The idea is that
idle Pods can allow themselves to be interrupted, but then mark themselves as uninterruptible once they
start running a job. This is done by setting the `running_pod_annotations` field in the `runners` map. For example:

```yaml
running_pod_annotations:
  # Prevent Karpenter from evicting or disrupting the worker pods while they are running jobs
  # As of 0.37.0, this is not 100% effective due to race conditions.
  "karpenter.sh/do-not-disrupt": "true"
```
As noted in the comments above, this was intended to prevent Karpenter from evicting or disrupting the worker pods while
they are running jobs, while leaving Karpenter free to interrupt idle Runners. However, as of Karpenter 0.37.0, this is
not 100% effective due to race conditions: Karpenter may decide to terminate the Node the Pod is running on but not
signal the Pod before it accepts a job and starts running it. Without the availability of transactions or atomic
operations, this is a difficult problem to solve, and it will probably require a more complex solution than just adding
annotations to the Pods. Nevertheless, this feature remains available for use in other contexts, as well as in the hope
that it will eventually work with Karpenter.
#### Bugfix: Deploy webhook secret when existing secret is not supplied

Because deploying secrets with Terraform causes the secrets to be stored unencrypted in the Terraform state file, we
give users the option of creating the configuration secret externally (e.g. via
[SOPS](https://github.com/getsops/sops)). Unfortunately, at some distant time in the past, when we enabled this option,
we broke this component: the webhook secret was no longer being deployed when the user did not supply an
existing secret. This PR fixes that.

The consequence of this bug was that, since the webhook secret was not being deployed, the webhook did not reject
unauthorized requests. This could have allowed an attacker to trigger the webhook and perform a denial-of-service
attack by killing jobs as soon as they were accepted from the queue. A more practical, though unintentional,
consequence was that if a repo webhook was installed alongside an org webhook, and one of the webhooks was missing the
secret or had the wrong secret, the component could not guard against the same payload being delivered twice.
#### Bugfix: Restore proper order of operations in creating resources

In release 1.454.0 (PR [#1055](https://github.com/cloudposse/terraform-aws-components/pull/1055)), we reorganized the
RunnerDeployment template in the Helm chart to put the RunnerDeployment resource first, since it is the most important
resource, merely to improve readability. Unfortunately, the order of operations in creating resources is important, and
this change broke the deployment by deploying the RunnerDeployment before creating the resources it depends on. This PR
restores the proper order of operations.