## Release 1.470.1

Components PR [#1077](https://github.com/cloudposse/terraform-aws-components/pull/1077)

Bugfix:

- Fix templating of document separators in the Helm chart template. This affects users who are not using
  `running_pod_annotations`.

## Release 1.470.0

Components PR [#1075](https://github.com/cloudposse/terraform-aws-components/pull/1075)

New Features:

- Add support for
  [scheduled overrides](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides)
  of Runner Autoscaler min and max replicas.
- Add option `tmpfs_enabled` to have runners use RAM-backed ephemeral storage (`tmpfs`, `emptyDir.medium: Memory`)
  instead of disk-backed storage.
- Add `wait_for_docker_seconds` to allow configuration of the time to wait for the Docker daemon to be ready before
  starting the runner.
- Add the ability to have the runner Pods add annotations to themselves once they start running a job. (Actually
  released in release 1.454.0, but not documented until now.)

Changes:

- Previously, `syncPeriod`, which sets the period at which the controller reconciles the desired runner count, was set
  to 120 seconds in `resources/values.yaml`. This setting has been removed, reverting to the default value of 1 minute.
  You can still set this value via the `syncPeriod` value in the `values.yaml` file or by setting `syncPeriod` in
  `var.chart_values`.
- Previously, `RUNNER_GRACEFUL_STOP_TIMEOUT` was hardcoded to 90 seconds. It has been reduced to 80 seconds, expanding
  the buffer between graceful and forceful termination from 10 seconds to 20 seconds and increasing the chances that
  the runner will successfully deregister itself.
- The inaccurately named `webhook_startup_timeout` has been replaced with `max_duration`. `webhook_startup_timeout` is
  still supported for backward compatibility, but is deprecated.
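If you relied on the old 2-minute reconciliation interval, restoring it is a one-line override. A sketch (the
`chart_values` variable belongs to this component; where the stanza goes depends on your stack configuration):

```yaml
chart_values:
  # Restore the previous reconciliation interval of 120 seconds
  # (the chart default is 1m)
  syncPeriod: 2m
```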

Bugfixes:

- Create and deploy the webhook secret when an existing secret is not supplied.
- Restore the proper order of operations in creating resources (broken in release 1.454.0, PR
  [#1055](https://github.com/cloudposse/terraform-aws-components/pull/1055)).
- If `docker_storage` is set and `dockerdWithinRunnerContainer` is `true` (which is hardcoded to be the case), properly
  mount the Docker storage volume into the runner container rather than the (non-existent) Docker sidecar container.

### Discussion

#### Scheduled overrides

Scheduled overrides allow you to set different min and max replica values for the runner autoscaler at different times.
This can be useful if you have predictable patterns of load on your runners. For example, you might want to scale down
to zero at night and scale up during the day. This feature is implemented by adding a `scheduled_overrides` field to the
`var.runners` map.

See the
[Actions Runner Controller documentation](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides)
for details on how they work and how to set them up.

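To illustrate the shape of the configuration (the key names here follow the upstream `scheduledOverrides` spec and are
hypothetical for this component; check the component's variable documentation for the exact schema), a weekend
scale-to-zero override might look something like:

```yaml
runners:
  default:
    # ... other runner settings ...
    scheduled_overrides:
      # Scale to zero from Saturday to Monday, repeating weekly
      # (hypothetical keys mirroring the upstream spec)
      - start_time: "2024-06-01T00:00:00Z" # a Saturday
        end_time: "2024-06-03T00:00:00Z"
        min_replicas: 0
        max_replicas: 0
        recurrence_rule:
          frequency: Weekly
```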
#### Use RAM instead of Disk via `tmpfs_enabled`

The standard `gp3` EBS volume used for an EC2 instance's disk storage is limited (unless you pay extra) to 3000 IOPS and
125 MB/s throughput. This is fine for average workloads, but it does not scale with instance size. A `.48xlarge`
instance could host 90 Pods, but all 90 would still be sharing the same single 3000 IOPS, 125 MB/s EBS volume attached
to the host. This can lead to severe performance issues, as the whole Node gets locked up waiting for disk I/O.

To mitigate this issue, we have added the `tmpfs_enabled` option to the `runners` map. When set to `true`, the runner
Pods will use RAM-backed ephemeral storage (`tmpfs`, `emptyDir.medium: Memory`) instead of disk-backed storage. This
means the Pod's impact on the Node's disk I/O is limited to the overhead required to launch and manage the Pod (e.g.
downloading the container image and writing logs to the disk). This can be a significant performance improvement,
allowing you to run more Pods on a single Node without running into disk I/O bottlenecks. Without this feature enabled,
you may be limited to running something like 14 Runners on an instance, regardless of instance size, due to disk I/O
limits. With this feature enabled, you may be able to run 50-100 Runners on a single instance.

The trade-off is that the Pod's data is stored in RAM, which increases its memory usage. Be sure to increase the amount
of memory allocated to the runner Pod to account for this. This is generally not a problem, as Runners typically use a
small enough amount of disk space that it can reasonably be stored in the RAM allocated to a single CPU in an EC2
instance, so it is the CPU that remains the limiting factor in how many Runners can be run on an instance.

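A sketch of enabling this, assuming a runner named `default` (the `resources` shape shown is illustrative; see the
component's variables for the exact keys):

```yaml
runners:
  default:
    tmpfs_enabled: true
    # With tmpfs, everything the runner writes to its working directories
    # counts against Pod memory, so raise the memory request accordingly
    resources:
      requests:
        memory: 4Gi
        cpu: "1"
```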
> [!WARNING]
>
> #### You must configure a memory request for the runner Pod
>
> When using `tmpfs_enabled`, you must configure a memory request for the runner Pod. If you do not, a single Pod would
> be allowed to consume half the Node's memory just for its disk storage.

#### Configure startup timeout via `wait_for_docker_seconds`

When the runner starts and Docker-in-Docker is enabled, the runner waits for the Docker daemon to be ready before
marking itself ready to run jobs. It does this by polling the Docker daemon every second until it is ready. The default
timeout is 120 seconds. If the Docker daemon is not ready within that time, the runner will exit with an error. You can
configure this timeout by setting `wait_for_docker_seconds` in the `runners` map.

As a general rule, the Docker daemon should be ready within a few seconds of the runner starting. However, particularly
when there are disk I/O issues (see the `tmpfs_enabled` feature above), the Docker daemon may take longer to respond.

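For example, to give the Docker daemon up to five minutes (the runner name `default` is hypothetical):

```yaml
runners:
  default:
    # Wait up to 300 seconds (default 120) for dockerd to become ready
    # before the runner gives up and exits with an error
    wait_for_docker_seconds: 300
```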
#### Add annotations to runner Pods once they start running a job

You can now configure the runner Pods to add annotations to themselves once they start running a job. The idea is to
allow idle Pods to be interrupted, but have them mark themselves as uninterruptible once they start running a job. This
is done by setting the `running_pod_annotations` field in the `runners` map. For example:

```yaml
running_pod_annotations:
  # Prevent Karpenter from evicting or disrupting the worker pods while they are running jobs.
  # As of Karpenter 0.37.0, this is not 100% effective due to race conditions.
  "karpenter.sh/do-not-disrupt": "true"
```
|
As noted in the comments above, this annotation was intended to prevent Karpenter from evicting or disrupting the worker
Pods while they are running jobs, while leaving Karpenter free to interrupt idle Runners. However, as of Karpenter
0.37.0, it is not 100% effective due to race conditions: Karpenter may decide to terminate the Node the Pod is running
on, but not signal the Pod before it accepts a job and starts running it. Without the availability of transactions or
atomic operations, this is a difficult problem to solve, and it will probably require a more complex solution than
simply adding annotations to the Pods. Nevertheless, this feature remains available for use in other contexts, as well
as in the hope that it will eventually work with Karpenter.
|
#### Bugfix: Deploy webhook secret when existing secret is not supplied

Because deploying secrets with Terraform causes the secrets to be stored unencrypted in the Terraform state file, we
give users the option of creating the configuration secret externally (e.g. via
[SOPS](https://github.com/getsops/sops)). Unfortunately, at some point in the past, when we enabled this option, we
broke this component insofar as the webhook secret was no longer being deployed when the user did not supply an existing
secret. This PR fixes that.
|
The consequence of this bug was that, because the webhook secret was not being deployed, the webhook did not reject
unauthorized requests. This could have allowed an attacker to trigger the webhook directly, mounting a denial-of-service
attack by killing jobs as soon as they were accepted from the queue. A more practical, if unintentional, consequence was
that if a repo webhook was installed alongside an org webhook, the guard against processing the same payload twice did
not work when one of the webhooks was missing the secret or had the wrong secret.
|
#### Bugfix: Restore proper order of operations in creating resources

In release 1.454.0 (PR [#1055](https://github.com/cloudposse/terraform-aws-components/pull/1055)), we reorganized the
RunnerDeployment template in the Helm chart to put the RunnerDeployment resource first, since it is the most important
resource, merely to improve readability. Unfortunately, the order of operations in creating resources matters, and this
change broke the deployment by deploying the RunnerDeployment before creating the resources it depends on. This PR
restores the proper order of operations.