
Conversation

npalm (Member) commented Dec 6, 2025

  • Process events batch-wise (scale-up)
  • Update the AWS provider to 6.21+ and Node.js to 24
  • Remove deprecated variables from Terraform

npalm requested review from a team as code owners December 6, 2025 18:37
github-actions bot (Contributor) commented Dec 6, 2025

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

…ingle invocation (#4603)

Currently we restrict the `scale-up` Lambda to only handle a single
event at a time. In very busy environments this can prove to be a
bottleneck: there are calls to GitHub and AWS APIs that happen each
time, and they can end up taking long enough that we can't process job
queued events faster than they arrive.

In our environment we are also using a pool, and typically we have
responded to the alerts generated by this (SQS queue length growing) by
expanding the size of the pool. This helps because we will more
frequently find that we don't need to scale up, which allows the Lambdas
to exit a bit earlier, so we can get through the queue faster. But it
makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 `CreateFleet`
call to create instances, after working out how many are needed. This is
a job that can be batched. We can take any number of events, calculate
the diff between our current state and the number of jobs we have,
capping at the maximum, and then issue a single call.
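
For illustration, a minimal sketch of that batching calculation, assuming a hypothetical event shape and launch template name (the real module derives both from the SQS records and the Terraform configuration):

```typescript
import { CreateFleetCommand, EC2Client } from '@aws-sdk/client-ec2';

// Hypothetical event shape; the real module parses these from SQS records.
interface JobQueuedEvent {
  id: string;
}

const ec2 = new EC2Client({});

// Aggregate a whole batch of events into one CreateFleet call: work out
// how many instances the queued jobs need on top of the current state,
// cap at the configured maximum, and request the difference in one go.
async function scaleUpBatch(
  events: JobQueuedEvent[],
  currentInstances: number,
  maxInstances: number,
): Promise<void> {
  const desired = Math.min(currentInstances + events.length, maxInstances);
  const toCreate = desired - currentInstances;
  if (toCreate <= 0) return; // existing capacity (e.g. a pool) already suffices

  await ec2.send(
    new CreateFleetCommand({
      Type: 'instant',
      TargetCapacitySpecification: {
        TotalTargetCapacity: toCreate,
        DefaultTargetCapacityType: 'on-demand',
      },
      LaunchTemplateConfigs: [
        {
          LaunchTemplateSpecification: {
            LaunchTemplateName: 'github-runner', // placeholder name
            Version: '$Default',
          },
        },
      ],
    }),
  );
}
```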

The thing to be careful about is how to handle partial failures, where EC2
creates some of the instances we wanted but not all of them. Lambda has
a configurable function response type which can be set to
`ReportBatchItemFailures`. In this mode, we return a list of failed
messages from our handler and those are retried. We can make use of this
to give back as many events as we failed to process.
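
As a sketch of that handler contract, using the standard `aws-lambda` typings and with `processBatch` as a stand-in for the real scale-up logic:

```typescript
import type { SQSBatchResponse, SQSEvent, SQSRecord } from 'aws-lambda';

// Stand-in for the real scale-up logic: returns the records that could
// not be turned into instances, e.g. after a partial CreateFleet failure.
declare function processBatch(records: SQSRecord[]): Promise<SQSRecord[]>;

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const failed = await processBatch(event.Records);
  // With ReportBatchItemFailures enabled on the event source mapping,
  // only the message IDs listed here are retried; the rest of the batch
  // is deleted from the queue as successfully processed.
  return {
    batchItemFailures: failed.map((record) => ({
      itemIdentifier: record.messageId,
    })),
  };
}
```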

Now that we're potentially processing multiple events in a single Lambda
invocation, one thing we should optimise for is not recreating GitHub API
clients. We
need one client for the app itself, which we use to find out
installation IDs, and then one client for each installation which is
relevant to the batch of events we are processing. This is done by
creating a new client the first time we see an event for a given
installation.
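
A minimal sketch of that caching, with `createInstallationClient` as a hypothetical factory standing in for the module's GitHub App auth code:

```typescript
import { Octokit } from '@octokit/rest';

// Hypothetical factory that exchanges an installation ID for an
// installation-scoped client; the real module handles App auth here.
declare function createInstallationClient(installationId: number): Promise<Octokit>;

const clients = new Map<number, Octokit>();

// Return a cached client if we have already seen this installation in
// the current batch; otherwise create one and remember it.
async function clientFor(installationId: number): Promise<Octokit> {
  let client = clients.get(installationId);
  if (client === undefined) {
    client = await createInstallationClient(installationId);
    clients.set(installationId, client);
  }
  return client;
}
```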

We also remove the same `batch_size = 1` constraint from the `job-retry`
Lambda. This Lambda is used to retry events that previously failed.
However, instead of reporting failures to be retried, here we maintain
the pre-existing fault-tolerant behaviour where errors are logged but
explicitly do not cause message retries, avoiding infinite loops from
persistent GitHub API issues or malformed events.
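
Sketched below with a hypothetical `retryJob` and logger, the contrast with scale-up is that errors are caught per message and never surfaced to Lambda:

```typescript
import type { SQSEvent, SQSRecord } from 'aws-lambda';

// Hypothetical stand-ins for the module's retry logic and logger.
declare function retryJob(record: SQSRecord): Promise<void>;
declare const logger: { warn: (message: string, error: unknown) => void };

// Errors are logged per message but never rethrown, so SQS never retries
// and a persistently failing message cannot loop forever.
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    try {
      await retryJob(record);
    } catch (error) {
      logger.warn(`Failed to retry job for message ${record.messageId}`, error);
    }
  }
}
```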

Tests are added for all of this.

Tests in a private repo (sorry) look good. This was running ephemeral
runners with no pool, SSM high throughput enabled, the job queued check
*dis*abled, a batch size of 200, and a wait time of 10 seconds. The workflow
runs are each a matrix with 250 jobs.


![image](https://github.com/user-attachments/assets/0a656e99-8f1e-45e2-924b-0d5c1b6d6afb)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Upgrade to Lambda runtime Node 24.x

- Upgrade the minimum required AWS Terraform provider version
- Upgrade all Lambda runtimes by default to Node 24.x
- Breaking change!


## Dependency and environment upgrades

* Updated all references to Node.js from version 22 to version 24 in
GitHub Actions workflows (`.github/workflows/lambda.yml`,
`.github/workflows/release.yml`) and Dockerfiles (`.ci/Dockerfile`,
`.devcontainer/Dockerfile`).
[[1]](diffhunk://#diff-b0732b88b9e5a3df48561602571a10179d2b28cbb21ba8032025932bc9606426L23-R23)
[[2]](diffhunk://#diff-87db21a973eed4fef5f32b267aa60fcee5cbdf03c67fafdc2a9b553bb0b15f34L33-R33)
[[3]](diffhunk://#diff-fd0c8401badda82156f9e7bd621fa3a0e586d8128e4a80af17c7cbff70bee11eL2-R2)
[[4]](diffhunk://#diff-13bd9d7a30bf46656bc81f1ad5b908a627f9247be3f7d76df862b0578b534fc6L1-R1)
* Upgraded the base Docker images for both the CI and devcontainer
environments to use newer Node.js image digests.
[[1]](diffhunk://#diff-fd0c8401badda82156f9e7bd621fa3a0e586d8128e4a80af17c7cbff70bee11eL2-R2)
[[2]](diffhunk://#diff-13bd9d7a30bf46656bc81f1ad5b908a627f9247be3f7d76df862b0578b534fc6L1-R1)

## Terraform provider updates

* Increased the minimum required version for the AWS Terraform provider
to `>= 6.21` in all example `versions.tf` files.
[[1]](diffhunk://#diff-61160e0ae9e70de675b98889710708e1a9edcccd5194a4a580aa234caa5103aeL5-R5)
[[2]](diffhunk://#diff-debb96ea7aef889f9deb4de51c61ca44a7e23832098e1c9d8b09fa54b1a96582L5-R5)
* Updated the `.terraform.lock.hcl` files in all examples to lock the
AWS provider at version `6.22.1`, the local provider at `2.6.1`, and the
null provider at `3.2.4` where applicable, along with updated hash
values and constraints.
[[1]](diffhunk://#diff-101dfb4a445c2ab4a46c53cbc92db3a43f3423ba1e8ee7b8a11b393ebe835539L5-R43)
[[2]](diffhunk://#diff-2a8b3082767f86cfdb88b00e963894a8cdd2ebcf705c8d757d46b55a98452a6cL5-R43)

---------

Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
Co-authored-by: Guilherme Caulada <[email protected]>
guicaulada (Contributor) left a comment

Hi! Are there any other commits you plan to add to this PR besides those 2? Or is it ready for review?

npalm (Member, Author) commented Dec 8, 2025

> Hi! Are there any other commits you plan to add to this PR besides those 2? Or is it ready for review?

Potentially I will merge #4929, but that one can also go later this week. And I prefer we get rid of the deprecated variables, but I was not able to get the PR out.

npalm and others added 2 commits December 9, 2025 18:46
This pull request removes support for several deprecated AMI-related
variables across all modules, fully migrating the configuration to the
consolidated `ami` object. This change simplifies how AMI settings are
managed, improves consistency, and reduces confusion for users. All
references to the old variables (`ami_filter`, `ami_owners`,
`ami_id_ssm_parameter_name`, `ami_kms_key_arn`) have been removed from
module inputs, outputs, templates, documentation, and internal logic.

**Migration to consolidated AMI configuration:**

* Removed all deprecated AMI variables (`ami_filter`, `ami_owners`,
`ami_id_ssm_parameter_name`, `ami_kms_key_arn`) from module variable
definitions, outputs, and internal usage in `variables.tf`,
`outputs.tf`, and related files.
[[1]](diffhunk://#diff-05b5a57c136b6ff596500bcbfdcff145ef6cddea2a0e86d184d9daa9a65a288eL396-L424)
[[2]](diffhunk://#diff-23e8f44c0f21971190244acdb8a35eaa21af7578ed5f1b97bef83f1a566d979cL138-L165)
[[3]](diffhunk://#diff-de6c47c2496bd028a84d55ab12d8a4f90174ebfb6544b8b5c7b07a7ee4f27ec7L78-L90)
[[4]](diffhunk://#diff-2daea3e8167ce5d859f6f1bee08138dbe216003262325490e8b90477277c104aL70-L89)
[[5]](diffhunk://#diff-52d0673ff466b6445542e17038ea73a1cf41b8112f49ee57da4cebf8f0cb99c5L73-R73)
[[6]](diffhunk://#diff-52d0673ff466b6445542e17038ea73a1cf41b8112f49ee57da4cebf8f0cb99c5L186-L187)
[[7]](diffhunk://#diff-951f6bd1e32c3d27dd90e2dfb1f5232a704ef01fd925f3ee4323d6adc2dcdf5aL15-L20)
[[8]](diffhunk://#diff-3937b99021390c0192952207dd2e26a409e0c03446478fb09ac3cd360bb60ee5L9-L14)
* Updated example and documentation files to use the new `ami` object
structure, replacing previous usage of the deprecated variables.
[[1]](diffhunk://#diff-ef2038e9f8d807236d2acebe3c3a191039f8021cc4a0188f4778de908f0d453bL36-R40)
[[2]](diffhunk://#diff-ef2038e9f8d807236d2acebe3c3a191039f8021cc4a0188f4778de908f0d453bL52-R57)
[[3]](diffhunk://#diff-0a0d2ecd774e69a1397a913b6230f45692b49c0b17ccb103d318f6ab078353e2L48-R51)
[[4]](diffhunk://#diff-b2b9df08c45240d599f6260d246bad6e67129932174131db209341d8464247a8L18-R19)
[[5]](diffhunk://#diff-61032a0bb5f9d7ae65ba5155b2e58e12901f39bb7068f16b419d94c6f7a5b922L86-R89)
* Refactored module runner logic to only use the new `ami` object,
removing all fallback and compatibility code for the old variables.
[[1]](diffhunk://#diff-dc46acf24afd63ef8c556b77c126ccc6e578bc87e3aa09a931f33d9bf2532fbbL182-L185)
[[2]](diffhunk://#diff-57f00cdf57feef92ffcb35d6618e62e14ad652d5d520f331068332d5f3cade51L30-L32)
[[3]](diffhunk://#diff-e9624b388e62ca51cf1fe073eb0919588e8c36a1143ecdb49580996a89f13bebL40-R51)
* Updated internal references to AMI SSM parameter names and related
policies to use the new configuration, ensuring all resource and
environment variable logic is aligned with the consolidated approach.
[[1]](diffhunk://#diff-bc00c0efa92f360635d026350da2fb775718514e2b1ae718281400e661b7469bL13-R13)
[[2]](diffhunk://#diff-bc00c0efa92f360635d026350da2fb775718514e2b1ae718281400e661b7469bL28-R28)
[[3]](diffhunk://#diff-5921c9e3315946068538b290966e7e4e51b6e49d04c2466b0bdd4b298629b29dL56-R57)
[[4]](diffhunk://#diff-a598ba79d09e4770d55ed09e6b1d51e68c4a54562a3e3cbb46619d625a609d23L28-R28)
[[5]](diffhunk://#diff-a598ba79d09e4770d55ed09e6b1d51e68c4a54562a3e3cbb46619d625a609d23L151-R151)

With these updates, all AMI configuration is now handled through the
unified `ami` object, making runner setup more straightforward and
future-proof.


## Tested

- [x] default example
- [x] multi runner example

---------

Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
Co-authored-by: Copilot <[email protected]>
## Description

This PR adds support for configuring EC2 placement groups for GitHub
Actions runners in the multi-runner module. It plumbs a new placement
option from the runner configuration through to the underlying EC2
runner module.

## Details

Updated `modules/multi-runner/runners.tf` to pass `placement =
each.value.runner_config.placement` into the runners module. This allows
specifying AWS placement groups for EC2 runners, enabling tighter control
over instance placement. The change is backwards compatible: if `placement`
is unset in `runner_config`, behavior remains unchanged.

## Motivation / Future work

Placement groups are a prerequisite for supporting macOS runners, which
require a `host_id`. A follow-up PR will add explicit macOS support
leveraging this new placement wiring.

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niek Palm <[email protected]>
npalm requested a review from guicaulada December 10, 2025 15:42
guicaulada (Contributor) left a comment

The PR description is missing:

  • Add support to use placement group in launch template

But given each PR was individually reviewed, I'm confident this can be approved.
