feat(lambda): publish retry message from scale up if runner creation fails #4605
Note
I suggest reviewing the PR with whitespace changes hidden. It makes the diff much smaller and it highlights the actual change being proposed here better.
See https://github.com/github-aws-runners/terraform-aws-github-runner/pull/4605/files?diff=split&w=1
Overview
Currently, the retry messages are only published by the scale up lambda when the runner creation succeeds. I propose to also publish them when the runner creation fails and when the runner creation is skipped due to the maximum runner cap being reached.
Motivation
The runner creation, i.e. a call to createRunners, can fail for a number of reasons. For example, it can fail due to exceeding the AWS service quotas. In such a case, we'd like the retry message to be published so that the runner creation can be retried in the future. Otherwise, in the ephemeral runners mode, we risk facing runner starvation.
Similarly, when runners are not created due to their limits being reached, we'd like to retry the creation in the future when some of the running runners will have had a chance to shut down and make space for the new ones.
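The intended control flow can be illustrated with a minimal sketch. This is not the code from the diff: the ScaleUpMessage type, the scaleUpWithRetry function, its parameters, and the injected publishRetry callback are all hypothetical names chosen for illustration, and only createRunners is a name taken from the description above (its signature here is assumed). The point is that the retry message is published on all three paths: success, failure, and skip due to the runner cap.

```typescript
// Hypothetical message shape; the real scale up lambda uses its own types.
interface ScaleUpMessage {
  id: number;
  retryCounter?: number;
}

// Assumed signatures, injected so the sketch stays self-contained.
type CreateRunners = (count: number) => Promise<string[]>;
type PublishRetry = (message: ScaleUpMessage) => Promise<void>;

type Outcome = 'created' | 'skipped' | 'failed';

async function scaleUpWithRetry(
  message: ScaleUpMessage,
  currentRunners: number,
  maximumRunners: number,
  createRunners: CreateRunners,
  publishRetry: PublishRetry,
): Promise<Outcome> {
  let outcome: Outcome;

  if (currentRunners >= maximumRunners) {
    // Cap reached: no runner is created, but a retry is still wanted
    // so the job can be picked up once running runners shut down.
    outcome = 'skipped';
  } else {
    try {
      await createRunners(1);
      outcome = 'created';
    } catch {
      // Creation failed, e.g. because an AWS service quota was exceeded.
      // This sketch swallows the error; real code may log or rethrow it.
      outcome = 'failed';
    }
  }

  // Previously only reached on success; now reached on every path.
  await publishRetry(message);
  return outcome;
}
```

Passing createRunners and publishRetry as arguments is a choice made for this sketch only, so that each path can be exercised with simple stubs.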
Testing
I added unit tests covering the new behaviour and ran them locally. I have not run the updated lambda in production.