Skip to content

Conversation

@courtneypacheco
Copy link
Contributor

@courtneypacheco courtneypacheco commented Feb 27, 2025

Integration Testing Evidence

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been added and/or updated, if applicable.
  • Unit tests have been added and/or updated. (If this is not applicable, please provide a justification.)
  • Integration testing has been performed, if applicable

Description of this Change

This action is used to launch an EC2 instance (either as a spot instance or a dedicated instance) in a desired availability zone. If no availability, then backup availability zones will be tried until one is successful.

I've tested this action in the instructlab/instructlab repo by creating a draft PR there that modifies the Large E2E job workflow to reference this proposed action. (See link above.)

Design Considerations

I thought about creating a "single" CI action that is capable of both launching and stopping instances. After all, the base machulav/ec2-runner GitHub action that I leverage under the hood can be used to launch and terminate instances. In the case of that particular GitHub action, users can simply set the mode parameter to either start or stop to indicate if they want to launch or terminate an instance.

Ultimately, I decided against implementing a mode in this PR because I felt that going that route would add complexity to the existing code in this PR, increase the testing surface required, etc.. I do feel that adding a mode would certainly be a useful enhancement for this CI action, but I see that as an enhancement and not a necessity for the initial work. I can certainly file a follow-up issue to add this enhancement, though.

Integration Testing Results

Dedicated Instance Testing

To test the dedicated instance logic, I set try_spot_instance_first = false. This variable tells the CI action to not bother trying to launch as a spot instance. (This is useful if you can't afford to have your instance reclaimed by AWS.

Below are screenshots from the output of a successful run where the CI job launches a dedicated EC2 instance. You can see the failures the CI action encounters when trying to launch in an availability zone that lacks capacity for the desired instance type, but that those failures don't abort the workflow
Screenshot 2025-03-04 at 8 25 26 PM

Screenshot 2025-03-04 at 8 27 15 PM

I then confirmed that the instance was actually cleaned up in AWS. (i.e., we don't have any dangling resources)

Spot Instance Testing

To test the spot instance logic, I set try_spot_instance_first = true, which tells the action to try launching an EC2 spot instance before attempting to launch it as a dedicated instance.

Below is a screenshot of the test run. The spot instance was reclaimed by AWS because AWS needed the capacity back, hence the failures:
Screenshot 2025-03-04 at 8 35 50 PM

See the ! symbols next to the E2E steps that indicate the build was aborted. (I forgot to screenshot evidence on the AWS side).
Screenshot 2025-03-04 at 8 39 13 PM

For clarity, the Stop EC2 Runner step failed because the spot instance was deleted (reclaimed) in AWS, and therefore, there was nothing to actually "stop":
Screenshot 2025-03-04 at 8 40 03 PM

@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch 4 times, most recently from 292166b to bcafdf8 Compare February 27, 2025 14:57
@mergify mergify bot added CI/CD Affects CI/CD configuration documentation Improvements or additions to documentation ci-failure labels Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from bcafdf8 to a5896d5 Compare February 27, 2025 15:05
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from a5896d5 to f41dfe8 Compare February 27, 2025 15:22
@mergify mergify bot removed the ci-failure label Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from f41dfe8 to d96da09 Compare February 27, 2025 15:32
@mergify mergify bot added the ci-failure label Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from d96da09 to 4cfb809 Compare February 27, 2025 15:38
@mergify mergify bot added testing Relates to testing ci-failure and removed ci-failure labels Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from 4cfb809 to f9f4e55 Compare February 27, 2025 15:40
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from f9f4e55 to 28d5400 Compare February 27, 2025 15:51
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from 28d5400 to 8908abd Compare February 27, 2025 15:55
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from 8908abd to 692cc80 Compare February 27, 2025 16:13
@mergify mergify bot removed the ci-failure label Feb 27, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from 692cc80 to 86dac70 Compare February 27, 2025 17:22
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from f2bd398 to 8f2471a Compare March 4, 2025 00:13
@mergify mergify bot removed the ci-failure label Mar 4, 2025
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch 7 times, most recently from a01eabd to c009879 Compare March 4, 2025 23:20
@courtneypacheco courtneypacheco marked this pull request as ready for review March 5, 2025 02:02
Copy link
Member

@danmcp danmcp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!


| Name | Description | Example value |
| --- | --- | --- |
| `try_spot_instance_first` | If set to "true", then the EC2 instance will be launched as a spot instance rather than a dedicated EC2 instance. If a spot instance cannot be launched in any of the desired availability zones (e.g., due to insufficient capacity on AWS), then a dedicated instance will be tried next. (Note: This option is not always desirable for certain instance types.) | `true` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a request to change this now, but some cases might want the opposite of use spot instances as a fallback. This can be helpful when cost isn't an issue but availability is. So on demand is preferred, but if spot instances are all that's available, they are better than nothing.

@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from c009879 to 950ed3e Compare March 5, 2025 11:18
@mergify mergify bot added the one-approval label Mar 5, 2025
Copy link
Contributor

@ktdreyer ktdreyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some jq notes that do not block merging this:

You don't have to cat a file into jq, because jq can read files directly.

jq has a "-r" flag to print strings to stdout without quoting, so you don't need to pipe to tr -d '"' at the end.

@ktdreyer ktdreyer added the hold In-progress PR. Tag should be removed before merge. label Mar 5, 2025
@mergify mergify bot removed the one-approval label Mar 5, 2025
Copy link
Contributor

@ktdreyer ktdreyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Courtney was going to update those changes I referenced, so I'll rescind my approval so that Mergify does not auto-merge.

@ktdreyer ktdreyer removed the hold In-progress PR. Tag should be removed before merge. label Mar 5, 2025
@mergify mergify bot added the one-approval label Mar 5, 2025
This action is used to launch an EC2 instance (either as a spot instance or a dedicated instance) in a desired availability zone. If no availability, then backup availability zones will be tried until one is successful.

Signed-off-by: Courtney Pacheco <[email protected]>
@courtneypacheco courtneypacheco force-pushed the launch-ec2-runner-with-fallback branch from 950ed3e to 38de5b6 Compare March 5, 2025 18:48
@mergify mergify bot removed the one-approval label Mar 5, 2025
@courtneypacheco courtneypacheco merged commit 1cb24fa into main Mar 5, 2025
11 checks passed
@courtneypacheco courtneypacheco deleted the launch-ec2-runner-with-fallback branch March 5, 2025 20:51
@courtneypacheco courtneypacheco added this to the v0.1.0 milestone Mar 6, 2025
@courtneypacheco courtneypacheco added the enhancement New feature or request label Mar 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI/CD Affects CI/CD configuration documentation Improvements or additions to documentation enhancement New feature or request testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants