Skip to content
This repository was archived by the owner on Aug 6, 2025. It is now read-only.

Initial implementation: auto-shutdown EC2 GHA runner#1

Closed
ryan-williams wants to merge 13 commits intomainfrom
dev
Closed

Initial implementation: auto-shutdown EC2 GHA runner#1
ryan-williams wants to merge 13 commits intomainfrom
dev

Conversation

@ryan-williams
Copy link
Copy Markdown
Member

@ryan-williams ryan-williams commented Jul 21, 2025

Uses Open-Athena/ec2-gha#1, Open-Athena/gha-runner#1.

Speeds up Ocean Emulator test-gpu job by 2x (14mins → 7mins, thanks to skipping standalone 7min "shut down instance" job; see #308), and removes need to explicitly declare a shutdown job.

Copy link
Copy Markdown

@mihasya mihasya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question

3. Create a label named `gpu` (or your custom name)
4. Maintainers can apply this label to PRs to authorize GPU runs

## Minimal Example
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add a link to ec2-runner-demo? I assume there was a reason for putting the demo in a separate repo, but in case there wasn't.. can we just move it in here? That would obviate the duplicative AI-generated README as well.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pt/q.

  • OA/ec2-runner-demo is currently private.
  • I would like to make it public (or inline its example .ymls in this repo, as you suggest).
  • However, there are security implications re: having our AWS_ROLE secret exist on public repos (it currently only exists on 2 private OA repos).
    • I'm pretty confident it's safe, provided we also set "Require approval for all external contributors" under /settings/actions.
    • I want you / @jder / @alxmrs to confirm you agree, though.

The problem with relying solely on "Require approval for all external contributors" is that contributors are granted GHA RCE, indefinitely, once approval has been granted for one commit on one PR.

Instead, we should never approve workflows on external contributors' PRs, but instead trigger workflow runs on them ourselves:

  1. Push a tmp branch pointing at user's commit.
  2. workflow_dispatch from that branch.

This could even be automated / GitOps'd.

LMK your thoughts about this! 🙏

If curious: rejected label-based approach

I also prototyped a version where we'd require a label (e.g. "gpu-ok") to be set on a PR, before the jobs would run. I even had this job remove the label just before it ran (after the check), so we'd have to add the label each time we wanted to run workflows on an external contributor's PR. I decided that was overkill for now, and realized the "preferred approach" above is possible/better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's fine to have them with a public repo so long as we require approvals on workflow runs. But I think there are a few different options for that:
Screenshot 2025-07-24 at 12 47 27 PM

The second one sounds like the indefinite permission you describe, but the last one sounds like it requires approval every time. Is that not the case?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion, I think you're right that the 3rd option seems to require permission every time an external contributor attempts to run a workflow. I will verify that using my personal account but, until I do, the protocol can be to never grant approval (and instead workflow_dispatch ourselves, from external contributors' branches, if it ever comes up).

Copy link
Copy Markdown
Member

@alxmrs alxmrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a cool piece of infrastructure! Just added one bash idea. This PR was a pleasure to read, good work.

Copy link
Copy Markdown
Member

@jder jder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I am excited to get this landed. Mostly 🐑 nits from me and some questions/suggestions about how this is configured.

# EC2_KEY_NAME - Default SSH key pair name
# EC2_SECURITY_GROUP_ID - Default security group ID
#
# Priority: inputs > vars > defaults
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐑 🐑 🐑 just my 2 cents that having these passed explicitly in action inputs is probably better than having them be read from vars. (In the same way that I preferred passing secrets explicitly)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me; my thought was this priority-cascade leaves flexibility for callers to specify defaults and overrides, e.g.:

  • Other orgs might use this action, and set a different org-wide default EC2_INSTANCE_TYPE
  • SECURITY_GROUP_ID might be best set as an org-level variable (by ops/Pulumi, in our case).
    • Otherwise, people wanting to SSH into an EC2 runner instance can just look up the sg ID and pass it as an input; that can be fine too.

Do you think we should remove "variables" as part of the cascade?

required: false
type: number
default: 15
ssh_pubkey:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit surprising that this doesn't override the secret like the other inputs + env vars? Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit surprising that this doesn't override the secret like the other inputs + env vars?

Good catch, this should be cleaned up now that secrets.SSH_PUBKEY is gone, we just have {inputs,vars}.ssh_pubkey.

Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?

They are essentially redundant, but I can imagine some users preferring one or the other, so implemented both. If you think one is clearly what we should funnel users to, lmk, otherwise it's a convenience/feature to support both.

Copy link
Copy Markdown
Member Author

@ryan-williams ryan-williams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I responded to everything, going to mark a few as "resolved", feel free to do so on any others as appropriate, ty!

# EC2_KEY_NAME - Default SSH key pair name
# EC2_SECURITY_GROUP_ID - Default security group ID
#
# Priority: inputs > vars > defaults
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me; my thought was this priority-cascade leaves flexibility for callers to specify defaults and overrides, e.g.:

  • Other orgs might use this action, and set a different org-wide default EC2_INSTANCE_TYPE
  • SECURITY_GROUP_ID might be best set as an org-level variable (by ops/Pulumi, in our case).
    • Otherwise, people wanting to SSH into an EC2 runner instance can just look up the sg ID and pass it as an input; that can be fine too.

Do you think we should remove "variables" as part of the cascade?

required: false
type: number
default: 15
ssh_pubkey:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit surprising that this doesn't override the secret like the other inputs + env vars?

Good catch, this should be cleaned up now that secrets.SSH_PUBKEY is gone, we just have {inputs,vars}.ssh_pubkey.

Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?

They are essentially redundant, but I can imagine some users preferring one or the other, so implemented both. If you think one is clearly what we should funnel users to, lmk, otherwise it's a convenience/feature to support both.

3. Create a label named `gpu` (or your custom name)
4. Maintainers can apply this label to PRs to authorize GPU runs

## Minimal Example
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion, I think you're right that the 3rd option seems to require permission every time an external contributor attempts to run a workflow. I will verify that using my personal account but, until I do, the protocol can be to never grant approval (and instead workflow_dispatch ourselves, from external contributors' branches, if it ever comes up).

README.md Outdated

In your GitHub organization settings, create these secrets:
- `AWS_ROLE`: ARN of your AWS IAM role (e.g., `arn:aws:iam::123456789012:role/GitHubActionsRole`)
- `GH_SA_TOKEN`: GitHub token with permissions to manage self-hosted runners
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. I've reworked this to remove the implication that GH_SA_TOKEN and AWS_ROLE (now a variable) need to be set at the org level.

I've also included in this README some sample Pulumi+Python code for creating the OIDC connection and AWS_ROLE, based on our internal code.

Somewhat frustratingly, "self-hosted runners (read and write)" permission isn't enough for gha-runner, the token needs "repo admin (read and write)." Per latest README language:

2. Configure Secrets and Variables

Required Secret: GH_SA_TOKEN

This workflow requires a GitHub token with admin permissions to the repo it's run within, because the underlying gha-runner calls /actions/runners/registration-token, whose docs state:

Authenticated users must have admin access to the repository to use this endpoint.

Required Variable (or pass as input):

  • AWS_ROLE: ARN of your AWS IAM role (e.g. arn:aws:iam::123456789012:role/GitHubActionsRole)

Set this in your GitHub organization or repository settings by e.g.:

gh variable set AWS_ROLE --body "arn:aws:iam::123456789012:role/GitHubActionsRole"

@ryan-williams ryan-williams requested a review from jder July 28, 2025 15:57
ryan-williams added a commit to Open-Athena/ec2-gha-demo that referenced this pull request Jul 28, 2025
Will revert to `main` or `v1` before merging (once Open-Athena/ec2#1 lands)
@ryan-williams ryan-williams requested a review from alxmrs July 28, 2025 18:51
@ryan-williams
Copy link
Copy Markdown
Member Author

Per discussion with @alxmrs yesterday, @jder wdyt about renaming this repo and workflow, so that usage would change to:

-  uses: Open-Athena/ec2/.github/workflows/runner.yml@v1
+  uses: Open-Athena/runners/.github/workflows/ec2.yml@v1

Pros:

  • Runners for other clouds can go alongside this one (e.g. {sky,lambda,gce,…}.yml)
  • More descriptive repo basename, without redundancy in calling path

@ryan-williams
Copy link
Copy Markdown
Member Author

My new plan is to fold this repo (and functionality from this PR) into OA/ec2-gha (see Open-Athena/ec2-gha#2).

That repo was formerly "start-aws-gha-runner", but I've made it more general (and renamed it accordingly). I don't think we need this separate repo now, I'll probably archive it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants