Initial implementation: auto-shutdown EC2 GHA runner by ryan-williams · Pull Request #1 · Open-Athena/ec2

ryan-williams · 2025-07-21T17:52:36Z

Uses Open-Athena/ec2-gha#1, Open-Athena/gha-runner#1.

Speeds up the Ocean Emulator test-gpu job 2x (14 min → 7 min, by skipping the standalone 7-minute "shut down instance" job; see #308), and removes the need to explicitly declare a shutdown job.

mihasya

One question

.github/workflows/runner.yml

mihasya · 2025-07-22T17:46:38Z

README.md

+3. Create a label named `gpu` (or your custom name)
+4. Maintainers can apply this label to PRs to authorize GPU runs
+
+## Minimal Example


Why not add a link to ec2-runner-demo? I assume there was a reason for putting the demo in a separate repo, but in case there wasn't.. can we just move it in here? That would obviate the duplicative AI-generated README as well.

Good pt/q.

OA/ec2-runner-demo is currently private.

I would like to make it public (or inline its example .ymls in this repo, as you suggest).

However, there are security implications re: having our AWS_ROLE secret exist on public repos (it currently only exists on 2 private OA repos).

I'm pretty confident it's safe, provided we also set "Require approval for all external contributors" under /settings/actions.

I want you / @jder / @alxmrs to confirm you agree, though.

The problem with relying solely on "Require approval for all external contributors" is that contributors are granted GHA RCE, indefinitely, once approval has been granted for one commit on one PR.

We should never approve workflows on external contributors' PRs, and instead trigger workflow runs on them ourselves:

1. Push a tmp branch pointing at the user's commit.

2. workflow_dispatch from that branch.

This could even be automated / GitOps'd.
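
A minimal sketch of those two steps, assuming a local clone whose `origin` remote is the upstream repo and an authenticated `gh` CLI. The function name `dispatch_external_pr` and the `tmp/pr-<number>` branch naming are hypothetical, and the target workflow is assumed to declare an `on: workflow_dispatch` trigger.

```shell
# Sketch only: dispatch_external_pr and the tmp/pr-<number> naming are
# illustrative assumptions, not part of this PR.
# Usage: dispatch_external_pr <pr-number> <workflow-file>
dispatch_external_pr() {
  local pr="$1" workflow="$2"
  local branch="tmp/pr-${pr}"  # hypothetical naming scheme
  local sha

  # GitHub publishes every PR's head commit as refs/pull/<number>/head.
  git fetch origin "refs/pull/${pr}/head" || return 1
  sha="$(git rev-parse FETCH_HEAD)" || return 1

  # Review the commit before this point: after the push it lives on a
  # branch in this repo and can be dispatched by maintainers.
  git push origin "${sha}:refs/heads/${branch}" || return 1

  # Run the workflow as defined on the temporary branch.
  gh workflow run "${workflow}" --ref "${branch}"
}
```

Usage would look like `dispatch_external_pr 123 test-gpu.yml` (hypothetical PR number and workflow file). The temporary branch can be removed afterwards with `git push origin --delete tmp/pr-123`.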

LMK your thoughts about this! 🙏

If curious: rejected label-based approach

I also prototyped a version where we'd require a label (e.g. "gpu-ok") to be set on a PR, before the jobs would run. I even had this job remove the label just before it ran (after the check), so we'd have to add the label each time we wanted to run workflows on an external contributor's PR. I decided that was overkill for now, and realized the "preferred approach" above is possible/better.
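
For reference, the check-then-remove behavior described above might look roughly like this when driven from the `gh` CLI. This is a sketch, not the prototype's actual code: the function name `consume_label` is hypothetical, and it assumes `gh` is authenticated with permission to edit PR labels.

```shell
# Sketch only: consume_label is a hypothetical helper name.
# Usage: consume_label <pr-number> <label>
# Succeeds (and removes the label) only if the PR currently carries it.
consume_label() {
  local pr="$1" label="$2"

  # List the PR's label names, one per line, and look for an exact match.
  if gh pr view "$pr" --json labels --jq '.labels[].name' | grep -Fxq "$label"; then
    # Remove the label so it has to be re-applied before the next run.
    gh pr edit "$pr" --remove-label "$label"
  else
    echo "PR #${pr} lacks label '${label}'; refusing to run" >&2
    return 1
  fi
}
```

A job would call something like `consume_label 123 gpu-ok` (hypothetical PR number) as its first step and stop if it fails.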

I agree that it's fine to have them with a public repo so long as we require approvals on workflow runs. But I think there are a few different options for that:

The second one sounds like the indefinite permission you describe, but the last one sounds like it requires approval every time. Is that not the case?

Per offline discussion, I think you're right that the 3rd option seems to require permission every time an external contributor attempts to run a workflow. I will verify that using my personal account but, until I do, the protocol can be to never grant approval (and instead workflow_dispatch ourselves, from external contributors' branches, if it ever comes up).

.github/workflows/runner.yml

alxmrs

What a cool piece of infrastructure! Just added one bash idea. This PR was a pleasure to read, good work.

.github/workflows/runner.yml

README.md

jder

Thanks, I am excited to get this landed. Mostly 🐑 nits from me and some questions/suggestions about how this is configured.

jder · 2025-07-23T23:24:33Z

.github/workflows/runner.yml

+#   EC2_KEY_NAME - Default SSH key pair name
+#   EC2_SECURITY_GROUP_ID - Default security group ID
+#
+# Priority: inputs > vars > defaults


🐑 🐑 🐑 just my 2 cents that having these passed explicitly in action inputs is probably better than having them be read from vars. (In the same way that I preferred passing secrets explicitly)

Sounds reasonable to me; my thought was this priority-cascade leaves flexibility for callers to specify defaults and overrides, e.g.:

Other orgs might use this action, and set a different org-wide default EC2_INSTANCE_TYPE

SECURITY_GROUP_ID might be best set as an org-level variable (by ops/Pulumi, in our case).

Otherwise, people wanting to SSH into an EC2 runner instance can just look up the sg ID and pass it as an input; that can be fine too.

Do you think we should remove "variables" as part of the cascade?
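
The `inputs > vars > defaults` cascade under discussion can be illustrated with shell parameter expansion. This is only an analogy: the helper `resolve_setting` and the instance-type values are hypothetical, and the workflow itself presumably expresses the same fallback in GitHub Actions expression syntax rather than in bash.

```shell
# Sketch only: resolve_setting is a hypothetical helper.
# Usage: resolve_setting <input-value> <variable-value> <default-value>
# Prints the first non-empty value, in priority order inputs > vars > defaults.
resolve_setting() {
  local input="${1:-}" variable="${2:-}" default="${3:-}"
  # ${a:-b} falls through to b when a is unset OR empty, so an empty
  # input does not mask a configured variable.
  printf '%s\n' "${input:-${variable:-${default}}}"
}

# Illustrative values only:
#   resolve_setting ""           ""            "t3.medium"   # default wins
#   resolve_setting ""           "g4dn.xlarge" "t3.medium"   # variable wins
#   resolve_setting "p3.2xlarge" "g4dn.xlarge" "t3.medium"   # input wins
```

Dropping "variables" from the cascade, as asked above, would correspond to removing the middle argument.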

README.md

.github/workflows/runner.yml

jder · 2025-07-23T23:36:25Z

.github/workflows/runner.yml

+        required: false
+        type: number
+        default: 15
+      ssh_pubkey:


Seems a bit surprising that this doesn't override the secret like the other inputs + env vars? Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?

> Seems a bit surprising that this doesn't override the secret like the other inputs + env vars?

Good catch, this should be cleaned up now that secrets.SSH_PUBKEY is gone, we just have {inputs,vars}.ssh_pubkey.

> Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?

They are essentially redundant, but I can imagine some users preferring one or the other, so implemented both. If you think one is clearly what we should funnel users to, lmk, otherwise it's a convenience/feature to support both.

README.md

For more specificity, and better consistency with corresponding env var names

ryan-williams

I believe I responded to everything, going to mark a few as "resolved", feel free to do so on any others as appropriate, ty!

ryan-williams · 2025-07-24T15:36:42Z

.github/workflows/runner.yml

.github/workflows/runner.yml

ryan-williams · 2025-07-24T19:53:04Z

.github/workflows/runner.yml

README.md

ryan-williams · 2025-07-28T15:27:19Z

README.md

ryan-williams · 2025-07-28T15:38:42Z

README.md

+
+In your GitHub organization settings, create these secrets:
+- `AWS_ROLE`: ARN of your AWS IAM role (e.g., `arn:aws:iam::123456789012:role/GitHubActionsRole`)
+- `GH_SA_TOKEN`: GitHub token with permissions to manage self-hosted runners


Good points. I've reworked this to remove the implication that GH_SA_TOKEN and AWS_ROLE (now a variable) need to be set at the org level.

I've also included in this README some sample Pulumi+Python code for creating the OIDC connection and AWS_ROLE, based on our internal code.

Somewhat frustratingly, the "self-hosted runners (read and write)" permission isn't enough for gha-runner; the token needs "repo admin (read and write)." Per the latest README language:

2. Configure Secrets and Variables

Required Secret: GH_SA_TOKEN

This workflow requires a GitHub token with admin permissions to the repo it's run within, because the underlying gha-runner calls /actions/runners/registration-token, whose docs state:

Authenticated users must have admin access to the repository to use this endpoint.

Required Variable (or pass as input):

AWS_ROLE: ARN of your AWS IAM role (e.g. arn:aws:iam::123456789012:role/GitHubActionsRole)

Set this in your GitHub organization or repository settings by e.g.:

gh variable set AWS_ROLE --body "arn:aws:iam::123456789012:role/GitHubActionsRole"
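
One way to confirm that a token has the access described above is to call the same endpoint with it. A sketch, assuming the `gh` CLI picks up the token from the `GH_TOKEN` environment variable; the function name `can_register_runners` is hypothetical. Note that a successful call mints a short-lived registration token as a side effect.

```shell
# Sketch only: can_register_runners is a hypothetical helper name.
# Usage: can_register_runners <owner/repo>
# Succeeds if the current token may create a runner registration token,
# which per the quoted docs requires admin access to the repository.
can_register_runners() {
  local repo="$1"
  # `gh api` exits non-zero on an HTTP error (e.g. 403 or 404), so the
  # function's exit status reflects whether the token was accepted.
  gh api --method POST \
    "repos/${repo}/actions/runners/registration-token" \
    --jq '.expires_at' > /dev/null
}
```

For example: `GH_TOKEN="$GH_SA_TOKEN" can_register_runners Open-Athena/ec2 && echo "token OK"`, where `GH_SA_TOKEN` is assumed to be exported in the local shell.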

README.md

Will revert to `main` or `v1` before merging (once Open-Athena/ec2#1 lands)

ryan-williams · 2025-07-29T13:32:30Z

Per discussion with @alxmrs yesterday, @jder wdyt about renaming this repo and workflow, so that usage would change to:

-  uses: Open-Athena/ec2/.github/workflows/runner.yml@v1
+  uses: Open-Athena/runners/.github/workflows/ec2.yml@v1

Pros:

- Runners for other clouds can go alongside this one (e.g. {sky,lambda,gce,…}.yml)
- More descriptive repo basename, without redundancy in calling path

ryan-williams · 2025-08-04T14:19:22Z

My new plan is to fold this repo (and functionality from this PR) into OA/ec2-gha (see Open-Athena/ec2-gha#2).

That repo was formerly "start-aws-gha-runner", but I've made it more general (and renamed it accordingly). I don't think we need this separate repo now, I'll probably archive it.

This was referenced Jul 21, 2025

Add auto-shutdown, userdata/key_name inputs, single-instance convenience outputs Open-Athena/ec2-gha#1

Merged

logging improvements Open-Athena/gha-runner#1

Merged

ryan-williams force-pushed the dev branch 3 times, most recently from 81063f4 to 96e5f60 on July 22, 2025 06:04

initial implementation

94e815a

ryan-williams force-pushed the dev branch from 96e5f60 to 94e815a on July 22, 2025 06:09

This was referenced Jul 22, 2025

Demo GHAs: {minimal,test-gpu}.yml Open-Athena/ec2-gha-demo#1

Closed

Demo GHAs: {minimal,test-gpu,multi-job}.yml Open-Athena/ec2-gha-demo#2

Open

ryan-williams requested review from alxmrs, jder and mihasya July 22, 2025 06:29

mihasya reviewed Jul 22, 2025

mihasya reviewed Jul 22, 2025

alxmrs reviewed Jul 22, 2025

alxmrs approved these changes Jul 22, 2025

update README

7119576

jder approved these changes Jul 24, 2025

ryan-williams added 10 commits July 24, 2025 00:07

CR: case, start_aws_gha_runner/start.py xref

373c2fd

update security section

b2cdbd8

convert AWS_ROLE, SSH_PUBKEY to inputs/variables (rather than secrets)

0813864

Mention vars fallbacks in inputs.descriptions

ab0b087

rm redundant default-input-value passing

1640266

update readme

afea2c1

rename some aws_ inputs to ec2_

0c9c716

For more specificity, and better consistency with corresponding env var names

add MIT LICENSE

a89d8fe

rm README reference to aws ssm

31b7aff

CR: more README updates

1d1dcc6

ryan-williams commented Jul 28, 2025

ryan-williams requested a review from jder July 28, 2025 15:57

move to Apache 2 License

2a9e0f2

ryan-williams added a commit to Open-Athena/ec2-gha-demo that referenced this pull request Jul 28, 2025

point readme at Open-Athena/ec2@dev

ce1fb12

Will revert to `main` or `v1` before merging (once Open-Athena/ec2#1 lands)

ryan-williams requested a review from alxmrs July 28, 2025 18:51

ryan-williams mentioned this pull request Jul 31, 2025

Multi-job support (using job start/end hooks), optional CloudWatch logging, demo workflows, module/repo rename Open-Athena/ec2-gha#2

Closed

ryan-williams closed this Aug 4, 2025

4 participants
