Initial implementation: auto-shutdown EC2 GHA runner#1
Initial implementation: auto-shutdown EC2 GHA runner#1ryan-williams wants to merge 13 commits intomainfrom
Conversation
81063f4 to
96e5f60
Compare
| 3. Create a label named `gpu` (or your custom name) | ||
| 4. Maintainers can apply this label to PRs to authorize GPU runs | ||
|
|
||
| ## Minimal Example |
There was a problem hiding this comment.
Why not add a link to ec2-runner-demo? I assume there was a reason for putting the demo in a separate repo, but in case there wasn't.. can we just move it in here? That would obviate the duplicative AI-generated README as well.
There was a problem hiding this comment.
Good pt/q.
- OA/ec2-runner-demo is currently private.
- I would like to make it public (or inline its example
.ymls in this repo, as you suggest). - However, there are security implications re: having our
AWS_ROLEsecret exist on public repos (it currently only exists on 2 private OA repos).
The problem with relying solely on "Require approval for all external contributors" is that contributors are granted GHA RCE, indefinitely, once approval has been granted for one commit on one PR.
Instead, we should never approve workflows on external contributors' PRs, but instead trigger workflow runs on them ourselves:
- Push a tmp branch pointing at user's commit.
workflow_dispatchfrom that branch.
This could even be automated / GitOps'd.
LMK your thoughts about this! 🙏
If curious: rejected label-based approach
I also prototyped a version where we'd require a label (e.g. "gpu-ok") to be set on a PR, before the jobs would run. I even had this job remove the label just before it ran (after the check), so we'd have to add the label each time we wanted to run workflows on an external contributor's PR. I decided that was overkill for now, and realized the "preferred approach" above is possible/better.
There was a problem hiding this comment.
I agree that it's fine to have them with a public repo so long as we require approvals on workflow runs. But I think there are a few different options for that:

The second one sounds like the indefinite permission you describe, but the last one sounds like it requires approval every time. Is that not the case?
There was a problem hiding this comment.
Per offline discussion, I think you're right that the 3rd option seems to require permission every time an external contributor attempts to run a workflow. I will verify that using my personal account but, until I do, the protocol can be to never grant approval (and instead workflow_dispatch ourselves, from external contributors' branches, if it ever comes up).
alxmrs
left a comment
There was a problem hiding this comment.
What a cool piece of infrastructure! Just added one bash idea. This PR was a pleasure to read, good work.
jder
left a comment
There was a problem hiding this comment.
Thanks, I am excited to get this landed. Mostly 🐑 nits from me and some questions/suggestions about how this is configured.
| # EC2_KEY_NAME - Default SSH key pair name | ||
| # EC2_SECURITY_GROUP_ID - Default security group ID | ||
| # | ||
| # Priority: inputs > vars > defaults |
There was a problem hiding this comment.
🐑 🐑 🐑 just my 2 cents that having these passed explicitly in action inputs is probably better than having them be read from vars. (In the same way that I preferred passing secrets explicitly)
There was a problem hiding this comment.
Sounds reasonable to me; my thought was this priority-cascade leaves flexibility for callers to specify defaults and overrides, e.g.:
- Other orgs might use this action, and set a different org-wide default
EC2_INSTANCE_TYPE SECURITY_GROUP_IDmight be best set as an org-level variable (byops/Pulumi, in our case).- Otherwise, people wanting to SSH into an EC2 runner instance can just look up the sg ID and pass it as an input; that can be fine too.
Do you think we should remove "variables" as part of the cascade?
| required: false | ||
| type: number | ||
| default: 15 | ||
| ssh_pubkey: |
There was a problem hiding this comment.
Seems a bit surprising that this doesn't override the secret like the other inputs + env vars? Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?
There was a problem hiding this comment.
Seems a bit surprising that this doesn't override the secret like the other inputs + env vars?
Good catch, this should be cleaned up now that secrets.SSH_PUBKEY is gone, we just have {inputs,vars}.ssh_pubkey.
Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?
They are essentially redundant, but I can imagine some users preferring one or the other, so implemented both. If you think one is clearly what we should funnel users to, lmk, otherwise it's a convenience/feature to support both.
For more specificity, and better consistency with corresponding env var names
ryan-williams
left a comment
There was a problem hiding this comment.
I believe I responded to everything, going to mark a few as "resolved", feel free to do so on any others as appropriate, ty!
| # EC2_KEY_NAME - Default SSH key pair name | ||
| # EC2_SECURITY_GROUP_ID - Default security group ID | ||
| # | ||
| # Priority: inputs > vars > defaults |
There was a problem hiding this comment.
Sounds reasonable to me; my thought was this priority-cascade leaves flexibility for callers to specify defaults and overrides, e.g.:
- Other orgs might use this action, and set a different org-wide default
EC2_INSTANCE_TYPE SECURITY_GROUP_IDmight be best set as an org-level variable (byops/Pulumi, in our case).- Otherwise, people wanting to SSH into an EC2 runner instance can just look up the sg ID and pass it as an input; that can be fine too.
Do you think we should remove "variables" as part of the cascade?
| required: false | ||
| type: number | ||
| default: 15 | ||
| ssh_pubkey: |
There was a problem hiding this comment.
Seems a bit surprising that this doesn't override the secret like the other inputs + env vars?
Good catch, this should be cleaned up now that secrets.SSH_PUBKEY is gone, we just have {inputs,vars}.ssh_pubkey.
Do we need both ways of passing this? Or do we need this at all given EC2_KEY_NAME/aws_key_name?
They are essentially redundant, but I can imagine some users preferring one or the other, so implemented both. If you think one is clearly what we should funnel users to, lmk, otherwise it's a convenience/feature to support both.
| 3. Create a label named `gpu` (or your custom name) | ||
| 4. Maintainers can apply this label to PRs to authorize GPU runs | ||
|
|
||
| ## Minimal Example |
There was a problem hiding this comment.
Per offline discussion, I think you're right that the 3rd option seems to require permission every time an external contributor attempts to run a workflow. I will verify that using my personal account but, until I do, the protocol can be to never grant approval (and instead workflow_dispatch ourselves, from external contributors' branches, if it ever comes up).
README.md
Outdated
|
|
||
| In your GitHub organization settings, create these secrets: | ||
| - `AWS_ROLE`: ARN of your AWS IAM role (e.g., `arn:aws:iam::123456789012:role/GitHubActionsRole`) | ||
| - `GH_SA_TOKEN`: GitHub token with permissions to manage self-hosted runners |
There was a problem hiding this comment.
Good points. I've reworked this to remove the implication that GH_SA_TOKEN and AWS_ROLE (now a variable) need to be set at the org level.
I've also included in this README some sample Pulumi+Python code for creating the OIDC connection and AWS_ROLE, based on our internal code.
Somewhat frustratingly, "self-hosted runners (read and write)" permission isn't enough for gha-runner, the token needs "repo admin (read and write)." Per latest README language:
2. Configure Secrets and Variables
Required Secret:
GH_SA_TOKENThis workflow requires a GitHub token with admin permissions to the repo it's run within, because the underlying
gha-runnercalls/actions/runners/registration-token, whose docs state:Authenticated users must have admin access to the repository to use this endpoint.
Required Variable (or pass as input):
AWS_ROLE: ARN of your AWS IAM role (e.g.arn:aws:iam::123456789012:role/GitHubActionsRole)Set this in your GitHub organization or repository settings by e.g.:
gh variable set AWS_ROLE --body "arn:aws:iam::123456789012:role/GitHubActionsRole"
Will revert to `main` or `v1` before merging (once Open-Athena/ec2#1 lands)
|
Per discussion with @alxmrs yesterday, @jder wdyt about renaming this repo and workflow, so that usage would change to: - uses: Open-Athena/ec2/.github/workflows/runner.yml@v1
+ uses: Open-Athena/runners/.github/workflows/ec2.yml@v1Pros:
|
|
My new plan is to fold this repo (and functionality from this PR) into OA/ec2-gha (see Open-Athena/ec2-gha#2). That repo was formerly "start-aws-gha-runner", but I've made it more general (and renamed it accordingly). I don't think we need this separate repo now, I'll probably archive it. |
Uses Open-Athena/ec2-gha#1, Open-Athena/gha-runner#1.
Speeds up Ocean Emulator test-gpu job by 2x (14mins → 7mins, thanks to skipping standalone 7min "shut down instance" job; see #308), and removes need to explicitly declare a shutdown job.