Failure during orphan termination processing #4786

@sdarwin

The scale-down lambda fails to terminate orphaned AWS self-hosted runner instances and throws errors.

Background:

The latest versions of start-runner.sh and start-runner.ps1 are both intended to terminate the EC2 instance after the job has finished. You can see this by searching for "terminate" in the files. If they succeed in terminating the instance once the job is done, everything is fine: no orphans. But what happens if that termination doesn't occur?

Termination will not occur if...

  • you are using an older version of start-runner.ps1 that did not yet call "terminate". In that case the problem always happens.
  • you are modifying the script. I am currently trying to redesign start-runner.ps1 (to set the user), and whenever the script doesn't yet manage to call "terminate", the problem happens.
  • most importantly, anything unexpected fails. The whole terraform system should be robust to unexpected failures. If start-runner.sh or start-runner.ps1 glitches, or GitHub itself has problems, you may end up with an orphaned instance. Whenever scale-down encounters an orphan, it should handle the situation gracefully and terminate it. That does not seem to be happening now.

A stack trace from CloudWatch is shown below. Consider this request:

GET /repos/myorg/myrepo/actions/runners/653 - 404

This API call was added recently in https://github.com/github-aws-runners/terraform-aws-github-runner/pull/4595/files. It's a new feature, right?

type OrgRunnerList = Endpoints['GET /orgs/{org}/actions/runners']['response']['data']['runners'];
type RepoRunnerList = Endpoints['GET /repos/{owner}/{repo}/actions/runners']['response']['data']['runners'];
type RunnerState = OrgRunnerList[number] | RepoRunnerList[number];

By basic code inspection, ask yourself how this function will react if the runner doesn't exist. Once the job has finished, GitHub Actions no longer knows about the ephemeral runner. That means an endpoint such as "/orgs/{org}/actions/runners" returns nothing for it, because GitHub thinks the runner isn't there. Will that cause the 404? And if the scale-down process gets a 404, is it able to handle it and terminate the orphan?
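To make the question concrete, here is a minimal sketch of one way a scale-down step could tolerate the 404. It is not the project's actual code: `RunnerInfo`, `isNotFound`, `getRunnerInfoOrNull`, and `shouldTerminateOrphan` are hypothetical names, and the rule "terminate when the runner is unknown or not busy" is an assumption for illustration. The only detail taken from the log is that the thrown error carries a numeric `status` of 404.

```typescript
// Hypothetical sketch; none of these names come from the project's source.
interface RunnerInfo {
  id: number;
  busy: boolean;
}

// The HttpError in the log carries a numeric `status` field; check only that.
function isNotFound(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as { status?: unknown }).status === 404;
}

// Wraps any runner lookup and maps a 404 to `null` ("GitHub no longer knows this runner").
// Every other error is re-thrown so real failures stay visible.
async function getRunnerInfoOrNull(
  lookup: () => Promise<RunnerInfo>,
): Promise<RunnerInfo | null> {
  try {
    return await lookup();
  } catch (err) {
    if (isNotFound(err)) return null;
    throw err;
  }
}

// Assumed policy: an orphan may be terminated when GitHub does not know the
// runner at all, or when the runner exists but is not busy.
async function shouldTerminateOrphan(lookup: () => Promise<RunnerInfo>): Promise<boolean> {
  const info = await getRunnerInfoOrNull(lookup);
  return info === null || !info.busy;
}
```

In a real lambda the `lookup` callback would wrap the GitHub API call for the runner; passing it in as a function keeps the sketch independent of any particular client library.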


CloudWatch log message 1:

2025-09-22T20:31:03.167Z	e1066279-d200-4500-bc75-3f7a6274b4e9	ERROR	GET /repos/myorg/myrepo/actions/runners/653 - 404 with id B82E:10BDA9:D22A39:D6CC52:68D1B206 in 246ms

CloudWatch log message 2:

{
    "level": "WARN",
    "message": "Failure during orphan termination processing.",
    "timestamp": "2025-09-22T20:32:03.223Z",
    "service": "runners-scale-down",
    "sampling_rate": 0,
    "xray_trace_id": "1-68d1b242-4965478d1f42877e3fcd62a6",
    "region": "us-west-2",
    "environment": "gha-ubuntu-noble",
    "module": "scale-down",
    "aws-request-id": "7212ccf6-8d5c-4ce8-98af-493f59b0eda3",
    "function-name": "gha-ubuntu-noble-scale-down",
    "error": {
        "name": "HttpError",
        "location": "file:///var/task/index.js:158569",
        "message": "Not Found - https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository",
        "stack": "HttpError: Not Found - https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository\n    at fetchWrapper (file:///var/task/index.js:158569:11)\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async Job.doExecute (file:///var/task/index.js:126562:18)",
        "status": 404,
        "request": {
            "method": "GET",
            "url": "https://api.github.com/repos/myorg/myrepo/actions/runners/653",
            "headers": {
                "accept": "application/vnd.github.v3+json",
                "user-agent": "github-aws-runners octokit-rest.js/22.0.0 octokit-core.js/7.0.2 Node.js/22",
                "authorization": "token [REDACTED]"
            },
            "request": {}
        },
        "response": {
            "url": "https://api.github.com/repos/myorg/myrepo/actions/runners/653",
            "status": 404,
            "headers": {
                "access-control-allow-origin": "*",
                "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
                "content-encoding": "gzip",
                "content-security-policy": "default-src 'none'",
                "content-type": "application/json; charset=utf-8",
                "date": "Mon, 22 Sep 2025 20:32:03 GMT",
                "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
                "server": "github.com",
                "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
                "transfer-encoding": "chunked",
                "vary": "Accept-Encoding, Accept, X-Requested-With",
                "x-accepted-github-permissions": "administration=read",
                "x-content-type-options": "nosniff",
                "x-frame-options": "deny",
                "x-github-api-version-selected": "2022-11-28",
                "x-github-media-type": "github.v3; format=json",
                "x-github-request-id": "833A:2C77BC:D42741:D8CCE8:68D1B242",
                "x-ratelimit-limit": "5000",
                "x-ratelimit-remaining": "4840",
                "x-ratelimit-reset": "1758574803",
                "x-ratelimit-resource": "core",
                "x-ratelimit-used": "160",
                "x-xss-protection": "0"
            },
            "data": {
                "message": "Not Found",
                "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository",
                "status": "404"
            }
        }
    }
}

How to replicate the issue:

In /modules/runners/templates/start-runner.sh, comment out trap 'cleanup $? $LINENO $BASH_LINENO' EXIT so that cleanup doesn't happen.
In /modules/runners/templates/start-runner.ps1, comment out aws ec2 terminate-instances --instance-ids "$InstanceId" --region "$Region" so that cleanup doesn't happen.

Run GitHub Actions jobs. The instances will become orphans, and scale-down will then fail to terminate them.

What should happen:

In the past, orphans were scaled down without errors, and that was the right result.

@stuartp44, @npalm
