Runners not terminating after job completion – blocked queue due to token expiry (v0.12.1) #4183

@kpinarci

Controller Version

0.12.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Project: Actions-Runner-Controller (not Summerwind)
Version: 0.12.1
Deployment Method: Helm
Kubernetes Version: v1.32.5
---
1.	Deploy Actions-Runner-Controller version 0.11.0 via Helm on a Kubernetes cluster (v1.32.5) with GitHub Enterprise integration.
2.	Verify that runners operate correctly under normal load.
3.	Upgrade to version 0.12.1 by fully removing all ARC-related resources, including CustomResourceDefinitions (CRDs), and perform a clean installation using Helm.
4.	Reconfigure and deploy runners as before.
5.	Execute various GitHub Actions workflows across multiple repositories.
6.	After some time, observe that:
	•	Certain jobs appear completed or failed on GitHub Enterprise.
	•	Some runner pods remain active indefinitely and do not exit.
	•	Logs within those pods show repeated registration failures with messages like:
"Registration was not found or is not medium trust."
	•	The issue affects different runners at different times with no identifiable pattern (i.e., across various repos and workflows).
7.	As a result, the runner pool becomes blocked, and new jobs are not executed until affected pods are manually terminated.

Describe the bug

Hello team,

After upgrading to ARC version 0.11.0, we noticed that some runners enter a state where they run indefinitely and block new jobs from being picked up. Inside the runner containers, we observed registration failures due to expired tokens.

On the GitHub Enterprise side, those jobs appear to have already completed or failed, but the corresponding runners keep running. It seems that the listener is unable to properly clean up the runner after a job finishes and continuously attempts to re-register it with GitHub.

We were hoping this issue would be resolved in version 0.12.1, but unfortunately, it still persists. In one instance, a pod even ended up in an evicted state.

As a temporary workaround to prevent the job queue from stalling, we’ve implemented a cron job that monitors runner logs and forcefully terminates any pod where the log contains:
"Registration was not found or is not medium trust."
This helps keep the runners processing jobs but doesn’t address the root cause.
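For anyone who needs the same stopgap, here is a minimal sketch of the kind of check our cron job performs. It is not our exact script: the namespace `arc-runners`, the container name `runner`, and the 200-line log tail are assumptions for illustration, and it shells out to `kubectl`, so it needs a service account allowed to list pods, read logs, and delete pods.

```python
import subprocess

# Error text as it appears in the runner logs attached to this issue.
MARKER = "Registration was not found or is not medium trust"

# Hypothetical values: adjust to your installation.
NAMESPACE = "arc-runners"
CONTAINER = "runner"


def is_stuck(log_text: str) -> bool:
    """Return True if the log contains the registration failure marker."""
    return MARKER in log_text


def find_stuck_pods(pod_logs: dict[str, str]) -> list[str]:
    """Given {pod_name: log_text}, return the sorted names of pods to terminate."""
    return sorted(name for name, text in pod_logs.items() if is_stuck(text))


def kubectl(*args: str) -> str:
    """Run kubectl and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def main() -> None:
    # List every pod in the runner namespace.
    names = kubectl(
        "get", "pods", "-n", NAMESPACE,
        "-o", "jsonpath={.items[*].metadata.name}",
    ).split()

    logs: dict[str, str] = {}
    for name in names:
        try:
            logs[name] = kubectl(
                "logs", name, "-n", NAMESPACE, "-c", CONTAINER, "--tail=200"
            )
        except subprocess.CalledProcessError:
            # Pod may still be initializing or already gone; skip it.
            continue

    for name in find_stuck_pods(logs):
        kubectl("delete", "pod", name, "-n", NAMESPACE, "--wait=false")


if __name__ == "__main__":
    main()
```

Matching on a single log line is deliberately crude. It can also kill a pod that would have recovered on its own, which is one more reason we consider this a workaround and not a fix.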

Is this a known issue, and do you have any recommendations or a potential fix?

Describe the expected behavior

Runners should terminate properly after job completion or failure. They should not attempt to re-register if the job has already ended and the registration token has expired. Additionally, the controller should ensure that expired or stuck runners are cleaned up automatically to avoid blocking the job queue.

Additional Context

githubConfigUrl: "https://github.enterprise.example.com/enterprises/***"
githubConfigSecret: "github-token"
proxy:
  http:
    url: http://**********
  https:
    url: http://**********
  noProxy:
    - localhost
    - 127.0.0.1
    - 10.0.0.0/8
    - 172.16.0.0/12
    - 192.168.0.0/16
maxRunners: 5
minRunners: 1
runnerGroup: "enterprise-gpr-m-02"
runnerScaleSetName: "enterprise-gpr-m"
labels:
  group: enterprise-runners
githubServerTLS:
  certificateFrom:
    configMapKeyRef:
      name: ca
      key: ca.crt
  runnerMountPath: /usr/local/share/ca-certificates/
template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: actions/actions-runner/full:2.322.1
        imagePullPolicy: Always
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
        resources:
          requests:
            cpu: "50m"
            memory: "200Mi"
          limits:
            memory: "250Mi"
      - name: init-dind-rootless
        image: docker:27.3.1-dind-rootless
        imagePullPolicy: IfNotPresent
        command:
          - sh
          - -c
          - |
            set -x
            cp -a /etc/. /dind-etc/
            echo 'runner:x:1001:1001:runner:/home/runner:/bin/ash' >> /dind-etc/passwd
            echo 'runner:x:1001:' >> /dind-etc/group
            echo 'runner:100000:65536' >> /dind-etc/subgid
            echo 'runner:100000:65536' >>  /dind-etc/subuid
            chmod 755 /dind-etc;
            chmod u=rwx,g=rx+s,o=rx /dind-home
            chown 1001:1001 /dind-home
            mkdir -p /var/lib/docker
            chmod u=rwx,g=rx+s,o=rx /var/lib/docker
            chown -R 1001:1001 /var/lib/docker
        securityContext:
          runAsUser: 0
        volumeMounts:
          - mountPath: /dind-etc
            name: dind-etc
          - mountPath: /dind-home
            name: dind-home
          - name: docker-data-root
            mountPath: /var/lib/docker
        resources:
          requests:
            cpu: "50m"
            memory: "200Mi"
          limits:
            memory: "250Mi"
      - name: init-qemu-registrar
        image: tonistiigi/binfmt:latest
        command: [ "/usr/bin/binfmt", "--install", "all" ]
        imagePullPolicy: Always
        securityContext:
          runAsUser: 0
          privileged: true
        resources:
          requests:
            cpu: "25m"
            memory: "50Mi"
          limits:
            memory: "100Mi"
    containers:
      - name: runner
        image: actions/actions-runner/full:2.322.1
        imagePullPolicy: Always
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/user/1001/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - mountPath: /tmp
            name: tmpdir
          - name: sysfs
            mountPath: /sys
            readOnly: false
        resources:
          requests:
            cpu: "100m"
            memory: "500Mi"
          limits:
            memory: "500Mi"
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN
              - SYS_PTRACE
              - DAC_OVERRIDE
              - FOWNER
              - CHOWN
              - SETUID
              - SETGID
          runAsUser: 1001
          runAsGroup: 1001
          privileged: false
      - name: dind
        image: docker:27.3.1-dind-rootless
        imagePullPolicy: IfNotPresent
        args:
          - dockerd
          - --config-file=/etc/docker/daemon.json
        securityContext:
          privileged: true
          runAsUser: 1001
          runAsGroup: 1001
          capabilities:
            add:
              - SYS_ADMIN
              - MKNOD
              - CHOWN
              - SETUID
              - SETGID
        resources:
          requests:
            cpu: "200m"
            memory: "650Mi"
          limits:
            memory: "3346Mi"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
          - name: dind-etc
            mountPath: /etc
          - name: dind-home
            mountPath: /home/runner
          - name: docker-data-root
            mountPath: /var/lib/docker
          - name: sysfs
            mountPath: /sys
            readOnly: false
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-etc
        emptyDir: {}
      - name: dind-home
        emptyDir: {}
      - name: tmpdir
        emptyDir: {}
      - name: docker-data-root
        emptyDir: {}
      - name: sysfs
        hostPath:
          path: /sys
          type: Directory

Controller Logs

-

Runner Pod Logs

√ Connected to GitHub
[RUNNER 2025-07-16 07:45:34Z INFO Terminal] WRITE LINE: 

[RUNNER 2025-07-16 07:45:34Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-07-16 07:45:35Z ERR  GitHubActionsService] POST request to https://github.enterprise.example.com/_services/vstoken/_apis/oauth2/token/eb530d92-6032-4cac-8ece-acf7fa59845f failed. HTTP Status: BadRequest
[RUNNER 2025-07-16 07:45:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener] Catch exception during create session.
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration was not found or is not medium trust. ClientType: 
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener] Test oauth app registration.
[RUNNER 2025-07-16 07:45:35Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-07-16 07:45:35Z ERR  GitHubActionsService] POST request to https://github.enterprise.example.com/_services/vstoken/_apis/oauth2/token/eb530d92-6032-4cac-8ece-acf7fa59845f failed. HTTP Status: BadRequest
[RUNNER 2025-07-16 07:45:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-07-16 07:45:35Z INFO MessageListener] Retriable exception: Registration was not found or is not medium trust. ClientType: 
[RUNNER 2025-07-16 07:45:35Z INFO MessageListener] Sleeping for 30 seconds before retrying.
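The runner treats this error as retriable and sleeps 30 seconds between attempts, so a stuck pod repeats the same "Retriable exception" line indefinitely. A rough sketch of how that loop could be detected from the log format shown above, instead of reacting to a single occurrence, follows. The regular expression is inferred only from these sample lines, and the `min_retries=3` threshold is an arbitrary assumption.

```python
import re
from datetime import datetime

# Pattern inferred from the log lines in this issue:
# [SOURCE yyyy-mm-dd hh:mm:ssZ LEVEL Component] message
LINE_RE = re.compile(
    r"^\[(?P<source>\w+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z "
    r"(?P<level>\w+)\s+(?P<component>\w+)\] (?P<message>.*)$"
)

RETRY_PREFIX = (
    "Retriable exception: Registration was not found or is not medium trust"
)


def registration_retry_times(log_text: str) -> list[datetime]:
    """Return the timestamp of every retried registration failure in the log."""
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if (
            m
            and m["component"] == "MessageListener"
            and m["message"].startswith(RETRY_PREFIX)
        ):
            times.append(datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"))
    return times


def looks_stuck(log_text: str, min_retries: int = 3) -> bool:
    """Treat a runner as stuck once the failure has been retried min_retries times."""
    return len(registration_retry_times(log_text)) >= min_retries
```

With the default threshold and the 30-second sleep seen in the logs, a pod would be flagged roughly a minute after the first failure.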

Labels

  • bug: Something isn't working
  • gha-runner-scale-set: Related to the gha-runner-scale-set mode
  • needs triage: Requires review from the maintainers
