Long-running container steps "succeed" prematurely when running on AKS #4190

@jharlow1

Description

Controller Version

0.6.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Have a scale set like:


spec:
  githubConfigSecret: gha-runner-scale-set
  githubConfigUrl: xxxxxx
  maxRunners: 20
  minRunners: 2
  runnerGroup: k8s-standard
  runnerScaleSetName: arc-k8s-standard-azure
  template:
    spec:
      containers:
      - command:
        - /home/runner/run.sh
        env:
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"
        - name: ACTIONS_RUNNER_CONTAINER_HOOKS
          value: /home/runner/k8s/index.js
        - name: ACTIONS_RUNNER_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: actions/actions-runner:latest
        name: runner
        resources:
          limits:
            memory: 2G
          requests:
            cpu: 1
            memory: 2G
        volumeMounts:
        - mountPath: /home/runner/_work
          name: work
      restartPolicy: Never
      securityContext:
        fsGroup: 123
      serviceAccountName: arc-k8s-standard-azure-gha-rs-kube-mode
      volumes:
      - ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 25Gi
              storageClassName: managed-csi-premium
        name: work


Run a simple workflow that sleeps for 10 minutes and then exits with exit code 1:


name: "Azure test"
on:
  workflow_dispatch:

jobs:
  test:
    name: "Test Long-running Fail job"
    runs-on: arc-k8s-standard-azure
    container:
      image: ubuntu-22.04-0.44.0
    steps:
      - name: "Run Test"
        run: |
          echo "This is a test job that will sleep and then fail after 10 minutes."
          sleep 600
          exit 1
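Independent of the runner, the step body can be sanity-checked locally with a shortened sleep (1 second substituted for 600) to confirm it really produces a nonzero exit code:

```shell
# Shortened version of the step body (1s instead of 600s):
# verify it really exits nonzero, independent of the runner.
bash -c 'sleep 1; exit 1'
echo "exit code: $?"   # prints: exit code: 1
```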


Run the job

Describe the bug

The step runs for just over 5 minutes, then "succeeds" as if the script had returned 0; the logs even report that the script process exited with code 0:

[WORKER 2025-07-28 15:30:52Z INFO ProcessInvokerWrapper] Finished process 159 with exit code 0, and elapsed time 00:05:02.5167359.

Oddly, this only seems to occur on AKS. With the same scale set and job configuration on Amazon EKS, everything works as expected: the job runs for 10 minutes and then fails with exit code 1.
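Since the process ends at roughly five minutes (00:05:02 in the log above), a hypothetical matrix workflow could help bisect where the cutoff lies by varying the sleep duration; the durations below are illustrative, not from this report:

```yaml
jobs:
  bisect:
    strategy:
      fail-fast: false
      matrix:
        duration: [240, 270, 300, 330, 360]   # seconds; illustrative values
    runs-on: arc-k8s-standard-azure
    container:
      image: ubuntu-22.04-0.44.0
    steps:
      - name: "Sleep then fail"
        run: |
          sleep ${{ matrix.duration }}
          exit 1
```

Any duration whose run reports success despite the trailing `exit 1` would sit past the cutoff, bracketing the point at which the step is being cut short.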

Describe the expected behavior

The job should run for 10 minutes and then fail with exit code 1.

Additional Context

n/a

Controller Logs

https://gist.github.com/jharlow1/b83430815bb3c4bd6fe8a2efae25cb93

Runner Pod Logs

https://gist.github.com/jharlow1/6080b1b26e4e16c3c175ab07c9f2e95c

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
