macos-15 (only) runner hangs, or not, depending on the Provisioner? #12725

@pcolby

Description

Overview

I have a build workflow that has begun hanging sporadically on a specific step, on some macos-15 runners only. Whether it hangs or not seems to depend on how the image is started - ie the output of the "Runner Image Provisioner" (see below).

Specifically, if the Runner Image Provisioner reports as:

  2.0.449.1+6d46f4794670abc677653e70ad733fc4b7478209

the workflow will work every time. Whereas, if the Runner Image Provisioner reports as:

  Hosted Compute Agent
  Version: 20250711.363
  Commit: 6785254374ce925a23743850c1cb91912ce5c14c
  Build Date: 2025-07-11T20:04:25Z

a workflow step will hang every time.
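To make the correlation easier to check across many runs, the two provisioner outputs above can be told apart mechanically. The following is only a sketch: `classify_provisioner` and its labels are hypothetical names, and it assumes the "Runner Image Provisioner" text has already been saved from the job log (for example via `gh run view <run-id> --log`) and is fed in on stdin.

```shell
# Hypothetical helper: classify the "Runner Image Provisioner" text read from stdin.
# The two patterns are taken from the two outputs quoted above; any other
# provisioner output is reported as "unknown".
classify_provisioner() {
  local text
  text="$(cat)"
  if printf '%s\n' "$text" | grep -q 'Hosted Compute Agent'; then
    # Multiline form (Version / Commit / Build Date): the one that hangs
    echo "hosted-compute-agent"
  elif printf '%s\n' "$text" | grep -Eq '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\+[0-9a-f]{40}'; then
    # Single-line "<version>+<40-hex-digit commit>" form: the one that works
    echo "versioned-provisioner"
  else
    echo "unknown"
  fi
}
```

Feeding only the provisioner section of the log, rather than the whole log, avoids accidental matches elsewhere in the output.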

What Hangs

The step that actually hangs is running Qt-based unit tests via CMake, like:

ctest --output-on-failure --test-dir "$RUNNER_TEMP/release" --verbose

And when it hangs (ie on specific macos-15 runners only), it hangs like:

...
1: Test command: /Users/runner/work/_temp/coverage/test/unit/cli/testAbstractCommand "-o" "AbstractCommand.tap,tap" "-o" "-,txt"
1: Working Directory: /Users/runner/work/_temp/coverage/test/unit/cli
1: Environment variables: 
1:  LLVM_PROFILE_FILE=AbstractCommand.profraw
1: Test timeout computed to be: 1500
1: ********* Start testing of TestAbstractCommand *********
1: Config: Using QtTest library 5.13.2, Qt 5.13.2 (x86_64-little_endian-lp64 shared (dynamic) release build; by Clang 10.0.0 (clang-1000.11.45.5) (Apple))
1: PASS   : TestAbstractCommand::initTestCase()
Error: The action 'Test' has timed out after x minutes.

I have experienced similar hangs in the past that occur because the tests link to Bluetooth libraries, which can result in macOS pausing the app while it displays a user prompt for Bluetooth access permission. So a hang at this point might be related to additional Bluetooth, or other permissions now being prompted for, but of course, there's no way for me to diagnose that, since I cannot see the macOS UI.

I do explicitly grant Bluetooth permission to the provisioner runner, so it's probably not Bluetooth, but some other similar issue (macOS prompting the user for something else?).
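One possible way to see what macOS is showing when the hang occurs is to wrap the test command in a watchdog that collects diagnostics before killing it. This is a sketch under assumptions, not something verified on these runners: `run_with_watchdog`, `capture_macos_diagnostics` and `DIAG_DIR` are hypothetical names; it assumes `screencapture` works in the runner's session and that permission prompts are logged under the `com.apple.TCC` subsystem.

```shell
# Hypothetical watchdog: run_with_watchdog SECONDS HOOK CMD [ARGS...]
# Runs CMD in the background; if it is still alive after SECONDS, calls
# HOOK with CMD's PID, kills CMD and its direct children, and returns 124.
run_with_watchdog() {
  local limit="$1" hook="$2" cmd_pid elapsed=0
  shift 2
  "$@" &
  cmd_pid=$!
  while kill -0 "$cmd_pid" 2>/dev/null; do
    if [ "$elapsed" -ge "$limit" ]; then
      "$hook" "$cmd_pid" || true
      pkill -P "$cmd_pid" 2>/dev/null || true   # direct children (the test binary)
      kill "$cmd_pid" 2>/dev/null || true
      wait "$cmd_pid" 2>/dev/null || true
      return 124
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  wait "$cmd_pid"   # propagate CMD's own exit status
}

# Hypothetical hook for macOS runners; every step is best-effort.
capture_macos_diagnostics() {
  local pid="$1" out="${DIAG_DIR:-${RUNNER_TEMP:-/tmp}/hang-diagnostics}" child
  mkdir -p "$out"
  # Screenshot of whatever is on screen (e.g. a permission prompt), no sound
  screencapture -x "$out/screen.png" || true
  # Sample each direct child of ctest for 5 seconds to see where it is blocked
  for child in $(pgrep -P "$pid"); do
    sample "$child" 5 -file "$out/sample-$child.txt" || true
  done
  # Recent privacy (TCC) log entries; the subsystem name is an assumption
  log show --last 5m --predicate 'subsystem == "com.apple.TCC"' > "$out/tcc.log" || true
}
```

It could be used as `run_with_watchdog 60 capture_macos_diagnostics ctest --output-on-failure --test-dir "$RUNNER_TEMP/release" --verbose`, with the diagnostics directory uploaded as a build artifact afterwards.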

To be clear, this works 100% of the time on macos-13 and macos-14 runners. And it works 100% of the time on macos-15 runners that start up with the 2.0.449.1+6d46f4794670abc677653e70ad733fc4b7478209 provisioner, but fails (hangs) 100% of the time on the macos-15 runners with that multiline Provisioner output.

Of course I can slowly track this down with a lot of trial and error, but I'd like to be sure that the Provisioner differences are intended first. Diagnosing this hang will be a very slow process 😅

Runner Image Provisioner

Runners that work start like:

Image

And ones that fail, start like:

Image

The only obvious difference that I can see (apart from the different ordering) is that the "Runner Image Provisioner" is completely different. Everything else seems to be the same.

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • macOS 15
  • macOS 15 Arm64
  • Windows Server 2019
  • Windows Server 2022
  • Windows Server 2025

Image version and build link

Here's one of the earlier runs where I saw this happening, with image version 20250721.2009: https://github.com/pcolby/dokit/actions/runs/16782089041 In this case, you can see I've re-run the failed jobs several times, with some of the previously-failing builds getting the different provisioner and then passing instead.

And here's an example run where I've:

  • restricted the jobs/matrix to just the macos-15 builds;
  • added some explicit retry-on-timeout handler (which did not help at all)

https://github.com/pcolby/dokit/actions/runs/16842562995/job/47716474819

Again, some passed, and some failed, each result strictly correlated to the provisioner output.

Is it regression?

Yes, but the same image version (20250722.2025) seems to succeed or fail, depending on the Image Provisioner.

Expected behavior

The hanging step should finish within several seconds (typically well under 10 seconds, but always less than 30 seconds).

Actual behavior

The hanging step hangs indefinitely. I typically have a 1-minute timeout set, but if I remove the timeout, the step runs until the default timeout (several hours).
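Since the tests normally finish in under 30 seconds, a per-test timeout could be passed to CTest so that the hung test is reported as a timed-out test, rather than the whole step being cancelled. A minimal sketch, assuming CTest is able to kill the hung process; `run_tests_with_timeout` and the `CTEST` override variable are hypothetical names, while `--timeout` is CTest's option for the default per-test timeout in seconds (the log above shows it currently being computed as 1500).

```shell
# Hypothetical wrapper: run_tests_with_timeout TEST_DIR [SECONDS]
# Runs the same ctest command as the workflow step, plus a per-test timeout.
# CTEST can be set to substitute a different ctest binary.
run_tests_with_timeout() {
  local test_dir="$1" per_test_timeout="${2:-30}"
  "${CTEST:-ctest}" --output-on-failure --test-dir "$test_dir" \
    --timeout "$per_test_timeout" --verbose
}
```

For example, `run_tests_with_timeout "$RUNNER_TEMP/release" 30` would fail the step as soon as any single test exceeds 30 seconds.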

Repro steps

  1. build a Qt-based C++ library with unit tests
  2. run the unit tests on multiple macos-15 runners
  3. see which test runs hang and which ones pass

I'll begin trying to cut down the project to see how small I can make it while still reproducing the issue, but that's likely to be a very long and slow process, so any help I can get from the GitHub side would be very much appreciated! Thanks.
