Skip to content

Fix Metal MPS encoder lifecycle and broaden macOS compatibility#308

Open
robtaylor wants to merge 3 commits intohuggingface:mainfrom
robtaylor:fix-metal-mps-lifecycle
Open

Fix Metal MPS encoder lifecycle and broaden macOS compatibility#308
robtaylor wants to merge 3 commits intohuggingface:mainfrom
robtaylor:fix-metal-mps-lifecycle

Conversation

@robtaylor
Copy link

@robtaylor robtaylor commented Mar 4, 2026

Summary

  • Fix MPS encoder lifecycle: Template and examples create compute encoders directly via [commandBuffer computeCommandEncoder], bypassing PyTorch's MPS stream encoder management (kernel coalescing). This causes a fatal crash when any kernel is called twice in sequence: A command encoder is already encoding to this command buffer. Fixed by using stream->commandEncoder() from PyTorch's MPSStream API.
  • Lower Metal standard to metal3.1 (macOS 14+): All current kernel features (bfloat16_t, simd_sum, simd_shuffle, threadgroup_barrier) are available in Metal 3.1. Previous metal4.0 required macOS 26.
  • Multi-strategy Metal toolchain detection: Support macOS 14/15 (metal bundled in Xcode) and macOS 26 (separate Metal toolchain cryptex). Falls back to scanning /Applications/Xcode*.app when xcrun/xcode-select are unavailable in the Nix sandbox.
  • macOS CI matrix: Build-test on macos-14, macos-15, and macos-26-xlarge (arm64 with GPU). macOS 14 is marked best-effort (continue-on-error) due to MPS OOM on 16GB runners.
  • Update Metal docs: Document macOS 15+ as supported baseline, macOS 14 as best-effort.

Files changed

  • build2cmake/src/templates/metal/compile-metal.cmake — metal3.1 default, multi-strategy toolchain detection
  • builder/lib/torch-extension/arch.nix — toolchain fallback for macOS 14/15, clear SDKROOT in xcrunHost
  • template/__KERNEL_NAME_NORMALIZED___metal/__KERNEL_NAME_NORMALIZED__.mm — use MPSStream encoder API
  • builder/examples/relu/relu_metal/relu.mm — same fix
  • builder/examples/extra-data/relu_metal/relu.mm — same fix
  • .github/workflows/build_kernel_macos.yaml — macOS version matrix, sandbox=relaxed, macos-14 best-effort
  • docs/source/builder/metal.md — macOS version support table, updated requirements

Test plan

  • Build-tested on macos-14 (Xcode 15), macos-15 (Xcode 16), macos-26 (Xcode 26) via CI matrix
  • GPU tests pass on macos-15 and macos-26 (relu + relu-metal-cpp)
  • macos-14 builds succeed, test_relu_layer OOM (known, MPS memory on 16GB runner)
  • Verified fused-rms-norm kernel: 74/74 tests pass (was crashing on sequential calls)
  • Verified rotary-embedding kernel: 217/217 tests pass
  • Crash is 100% reproducible before fix, 0% after fix

Fixes #307

@robtaylor robtaylor marked this pull request as draft March 4, 2026 13:39
@robtaylor
Copy link
Author

robtaylor commented Mar 4, 2026

Hi! A couple of questions about Metal CI:

  1. The build_kernel_macos.yaml workflow currently only builds but doesn't run tests (the comment mentions "Also run tests once we have a macOS runner"). Would it be possible to enable a macos-26-xlarge runner for this repo to get Metal GPU test coverage? The larger runners have actual Metal GPU access.

  2. If that's not feasible, I'm planning to set up a Mac Mini M2 as a self-hosted runner for Metal kernel testing. Would that be welcome as a contribution?

For context, we've been developing Metal implementations of rotary-embedding (217/217 tests pass) and fused-rms-norm (74/74 tests pass), and discovered a crash-on-sequential-calls bug in the MPS encoder pattern used by the template and examples (this PR fixes it). Having CI that actually runs Metal kernels would catch these kinds of issues.

@robtaylor robtaylor marked this pull request as ready for review March 4, 2026 14:48
@robtaylor robtaylor marked this pull request as draft March 4, 2026 14:50
@robtaylor robtaylor force-pushed the fix-metal-mps-lifecycle branch 10 times, most recently from ee0c112 to 33406e1 Compare March 4, 2026 16:54
@robtaylor robtaylor changed the title Fix Metal standard version and MPS encoder lifecycle Fix Metal MPS encoder lifecycle and broaden macOS compatibility Mar 4, 2026
@robtaylor robtaylor marked this pull request as ready for review March 4, 2026 16:55
@robtaylor robtaylor force-pushed the fix-metal-mps-lifecycle branch from 33406e1 to 3ae5790 Compare March 4, 2026 17:06
@robtaylor
Copy link
Author

FYI: We've set up a CI mirror at ChipFlow/kernels to validate these changes on real GPU hardware (GitHub-hosted macos-*-xlarge ARM64 runners with Metal GPU access).

Latest CI run (all green): https://github.com/ChipFlow/kernels/actions/runs/22684099194

The stacked branch at ChipFlow/kernels#2 includes this PR's patches plus additional fixes (relu-metal-cpp example). Results:

Runner relu build relu test relu-metal-cpp build relu-metal-cpp test
macos-26-xlarge
macos-15-xlarge
macos-14-xlarge ⚠️ OOM (best-effort) - -

macOS 14 is marked best-effort via continue-on-error — builds succeed but test_relu_layer OOMs on runners with 16GB unified memory.

robtaylor added a commit to ChipFlow/fused-rms-norm that referenced this pull request Mar 4, 2026
The upstream kernel-builder doesn't yet have multi-strategy Metal
toolchain detection for macOS 14/15. Use our ChipFlow fork's
metal-stack branch which includes the fix (PR huggingface/kernels#308).

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
@danieldk
Copy link
Member

danieldk commented Mar 5, 2026

Nice work! One question, isn't Metal 4 required for the new Neural Accelerators of M5 GPUs?

@robtaylor
Copy link
Author

Thanks @danieldk!

Good question. Metal 4.0 (-std=metal4.0air64_v28) is indeed needed for M5 Neural Engine / ANE features, but our current kernels only target the GPU compute pipeline — they don't use Neural Accelerator APIs.

We default to metal3.1 (air64_v26) for broad compatibility:

  • metal3.1 → macOS 14+ (M1 and later)
  • metal3.2 → macOS 15+
  • metal4.0 → macOS 26+ only

Individual kernels can override via the metal_std_version field in build.toml if they need Metal 4 features. So this is a safe default that doesn't prevent anyone from targeting Metal 4 when needed.

When ANE support becomes relevant (e.g., for matrix multiplication offload on M5), we'd add a metal4.0 kernel variant alongside the existing GPU compute path.

@danieldk
Copy link
Member

danieldk commented Mar 9, 2026

Even though they are not wired up yet, we already have Neural Accelerator code in https://github.com/huggingface/kernels-community/tree/830d15e09d865f5ef43b3873fd98f82442f7095c/mlx-quantization-metal-kernels/quantization_mlx . Also, I think it is probably not a good idea to downgrade to Metal < 4 by default, since we have shipped with Metal 4 as the default in releases, so I think if we add back Metal < 4 support it should be opt-in from build.toml.

Another thing I'm a bit worried about is the UX. For CUDA/XPU/ROCm versions, we encode the required library version in the build variant. We should probably have a similar mechanism for macOS to ensure that a version is downloaded that is compatible with the system's Metal (e.g. to avoid loading a Metal 4 kernel on a system that does not support it).

That said, I am not really sure it is worth it, Apple pushes people pretty hard to move to new releases, so I am not really sure if it is worth complicating beyond the current policy (where we support the previous macOS version for a few months).

What do you think @drbh ?

@robtaylor
Copy link
Author

robtaylor commented Mar 9, 2026 via email

Use stream->commandEncoder() instead of creating encoders directly via
[cmdBuf computeCommandEncoder] to properly integrate with PyTorch's MPS
stream encoder lifecycle management (kernel coalescing). Direct encoder
creation bypasses the stream's internal _commandEncoder state and crashes
on sequential kernel dispatches.

Lower the default Metal standard from metal3.2 (macOS 15+) to metal3.1
(macOS 14+) since all current kernel features (bfloat16_t, simd_sum,
simd_shuffle, threadgroup_barrier) are available in Metal 3.1.

Add multi-strategy Metal toolchain detection for macOS 14+:
- Separate Metal toolchain component (macOS 26+ cryptex mount)
- xcrun/xcode-select based detection
- Direct /Applications/Xcode*.app filesystem scan fallback

Also clear SDKROOT in xcrunHost to prevent Nix-set SDK paths from
interfering with system xcrun.

Fixes: huggingface#307

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Test Metal kernel builds across multiple macOS versions to verify
compatibility with the metal3.1 standard (macOS 14+). Use sandbox=relaxed
for Nix to support __noChroot builds that access the host Metal toolchain.
The separate Metal toolchain download is only needed on macOS 26+.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
macOS 14 builds succeed but MPS tests may OOM on runners with
limited unified memory. Use continue-on-error so macos-14 failures
don't block the workflow. Update Metal docs to reflect macOS 15+
as the supported baseline with macOS 14 best-effort.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
@robtaylor robtaylor force-pushed the fix-metal-mps-lifecycle branch from 0913ee4 to c810460 Compare March 9, 2026 14:51
@danieldk
Copy link
Member

I'll note that telemetry data shows there's still a sizable number of folk still on macos 14 still (!)

Unless I'm looking at the stats wrong, it only seems to be around 3%?

Any chance we can get macos-xlarge enabled for this repo?

I think xlarge should work with our subscription.

More in general, we discussed supporting macOS multiple versions in our weekly sync yesterday. It is certainly something we would like to support, especially to accommodate the few months where the majority of users has not moved from macOS N to N+1 yet. The last cycle we did this by holding off the N + 1 requirement for some months, but that is not ideal, because it does not allow specific kernels to use features from N + 1.

However, since this requires that we change the macOS build variant format to encode the Metal version, which is a large, impactful change, we think it makes most sense to time this with the macOS 27 release and not change this mid-cycle.

@robtaylor
Copy link
Author

Unless I'm looking at the stats wrong, it only seems to be around 3%?

You're right about 14.x — I went back and checked the raw data and realised I was looking at numbers from early 2025 when it was 14-16%. It's ~2.8% now.

However, macOS 15.x is still at 32.3% as of late February. And 15.x users are notably stickier than 14.x was — 14.x was already down to ~8% at the same point after its successor launched. At current decay rate, 15.x won't reach ~3% until around April 2027. So the multi-version story is really about macOS 15 vs 26, not 14.

I think xlarge should work with our subscription.

That would be great! Happy to help set up the workflow if useful.

we think it makes most sense to time this with the macOS 27 release and not change this mid-cycle.

Agreed, that's the right timing. By macOS 27 launch, 15.x will likely still be at 10-15%, so the build variant format change would land exactly when it matters. The metal-std-version field gives kernel authors the escape hatch in the meantime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Metal example uses unsafe MPS encoder pattern that crashes on sequential calls

2 participants