Fix Metal MPS encoder lifecycle and broaden macOS compatibility #308
robtaylor wants to merge 3 commits into huggingface:main
Hi! A couple of questions about Metal CI:
For context, we've been developing Metal implementations of |
Force-pushed from ee0c112 to 33406e1.
Force-pushed from 33406e1 to 3ae5790.
|
FYI: We've set up a CI mirror at ChipFlow/kernels to validate these changes on real GPU hardware (GitHub-hosted …
Latest CI run (all green): https://github.com/ChipFlow/kernels/actions/runs/22684099194
The stacked branch at ChipFlow/kernels#2 includes this PR's patches plus additional fixes (relu-metal-cpp example). Results:
macOS 14 is marked best-effort via |
The upstream kernel-builder doesn't yet have multi-strategy Metal toolchain detection for macOS 14/15. Use our ChipFlow fork's metal-stack branch, which includes the fix (PR huggingface/kernels#308).
Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
|
Nice work! One question, isn't Metal 4 required for the new Neural Accelerators of M5 GPUs? |
|
Thanks @danieldk! Good question. Metal 4.0 ( We default to
Individual kernels can override via the When ANE support becomes relevant (e.g., for matrix multiplication offload on M5), we'd add a |
|
Even though they are not wired up yet, we already have Neural Accelerator code in https://github.com/huggingface/kernels-community/tree/830d15e09d865f5ef43b3873fd98f82442f7095c/mlx-quantization-metal-kernels/quantization_mlx .
Also, I think it is probably not a good idea to downgrade to Metal < 4 by default, since we have shipped with Metal 4 as the default in releases. If we add back Metal < 4 support, it should be opt-in from build.toml.
Another thing I'm a bit worried about is the UX. For CUDA/XPU/ROCm versions, we encode the required library version in the build variant. We should probably have a similar mechanism for macOS to ensure that the downloaded version is compatible with the system's Metal (e.g. to avoid loading a Metal 4 kernel on a system that does not support it).
That said, I am not really sure it is worth complicating things beyond the current policy (where we support the previous macOS version for a few months), since Apple pushes people pretty hard to move to new releases.
What do you think, @drbh?
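To make the build-variant idea concrete, here is a minimal Python sketch of how a loader might pick a compatible variant if the Metal version were encoded in the variant name. The `-metalXY-` naming and the example variant strings are hypothetical (no such format exists in kernels today); the macOS minimums are the ones quoted in this thread (metal3.1 on macOS 14+, metal3.2 on macOS 15+, metal4.0 on macOS 26).

```python
import re
from typing import List, Optional, Tuple

# Minimum macOS major version per Metal standard, as quoted in this thread.
MIN_MACOS_FOR_METAL = {(3, 1): 14, (3, 2): 15, (4, 0): 26}

# Hypothetical naming scheme: "-metal31-" in a variant name means Metal 3.1.
_VARIANT_RE = re.compile(r"-metal(\d)(\d)-")


def metal_version_of(variant: str) -> Tuple[int, int]:
    """Extract the (major, minor) Metal version encoded in a variant name."""
    match = _VARIANT_RE.search(variant)
    if match is None:
        raise ValueError(f"no Metal version encoded in {variant!r}")
    return int(match.group(1)), int(match.group(2))


def pick_variant(variants: List[str], macos_major: int) -> Optional[str]:
    """Return the variant with the newest Metal standard this macOS can load."""
    usable = [
        v
        for v in variants
        # Unknown Metal versions are treated as unusable on every macOS.
        if MIN_MACOS_FOR_METAL.get(metal_version_of(v), float("inf")) <= macos_major
    ]
    return max(usable, key=metal_version_of, default=None)


# Hypothetical variant names, for illustration only.
variants = ["torch29-metal31-aarch64-darwin", "torch29-metal40-aarch64-darwin"]
print(pick_variant(variants, macos_major=15))
print(pick_variant(variants, macos_major=26))
```

With a scheme like this, a macOS 15 system would never be handed the Metal 4 build, which is the failure mode the comment above is concerned about.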
|
It's a fair point and worth considering.
I'll definitely take an action to check on the UX. My understanding is that linking against Metal 3.1 should work on all later Metal versions (indeed, that's [being tested in CI now](https://github.com/ChipFlow/kernels/actions/runs/22684099194/job/65762236187)). But yes, it'll get a bit complicated if we're also landing Neural Accelerator code.
I'll note that telemetry data shows there's still a sizable number of folk on macOS 14 (!), though this PR doesn't work on macOS 14 at the moment because the MPS encoder coalescing behavior that #316 relies on wasn't fully stabilized until macOS 15.
https://telemetrydeck.com/survey/apple/macOS/versions/
Any chance we can get macos-xlarge enabled for this repo? Being able to run CI properly will help with our overall confidence!
On Mon, 9 Mar 2026 at 10:47, Daniël de Kok (danieldk) left the comment quoted above (huggingface/kernels#308).
|
Use stream->commandEncoder() instead of creating encoders directly via [cmdBuf computeCommandEncoder] to properly integrate with PyTorch's MPS stream encoder lifecycle management (kernel coalescing). Direct encoder creation bypasses the stream's internal _commandEncoder state and crashes on sequential kernel dispatches.
Lower the default Metal standard from metal3.2 (macOS 15+) to metal3.1 (macOS 14+) since all current kernel features (bfloat16_t, simd_sum, simd_shuffle, threadgroup_barrier) are available in Metal 3.1.
Add multi-strategy Metal toolchain detection for macOS 14+:
- Separate Metal toolchain component (macOS 26+ cryptex mount)
- xcrun/xcode-select based detection
- Direct /Applications/Xcode*.app filesystem scan fallback
Also clear SDKROOT in xcrunHost to prevent Nix-set SDK paths from interfering with system xcrun.
Fixes: huggingface#307
Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
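The encoder conflict described in this commit can be illustrated with a toy Python model. This is not PyTorch's or Metal's actual implementation: the class and method names below are invented for illustration, and the only behavior modeled is the rule that a command buffer accepts one open encoder at a time. It shows one way the reported error can arise when code creates an encoder directly while the stream still holds one open for coalescing.

```python
class CommandBuffer:
    """Toy stand-in for a Metal command buffer: at most one open encoder."""

    def __init__(self):
        self.open_encoder = None
        self.dispatched = []

    def compute_command_encoder(self):
        # Models creating an encoder directly on the buffer.
        if self.open_encoder is not None:
            raise RuntimeError(
                "A command encoder is already encoding to this command buffer"
            )
        self.open_encoder = Encoder(self)
        return self.open_encoder


class Encoder:
    """Toy encoder that records dispatches on its parent buffer."""

    def __init__(self, buffer):
        self.buffer = buffer

    def dispatch(self, kernel_name):
        self.buffer.dispatched.append(kernel_name)

    def end_encoding(self):
        self.buffer.open_encoder = None


class Stream:
    """Toy stand-in for the stream, which owns and reuses the open encoder."""

    def __init__(self):
        self.buffer = CommandBuffer()
        self._encoder = None

    def command_encoder(self):
        # Reuse the cached encoder so consecutive kernels share it.
        if self._encoder is None:
            self._encoder = self.buffer.compute_command_encoder()
        return self._encoder

    def end_kernel_coalescing(self):
        if self._encoder is not None:
            self._encoder.end_encoding()
            self._encoder = None


stream = Stream()
stream.command_encoder().dispatch("kernel_a")
stream.command_encoder().dispatch("kernel_b")  # fine: same encoder is reused

try:
    stream.buffer.compute_command_encoder()  # bypasses the stream's cache
except RuntimeError as err:
    print("direct creation failed:", err)
```

Going through the stream's accessor keeps a single source of truth for the open encoder, which is the property the fix relies on.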
Test Metal kernel builds across multiple macOS versions to verify compatibility with the metal3.1 standard (macOS 14+). Use sandbox=relaxed for Nix to support __noChroot builds that access the host Metal toolchain. The separate Metal toolchain download is only needed on macOS 26+.
Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
macOS 14 builds succeed but MPS tests may OOM on runners with limited unified memory. Use continue-on-error so macos-14 failures don't block the workflow. Update Metal docs to reflect macOS 15+ as the supported baseline with macOS 14 best-effort.
Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Force-pushed from 0913ee4 to c810460.
Unless I'm looking at the stats wrong, it only seems to be around 3%?
I think xlarge should work with our subscription.
More generally, we discussed supporting multiple macOS versions in our weekly sync yesterday. It is certainly something we would like to support, especially to accommodate the few months where the majority of users have not yet moved from macOS N to N+1. In the last cycle we did this by holding off the N+1 requirement for some months, but that is not ideal, because it does not allow specific kernels to use features from N+1.
However, this requires changing the macOS build variant format to encode the Metal version, which is a large, impactful change. We think it makes most sense to time this with the macOS 27 release rather than change it mid-cycle.
You're right about 14.x — I went back and checked the raw data and realised I was looking at numbers from early 2025 when it was 14-16%. It's ~2.8% now. However, macOS 15.x is still at 32.3% as of late February. And 15.x users are notably stickier than 14.x was — 14.x was already down to ~8% at the same point after its successor launched. At current decay rate, 15.x won't reach ~3% until around April 2027. So the multi-version story is really about macOS 15 vs 26, not 14.
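As a rough check on the projection above, the two quoted data points (32.3% in late February, about 3% around April 2027) can be turned into an implied monthly decay factor. The sketch below assumes simple exponential decay and treats the interval as roughly 14 months; both are assumptions made here, not something the telemetry source states.

```python
import math


def implied_monthly_factor(start_pct, end_pct, months):
    """Constant month-over-month multiplier that takes start_pct to end_pct."""
    return (end_pct / start_pct) ** (1.0 / months)


def months_to_reach(start_pct, target_pct, monthly_factor):
    """Months until an exponentially decaying share falls to target_pct."""
    return math.log(target_pct / start_pct) / math.log(monthly_factor)


# Late February 2026 to April 2027 is taken as roughly 14 months (approximation).
factor = implied_monthly_factor(32.3, 3.0, 14)
print(f"implied monthly retention factor: {factor:.3f}")
print(f"months to reach 10%: {months_to_reach(32.3, 10.0, factor):.1f}")
```

Under these assumptions the factor comes out at about 0.84, meaning the 15.x share would shrink by roughly 16% of its remaining value each month.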
That would be great! Happy to help set up the workflow if useful.
Agreed, that's the right timing. By macOS 27 launch, 15.x will likely still be at 10-15%, so the build variant format change would land exactly when it matters. The |
Summary
- … [commandBuffer computeCommandEncoder], bypassing PyTorch's MPS stream encoder management (kernel coalescing). This causes a fatal crash when any kernel is called twice in sequence: "A command encoder is already encoding to this command buffer". Fixed by using stream->commandEncoder() from PyTorch's MPSStream API.
- … bfloat16_t, simd_sum, simd_shuffle, threadgroup_barrier) are available in Metal 3.1. Previous metal4.0 required macOS 26.
- … /Applications/Xcode*.app when xcrun/xcode-select are unavailable in the Nix sandbox.
- … continue-on-error) due to MPS OOM on 16 GB runners.
Files changed
- build2cmake/src/templates/metal/compile-metal.cmake: metal3.1 default, multi-strategy toolchain detection
- builder/lib/torch-extension/arch.nix: toolchain fallback for macOS 14/15, clear SDKROOT in xcrunHost
- template/__KERNEL_NAME_NORMALIZED___metal/__KERNEL_NAME_NORMALIZED__.mm: use MPSStream encoder API
- builder/examples/relu/relu_metal/relu.mm: same fix
- builder/examples/extra-data/relu_metal/relu.mm: same fix
- .github/workflows/build_kernel_macos.yaml: macOS version matrix, sandbox=relaxed, macos-14 best-effort
- docs/source/builder/metal.md: macOS version support table, updated requirements
Test plan
- fused-rms-norm kernel: 74/74 tests pass (was crashing on sequential calls)
- rotary-embedding kernel: 217/217 tests pass
Fixes #307
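The toolchain fallback described in the summary follows a "try each strategy in order" pattern. The PR implements it in CMake and Nix; the Python sketch below only illustrates the pattern. The helper names are hypothetical, the Xcode-internal toolchain path is an assumption about the usual Xcode layout, and the macOS 26 cryptex-mount strategy is left out because its location is not given in this thread.

```python
import glob
import os
import shutil
import subprocess
from typing import Callable, Iterable, Optional


def find_via_xcrun() -> Optional[str]:
    """Ask xcrun for the metal compiler, with SDKROOT removed from the environment."""
    if shutil.which("xcrun") is None:
        return None
    # Drop SDKROOT so a Nix-set SDK path cannot interfere with the system xcrun.
    env = {k: v for k, v in os.environ.items() if k != "SDKROOT"}
    result = subprocess.run(
        ["xcrun", "--find", "metal"], env=env, capture_output=True, text=True
    )
    path = result.stdout.strip()
    return path if result.returncode == 0 and path else None


def find_via_xcode_scan() -> Optional[str]:
    """Fall back to scanning /Applications/Xcode*.app for a metal binary."""
    # Assumed layout inside the Xcode bundle; adjust if your install differs.
    pattern = (
        "/Applications/Xcode*.app/Contents/Developer/Toolchains/"
        "XcodeDefault.xctoolchain/usr/bin/metal"
    )
    for candidate in sorted(glob.glob(pattern), reverse=True):
        if os.access(candidate, os.X_OK):
            return candidate
    return None


def first_found(strategies: Iterable[Callable[[], Optional[str]]]) -> Optional[str]:
    """Return the result of the first strategy that finds something."""
    for strategy in strategies:
        found = strategy()
        if found:
            return found
    return None


metal = first_found([find_via_xcrun, find_via_xcode_scan])
print(metal or "no Metal compiler found")
```

Keeping each strategy as a separate function makes it easy to add or reorder detection methods without touching the selection logic.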