Metal: stacked patches for MPS lifecycle, CI, and relu-metal-cpp fix #2

Open
robtaylor wants to merge 7 commits into main from metal-stack

robtaylor commented Mar 4, 2026

Summary

Stacked patches for Metal support improvements:

  1. Fix Metal MPS encoder lifecycle and broaden macOS compatibility (upstream PR huggingface/kernels#308)
  2. Add macOS CI matrix for macos-14, macos-15, and macos-26
  3. Fix MPS encoder lifecycle in relu-metal-cpp example (C bridge to ObjC++ MPS stream APIs)
  4. Mark macOS 14 as best-effort in CI and update Metal docs

CI status

| Runner | relu build | relu test | relu-metal-cpp build | relu-metal-cpp test |
|---|---|---|---|---|
| macos-26-xlarge | | | | |
| macos-15-xlarge | | | | |
| macos-14-xlarge | | ⚠️ OOM (best-effort) | - | - |

Use stream->commandEncoder() instead of creating encoders directly via
[cmdBuf computeCommandEncoder] to properly integrate with PyTorch's MPS
stream encoder lifecycle management (kernel coalescing). Direct encoder
creation bypasses the stream's internal _commandEncoder state and crashes
on sequential kernel dispatches.
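
To illustrate why the direct path fails, here is a toy Python model of the lifecycle described above. It is not PyTorch's or Metal's actual code: the classes `CommandBuffer`, `Encoder`, and `Stream` and all method names are hypothetical stand-ins. The only behavior it assumes is that a command buffer allows one open encoder at a time and that the stream caches its encoder in `_command_encoder` for coalescing.

```python
class Encoder:
    """Toy compute encoder that counts dispatches."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.dispatches = 0

    def dispatch(self):
        self.dispatches += 1

    def end_encoding(self):
        self.buffer.active_encoder = None


class CommandBuffer:
    """Toy command buffer: at most one encoder may be open at a time."""

    def __init__(self):
        self.active_encoder = None

    def compute_command_encoder(self):
        if self.active_encoder is not None:
            raise RuntimeError("an encoder is already open on this command buffer")
        self.active_encoder = Encoder(self)
        return self.active_encoder


class Stream:
    """Toy MPS stream that caches its encoder so kernels can be coalesced."""

    def __init__(self):
        self.command_buffer = CommandBuffer()
        self._command_encoder = None

    def command_encoder(self):
        # Reuse the cached encoder instead of opening a second one.
        if self._command_encoder is None:
            self._command_encoder = self.command_buffer.compute_command_encoder()
        return self._command_encoder


def run_two_kernels_via_stream(stream):
    """Both dispatches go through the stream and share one encoder."""
    for _ in range(2):
        stream.command_encoder().dispatch()
    return stream.command_encoder().dispatches


def run_kernel_bypassing_stream(stream):
    """An earlier kernel left the stream's encoder open; direct creation ignores it."""
    stream.command_encoder().dispatch()
    return stream.command_buffer.compute_command_encoder()
```

In this model, `run_two_kernels_via_stream(Stream())` completes with both dispatches on one encoder, while `run_kernel_bypassing_stream(Stream())` raises because the stream's cached encoder is still open.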

Lower the default Metal standard from metal3.2 (macOS 15+) to metal3.1
(macOS 14+) since all current kernel features (bfloat16_t, simd_sum,
simd_shuffle, threadgroup_barrier) are available in Metal 3.1.

Add multi-strategy Metal toolchain detection for macOS 14+:
- Separate Metal toolchain component (macOS 26+ cryptex mount)
- xcrun/xcode-select based detection
- Direct /Applications/Xcode*.app filesystem scan fallback

Also clear SDKROOT in xcrunHost to prevent Nix-set SDK paths from
interfering with system xcrun.
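
The fallback order above can be sketched in Python as a chain of strategies tried in turn. This is an illustrative sketch only, not the patch's implementation: the function names are hypothetical, the in-bundle compiler path is an assumption about a standard Xcode layout, and the separate-toolchain strategy is omitted because its mount location is not given here.

```python
import glob
import os
import shutil
import subprocess


def via_xcrun():
    """Ask xcrun for the metal compiler, with SDKROOT removed from the environment."""
    if shutil.which("xcrun") is None:
        return None
    env = {k: v for k, v in os.environ.items() if k != "SDKROOT"}
    try:
        result = subprocess.run(
            ["xcrun", "--find", "metal"],
            env=env, capture_output=True, text=True, check=False,
        )
    except OSError:
        return None
    path = result.stdout.strip()
    return path if result.returncode == 0 and path else None


def via_xcode_scan(pattern="/Applications/Xcode*.app"):
    """Fallback: look for a metal binary inside any installed Xcode bundle."""
    # Assumed location of the compiler inside an Xcode bundle.
    suffix = "Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/metal"
    for app in sorted(glob.glob(pattern)):
        candidate = os.path.join(app, suffix)
        if os.path.isfile(candidate):
            return candidate
    return None


def detect_metal(strategies):
    """Return (name, path) from the first strategy that finds a compiler, else None."""
    for name, strategy in strategies:
        path = strategy()
        if path:
            return name, path
    return None
```

A caller would pass the strategies in priority order, for example `detect_metal([("xcrun", via_xcrun), ("xcode-scan", via_xcode_scan)])`; later strategies are not run once an earlier one succeeds.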

Fixes: huggingface#307

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)

Test Metal kernel builds across multiple macOS versions to verify
compatibility with the metal3.1 standard (macOS 14+). Use sandbox=relaxed
for Nix to support __noChroot builds that access the host Metal toolchain.
The separate Metal toolchain download is only needed on macOS 26+.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)

Add C bridge functions (getMPSCommandEncoder, mpsSynchronize,
mpsDispatchSync) to metallib_loader.mm so the C++ metal-cpp example
can properly integrate with PyTorch's MPS stream encoder lifecycle
without needing ObjC++ code in the main kernel file.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)

macOS 14 builds succeed but MPS tests may OOM on runners with
limited unified memory. Use continue-on-error so macos-14 failures
don't block the workflow. Update Metal docs to reflect macOS 15+
as the supported baseline with macOS 14 best-effort.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)

- Added section on vLLM Metal integration (March 2026)
- Documented platform backend, attention backend, worker/runner status
- Noted smoke test passing, E2E validation in progress
- Listed key findings on MPS lazy evaluation and memory model
- Updated open questions to include vLLM performance baseline

This tracks the active E2E validation work toward closing the gap
between HF kernel ecosystem and llama.cpp on macOS.

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)

Allow per-kernel Metal standard version configuration via build.toml,
following the pattern of cuda-flags, hip-flags, and sycl-flags.

The default remains metal4.0 (upstream's current value). Kernels that
need broader macOS compatibility can set metal-std-version = "metal3.1"
(macOS 14+) or "metal3.2" (macOS 15+). AIR versions are forward-
compatible, so metal3.1 kernels run on Metal 4 hardware.

Changes:
- Add metal_std_version field to Kernel::Metal in config structs (v2, v3, mod)
- Pass field through Jinja template context to generated CMake
- Accept METAL_STD_VERSION in metal_kernel_component() and propagate
  to compile_metal_shaders() via parent scope
- Default to metal4.0 in compile-metal.cmake when not specified
- Set metal-std-version = "metal3.1" in relu-metal-cpp example for
  broad macOS 14+ compatibility

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)

Create a shared test utilities package that consolidates duplicated
device detection, tolerance tables, and allclose helpers across all
kernel repos. The package is automatically available in all kernel
dev/test shells via the default pythonCheckInputs.

Modules:
- device: get_device(), get_available_devices(), skip_if_no_gpu()
- tolerances: DEFAULT_TOLERANCES dict, get_tolerances(dtype)
- allclose: fp8_allclose() with MPS float64 workaround

Wired into nix overlay and set as default pythonCheckInputs in
genKernelFlakeOutputs so downstream repos get it automatically.
Updated template test to use kernels_test_utils imports.
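
A minimal sketch of how a shared tolerance table and comparison helper might fit together is shown below. The names `DEFAULT_TOLERANCES` and `get_tolerances` come from the module list above, but their signatures, the tolerance values, and the helper `allclose_for_dtype` are assumptions. To run without torch installed, dtypes are plain strings and inputs are Python lists. The comparison uses the usual rule `|actual - expected| <= atol + rtol * |expected|`.

```python
# Placeholder tolerances keyed by dtype name; the real package's values may differ.
DEFAULT_TOLERANCES = {
    "float32": {"atol": 1e-5, "rtol": 1e-5},
    "float16": {"atol": 1e-3, "rtol": 1e-3},
    "bfloat16": {"atol": 1e-2, "rtol": 1e-2},
}


def get_tolerances(dtype):
    """Look up the tolerance pair for a dtype name, failing loudly if unknown."""
    try:
        return DEFAULT_TOLERANCES[dtype]
    except KeyError:
        raise ValueError(f"no default tolerances for dtype {dtype!r}") from None


def allclose_for_dtype(actual, expected, dtype):
    """Elementwise closeness check using the dtype's default tolerances."""
    tol = get_tolerances(dtype)
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= tol["atol"] + tol["rtol"] * abs(e)
        for a, e in zip(actual, expected)
    )
```

Keeping the table in one place means a kernel test only names the dtype, for example `allclose_for_dtype(out, ref, "bfloat16")`, rather than repeating tolerance constants in each repo.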

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)