
Commit fd871bc

[Docs] Add Apple MPS (Metal) GPU installation guide

Add MPS as a GPU backend tab in the installation docs alongside CUDA,
ROCm, and XPU. Covers requirements, build from source, optional Metal
quantization kernels, usage examples, performance expectations, memory
guidelines, and troubleshooting. Update cpu.apple.inc.md to point to
the new GPU/MPS docs instead of the external vllm-metal project.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>

1 parent de4495b

5 files changed: +210 -18

docs/getting_started/installation/cpu.apple.inc.md (2 additions, 2 deletions)

```diff
@@ -5,8 +5,8 @@ vLLM has experimental support for macOS with Apple Silicon. For now, users must
 
 Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
 
-!!! tip "GPU-Accelerated Inference with vLLM-Metal"
-    For GPU-accelerated inference on Apple Silicon using Metal, check out [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend.
+!!! tip "GPU-Accelerated Inference with MPS"
+    For GPU-accelerated inference on Apple Silicon using Metal, see the [GPU installation guide](gpu.md) and select the "Apple MPS" tab.
 
 --8<-- [end:installation]
 --8<-- [start:requirements]
```

docs/getting_started/installation/gpu.md (43 additions, 5 deletions)

```diff
@@ -18,9 +18,13 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:installation"
+
 ## Requirements
 
-- OS: Linux
+- OS: Linux (CUDA, ROCm, XPU), macOS 15+ (MPS)
 - Python: 3.10 -- 3.13
 
 !!! note
@@ -38,6 +42,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:requirements"
+
 ## Set up using Python
 
 ### Create a new Python environment
@@ -56,6 +64,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:set-up-using-python"
+
 ### Pre-built wheels {#pre-built-wheels}
 
 === "NVIDIA CUDA"
@@ -70,6 +82,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:pre-built-wheels"
+
 ### Build wheel from source
 
 === "NVIDIA CUDA"
@@ -84,11 +100,16 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:build-wheel-from-source"
+
 ## Set up using Docker
 
 ### Pre-built images
 
---8<-- [start:pre-built-images]
+<!-- markdownlint-disable MD025 -->
+# --8<-- [start:pre-built-images]
 
 === "NVIDIA CUDA"
 
@@ -102,11 +123,19 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
 
---8<-- [end:pre-built-images]
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:pre-built-images"
+
+# --8<-- [end:pre-built-images]
+<!-- markdownlint-enable MD025 -->
 
+<!-- markdownlint-disable MD001 -->
 ### Build image from source
+<!-- markdownlint-enable MD001 -->
 
---8<-- [start:build-image-from-source]
+<!-- markdownlint-disable MD025 -->
+# --8<-- [start:build-image-from-source]
 
 === "NVIDIA CUDA"
 
@@ -120,7 +149,12 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
 
---8<-- [end:build-image-from-source]
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:build-image-from-source"
+
+# --8<-- [end:build-image-from-source]
+<!-- markdownlint-enable MD025 -->
 
 ## Supported features
 
@@ -135,3 +169,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "Intel XPU"
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
+
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:supported-features"
```
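The `--8<--` markers above delimit named sections that the docs build splices into `gpu.md` per backend tab. A rough sketch of how such named-section extraction works (a simplified stand-in, not the actual `pymdownx.snippets` implementation):

```python
import re

def extract_section(text: str, name: str) -> str:
    # Grab everything between the start and end markers for `name`.
    # Simplified stand-in for snippet section handling in the docs build.
    pattern = re.compile(
        r"--8<-- \[start:" + re.escape(name) + r"\]\n(.*?)"
        r"--8<-- \[end:" + re.escape(name) + r"\]",
        re.DOTALL,
    )
    m = pattern.search(text)
    return m.group(1).strip() if m else ""

doc = """--8<-- [start:requirements]
- OS: macOS 15 (Sequoia) or later
--8<-- [end:requirements]
"""
print(extract_section(doc, "requirements"))
# - OS: macOS 15 (Sequoia) or later
```

Missing sections simply yield an empty string, which mirrors how a bad section name produces an empty include rather than an error in this sketch.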
docs/getting_started/installation/gpu.mps.inc.md (new file, 150 additions)

````diff
@@ -0,0 +1,150 @@
+<!-- markdownlint-disable MD041 -->
+--8<-- [start:installation]
+
+vLLM has experimental support for GPU-accelerated inference on Apple Silicon using the MPS (Metal Performance Shaders) backend. This enables running LLM inference on the unified GPU in M1/M2/M3/M4 Macs.
+
+!!! warning "Experimental"
+    MPS support is under active development. Some features available on CUDA (PagedAttention, tensor parallelism, continuous batching for high-throughput serving) are not yet implemented. MPS is best suited for single-user local inference.
+
+--8<-- [end:installation]
+--8<-- [start:requirements]
+
+- Hardware: Apple Silicon Mac (M1, M2, M3, or M4 series)
+- OS: macOS 15 (Sequoia) or later
+- Memory: 16 GB unified memory minimum, 24+ GB recommended
+- Python: 3.10 -- 3.13
+- PyTorch: 2.9+ with MPS support
+
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
+
+There is no extra information on creating a new Python environment for this device.
+
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
+
+Currently, there are no pre-built MPS wheels. You must build from source.
+
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
+
+Clone and install from source:
+
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -e ".[dev]"
+```
+
+Verify MPS platform detection:
+
+```bash
+python -c "
+import torch
+print('MPS available:', torch.backends.mps.is_available())
+from vllm.platforms import current_platform
+print('Platform:', current_platform.device_type)
+"
+```
+
+### Installing Metal quantization kernels (optional)
+
+For accelerated INT4 (AWQ/GPTQ) and GGUF inference, build and install the Metal dequantization kernels. These require [Nix](https://determinate.systems/nix-installer/) to build.
+
+```bash
+# INT4 dequantization (AWQ + GPTQ)
+cd kernels-community/dequant-int4
+nix build
+cp -r result/torch*-metal-aarch64-darwin/ \
+  $(python -c "import site; print(site.getsitepackages()[0])")/dequant_int4/
+
+# GGUF dequantization (Q4_0, Q8_0, Q4_K, and more)
+cd ../dequant-gguf
+nix build
+cp -r result/torch*-metal-aarch64-darwin/ \
+  $(python -c "import site; print(site.getsitepackages()[0])")/dequant_gguf/
+```
+
+Without these kernels, quantized models still work but fall back to a slower PyTorch path.
+
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
+
+Docker is not applicable for MPS. macOS does not support GPU passthrough to containers.
+
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
+
+Docker is not applicable for MPS. macOS does not support GPU passthrough to containers.
+
+--8<-- [end:build-image-from-source]
+--8<-- [start:supported-features]
+
+### Running inference
+
+MPS requires spawn-based multiprocessing. Set the environment variable before running:
+
+```bash
+export VLLM_WORKER_MULTIPROC_METHOD=spawn
+```
+
+Example with a small model:
+
+```bash
+python -c "
+from vllm import LLM, SamplingParams
+llm = LLM(model='distilgpt2', dtype='float16', max_model_len=128)
+output = llm.generate(['Hello, world!'], SamplingParams(max_tokens=32))
+print(output[0].outputs[0].text)
+"
+```
+
+Example with a quantized model (requires the Metal kernels above):
+
+```bash
+python -c "
+from vllm import LLM, SamplingParams
+llm = LLM(model='Qwen/Qwen2.5-1.5B-Instruct-AWQ', dtype='float16',
+          max_model_len=512, quantization='awq')
+print(llm.generate(['Explain quantum computing.'],
+                   SamplingParams(max_tokens=64))[0].outputs[0].text)
+"
+```
+
+### Performance
+
+Typical throughput on Apple Silicon (varies by chip and memory):
+
+| Model            | Quantization | Throughput |
+| ---------------- | ------------ | ---------- |
+| GGUF small model | Q8_0         | ~62 tok/s  |
+| GGUF small model | Q4_0         | ~45 tok/s  |
+| Qwen2.5-1.5B     | INT4 AWQ     | ~17 tok/s  |
+| Qwen2.5-1.5B     | INT4 GPTQ    | ~16 tok/s  |
+
+### Memory guidelines
+
+MPS uses unified memory shared between the CPU and GPU. When the KV cache exceeds approximately 40% of system RAM, Metal's memory manager can thrash, causing 50-100x slowdowns.
+
+The default KV cache allocation is set conservatively to 25% of system RAM. On a 24 GB system this allows roughly 6 GB for KV cache. Adjust with `gpu_memory_utilization` if needed.
+
+### Known limitations
+
+- No PagedAttention on Metal (uses PyTorch SDPA)
+- No tensor parallelism (single GPU only)
+- No continuous batching optimizations
+- GGUF Q4_K_M models may be slow if the model uses Q6_K layers (numpy fallback)
+- `fork()` crashes on MPS -- `VLLM_WORKER_MULTIPROC_METHOD=spawn` is required
+
+### Troubleshooting
+
+**Slow inference (50-100x slower than expected)**:
+KV cache memory thrashing. Try a smaller model or set `gpu_memory_utilization=0.2`.
+
+**SIGSEGV during startup**:
+Set `VLLM_WORKER_MULTIPROC_METHOD=spawn`.
+
+**"No module named 'vllm.platforms.mps'"**:
+Ensure you are on the `mps-platform-support` branch.
+
+--8<-- [end:supported-features]
````
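The memory guidelines in the new file reduce to simple arithmetic on unified RAM. A sketch of that sizing rule, assuming the 25% default allocation and ~40% thrashing threshold stated in the guide (actual vLLM accounting may differ):

```python
def kv_cache_budget_gb(system_ram_gb: float,
                       utilization: float = 0.25,
                       thrash_fraction: float = 0.40) -> dict:
    """Estimate KV cache sizing on a unified-memory Mac.

    Illustrative only: the 25% default and ~40% thrashing threshold
    come from the guide above, not from vLLM's internal accounting.
    """
    budget = system_ram_gb * utilization            # default KV cache allocation
    thrash_limit = system_ram_gb * thrash_fraction  # slowdowns likely beyond this
    return {"budget_gb": budget, "thrash_limit_gb": thrash_limit}

print(kv_cache_budget_gb(24.0))
# {'budget_gb': 6.0, 'thrash_limit_gb': 9.6}
```

On a 24 GB machine the default stays well under the thrashing threshold; dropping `gpu_memory_utilization` further (the guide suggests 0.2 when troubleshooting) widens that margin.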

tests/v1/attention/test_mps_attn.py (9 additions, 8 deletions)

```diff
@@ -45,7 +45,10 @@ def create_kv_cache_hnd(
     dtype: torch.dtype,
     device: torch.device,
 ) -> torch.Tensor:
-    """Create KV cache in HND layout: (2, num_blocks, num_kv_heads, block_size, head_size)."""
+    """Create KV cache in HND layout.
+
+    Shape: (2, num_blocks, num_kv_heads, block_size, head_size).
+    """
     return torch.zeros(
         2,
         num_blocks,
@@ -102,7 +105,6 @@ def sdpa_reference(
     for i in range(len(seq_lens)):
         q_len = query_lens[i]
         s_len = seq_lens[i]
-        context_len = s_len - q_len
 
         q = query[q_start : q_start + q_len]  # [q_len, num_heads, head_size]
         # Full key/value includes context + query tokens
@@ -277,9 +279,6 @@ def test_attention_correctness(
     batch_spec = BATCH_SPECS[batch_name]
 
     num_tokens = sum(batch_spec.query_lens)
-    total_context_tokens = sum(
-        s - q for s, q in zip(batch_spec.seq_lens, batch_spec.query_lens)
-    )
 
     # Generate full Q, K, V for reference computation
     # Full K, V = context + query tokens for each sequence
@@ -479,7 +478,9 @@ def test_get_attn_backend_returns_mps(self):
         attention_config = AttentionConfig(backend=AttentionBackendEnum.MPS_ATTN)
         vllm_config = VllmConfig(attention_config=attention_config)
 
-        with set_current_vllm_config(vllm_config):
-            with patch("vllm.platforms.current_platform", MpsPlatform()):
-                backend = get_attn_backend(64, torch.float16, None)
+        with (
+            set_current_vllm_config(vllm_config),
+            patch("vllm.platforms.current_platform", MpsPlatform()),
+        ):
+            backend = get_attn_backend(64, torch.float16, None)
         assert backend.get_name() == "MPS_ATTN"
```
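The last hunk replaces nested `with` blocks with the parenthesized multi-context form added in Python 3.10, which is semantically identical to nesting. A minimal standalone sketch of the pattern, using a hypothetical `tracked` context manager to show enter/exit order:

```python
from contextlib import contextmanager

events = []

@contextmanager
def tracked(name):
    # Record enter/exit order so the combined form can be
    # compared against explicit nesting.
    events.append(f"enter:{name}")
    try:
        yield name
    finally:
        events.append(f"exit:{name}")

# Parenthesized context managers (Python 3.10+): entered left to right,
# exited right to left, exactly like nested `with` statements.
with (
    tracked("config"),
    tracked("platform"),
):
    events.append("body")

print(events)
# ['enter:config', 'enter:platform', 'body', 'exit:platform', 'exit:config']
```

The refactor in the test is purely stylistic: one indentation level fewer, same entry/exit semantics.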

vllm/v1/attention/backends/mps_attn.py (6 additions, 3 deletions)

```diff
@@ -292,7 +292,8 @@ def forward(
             blocks = block_table[i, :num_blocks_needed]
 
             # Gather K,V from paged cache
-            # key_cache[blocks]: [num_blocks_needed, num_kv_heads, block_size, head_size]
+            # key_cache[blocks]:
+            #     [num_blocks_needed, num_kv_heads, block_size, head_size]
             # Transpose to [num_kv_heads, num_blocks_needed, block_size, head_size]
             # then reshape to merge blocks×block_size into the sequence dim.
             k_paged = (
@@ -306,9 +307,11 @@ def forward(
                 .reshape(self.num_kv_heads, -1, self.head_size)[:, :seq_len, :]
             )
 
-            # query slice: [q_len, num_heads, head_size] -> [1, num_heads, q_len, head_size]
+            # query: [q_len, num_heads, head_size]
+            # -> [1, num_heads, q_len, head_size]
             q = query[q_start:q_end].transpose(0, 1).unsqueeze(0)
-            # k,v: [num_kv_heads, seq_len, head_size] -> [1, num_kv_heads, seq_len, head_size]
+            # k,v: [num_kv_heads, seq_len, head_size]
+            # -> [1, num_kv_heads, seq_len, head_size]
             k = k_paged.unsqueeze(0)
             v = v_paged.unsqueeze(0)
 
```