
Commit fd871bc

[Docs] Add Apple MPS (Metal) GPU installation guide

Add MPS as a GPU backend tab in the installation docs alongside CUDA,
ROCm, and XPU. Covers requirements, build from source, optional Metal
quantization kernels, usage examples, performance expectations, memory
guidelines, and troubleshooting. Update cpu.apple.inc.md to point to
the new GPU/MPS docs instead of the external vllm-metal project.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>

1 parent de4495b

5 files changed: +210 -18

docs/getting_started/installation/cpu.apple.inc.md (2 additions, 2 deletions)

```diff
@@ -5,8 +5,8 @@ vLLM has experimental support for macOS with Apple Silicon. For now, users must
 
 Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
 
-!!! tip "GPU-Accelerated Inference with vLLM-Metal"
-    For GPU-accelerated inference on Apple Silicon using Metal, check out [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend.
+!!! tip "GPU-Accelerated Inference with MPS"
+    For GPU-accelerated inference on Apple Silicon using Metal, see the [GPU installation guide](gpu.md) and select the "Apple MPS" tab.
 
 --8<-- [end:installation]
 --8<-- [start:requirements]
```

docs/getting_started/installation/gpu.md (43 additions, 5 deletions)

```diff
@@ -18,9 +18,13 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:installation"
+
 ## Requirements
 
-- OS: Linux
+- OS: Linux (CUDA, ROCm, XPU), macOS 15+ (MPS)
 - Python: 3.10 -- 3.13
 
 !!! note
@@ -38,6 +42,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:requirements"
+
 ## Set up using Python
 
 ### Create a new Python environment
@@ -56,6 +64,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:set-up-using-python"
+
 ### Pre-built wheels {#pre-built-wheels}
 
 === "NVIDIA CUDA"
@@ -70,6 +82,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:pre-built-wheels"
+
 ### Build wheel from source
 
 === "NVIDIA CUDA"
@@ -84,11 +100,16 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
 
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:build-wheel-from-source"
+
 ## Set up using Docker
 
 ### Pre-built images
 
---8<-- [start:pre-built-images]
+<!-- markdownlint-disable MD025 -->
+# --8<-- [start:pre-built-images]
 
 === "NVIDIA CUDA"
 
@@ -102,11 +123,19 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
 
---8<-- [end:pre-built-images]
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:pre-built-images"
+
+# --8<-- [end:pre-built-images]
+<!-- markdownlint-enable MD025 -->
 
+<!-- markdownlint-disable MD001 -->
 ### Build image from source
+<!-- markdownlint-enable MD001 -->
 
---8<-- [start:build-image-from-source]
+<!-- markdownlint-disable MD025 -->
+# --8<-- [start:build-image-from-source]
 
 === "NVIDIA CUDA"
 
@@ -120,7 +149,12 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
 
---8<-- [end:build-image-from-source]
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:build-image-from-source"
+
+# --8<-- [end:build-image-from-source]
+<!-- markdownlint-enable MD025 -->
 
 ## Supported features
 
@@ -135,3 +169,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "Intel XPU"
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
+
+=== "Apple MPS"
+
+    --8<-- "docs/getting_started/installation/gpu.mps.inc.md:supported-features"
```
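The `--8<--` markers above delimit named sections that the docs build splices into `gpu.md` per backend tab. A rough sketch of how such named-section extraction works (a simplified stand-in, not the actual `pymdownx.snippets` implementation):

```python
import re

def extract_section(text: str, name: str) -> str:
    # Grab everything between the start and end markers for `name`.
    # Simplified stand-in for snippet section handling in the docs build.
    pattern = re.compile(
        r"--8<-- \[start:" + re.escape(name) + r"\]\n(.*?)"
        r"--8<-- \[end:" + re.escape(name) + r"\]",
        re.DOTALL,
    )
    m = pattern.search(text)
    return m.group(1).strip() if m else ""

doc = """--8<-- [start:requirements]
- OS: macOS 15 (Sequoia) or later
--8<-- [end:requirements]
"""
print(extract_section(doc, "requirements"))
# - OS: macOS 15 (Sequoia) or later
```

Missing sections simply yield an empty string, which mirrors how a bad section name produces an empty include rather than an error in this sketch.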
docs/getting_started/installation/gpu.mps.inc.md (new file, 150 additions)

````diff
@@ -0,0 +1,150 @@
+<!-- markdownlint-disable MD041 -->
+--8<-- [start:installation]
+
+vLLM has experimental support for GPU-accelerated inference on Apple Silicon using the MPS (Metal Performance Shaders) backend. This enables running LLM inference on the unified GPU in M1/M2/M3/M4 Macs.
+
+!!! warning "Experimental"
+    MPS support is under active development. Some features available on CUDA (PagedAttention, tensor parallelism, continuous batching for high-throughput serving) are not yet implemented. MPS is best suited for single-user local inference.
+
+--8<-- [end:installation]
+--8<-- [start:requirements]
+
+- Hardware: Apple Silicon Mac (M1, M2, M3, or M4 series)
+- OS: macOS 15 (Sequoia) or later
+- Memory: 16 GB unified memory minimum, 24+ GB recommended
+- Python: 3.10 -- 3.13
+- PyTorch: 2.9+ with MPS support
+
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
+
+There is no extra information on creating a new Python environment for this device.
+
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
+
+Currently, there are no pre-built MPS wheels. You must build from source.
+
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
+
+Clone and install from source:
+
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -e ".[dev]"
+```
+
+Verify MPS platform detection:
+
+```bash
+python -c "
+import torch
+print('MPS available:', torch.backends.mps.is_available())
+from vllm.platforms import current_platform
+print('Platform:', current_platform.device_type)
+"
+```
+
+### Installing Metal quantization kernels (optional)
+
+For accelerated INT4 (AWQ/GPTQ) and GGUF inference, build and install the Metal dequantization kernels. These require [Nix](https://determinate.systems/nix-installer/) to build.
+
+```bash
+# INT4 dequantization (AWQ + GPTQ)
+cd kernels-community/dequant-int4
+nix build
+cp -r result/torch*-metal-aarch64-darwin/ \
+  $(python -c "import site; print(site.getsitepackages()[0])")/dequant_int4/
+
+# GGUF dequantization (Q4_0, Q8_0, Q4_K, and more)
+cd ../dequant-gguf
+nix build
+cp -r result/torch*-metal-aarch64-darwin/ \
+  $(python -c "import site; print(site.getsitepackages()[0])")/dequant_gguf/
+```
+
+Without these kernels, quantized models still work but fall back to a slower PyTorch path.
+
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
+
+Docker is not applicable for MPS. macOS does not support GPU passthrough to containers.
+
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
+
+Docker is not applicable for MPS. macOS does not support GPU passthrough to containers.
+
+--8<-- [end:build-image-from-source]
+--8<-- [start:supported-features]
+
+### Running inference
+
+MPS requires spawn-based multiprocessing. Set the environment variable before running:
+
+```bash
+export VLLM_WORKER_MULTIPROC_METHOD=spawn
+```
+
+Example with a small model:
+
+```bash
+python -c "
+from vllm import LLM, SamplingParams
+llm = LLM(model='distilgpt2', dtype='float16', max_model_len=128)
+output = llm.generate(['Hello, world!'], SamplingParams(max_tokens=32))
+print(output[0].outputs[0].text)
+"
+```
+
+Example with a quantized model (requires the Metal kernels above):
+
+```bash
+python -c "
+from vllm import LLM, SamplingParams
+llm = LLM(model='Qwen/Qwen2.5-1.5B-Instruct-AWQ', dtype='float16',
+          max_model_len=512, quantization='awq')
+print(llm.generate(['Explain quantum computing.'],
+                   SamplingParams(max_tokens=64))[0].outputs[0].text)
+"
+```
+
+### Performance
+
+Typical throughput on Apple Silicon (varies by chip and memory):
+
+| Model            | Quantization | Throughput |
+| ---------------- | ------------ | ---------- |
+| GGUF small model | Q8_0         | ~62 tok/s  |
+| GGUF small model | Q4_0         | ~45 tok/s  |
+| Qwen2.5-1.5B     | INT4 AWQ     | ~17 tok/s  |
+| Qwen2.5-1.5B     | INT4 GPTQ    | ~16 tok/s  |
+
+### Memory guidelines
+
+MPS uses unified memory shared between the CPU and GPU. When the KV cache exceeds approximately 40% of system RAM, Metal's memory manager can thrash, causing 50-100x slowdowns.
+
+The default KV cache allocation is set conservatively to 25% of system RAM. On a 24 GB system this allows roughly 6 GB for KV cache. Adjust with `gpu_memory_utilization` if needed.
+
+### Known limitations
+
+- No PagedAttention on Metal (uses PyTorch SDPA)
+- No tensor parallelism (single GPU only)
+- No continuous batching optimizations
+- GGUF Q4_K_M models may be slow if the model uses Q6_K layers (numpy fallback)
+- `fork()` crashes on MPS -- `VLLM_WORKER_MULTIPROC_METHOD=spawn` is required
+
+### Troubleshooting
+
+**Slow inference (50-100x slower than expected)**:
+KV cache memory thrashing. Try a smaller model or set `gpu_memory_utilization=0.2`.
+
+**SIGSEGV during startup**:
+Set `VLLM_WORKER_MULTIPROC_METHOD=spawn`.
+
+**"No module named 'vllm.platforms.mps'"**:
+Ensure you are on the `mps-platform-support` branch.
+
+--8<-- [end:supported-features]
````
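The memory guidelines in the new file reduce to simple arithmetic on unified RAM. A sketch of that sizing rule, assuming the 25% default allocation and ~40% thrashing threshold stated in the guide (actual vLLM accounting may differ):

```python
def kv_cache_budget_gb(system_ram_gb: float,
                       utilization: float = 0.25,
                       thrash_fraction: float = 0.40) -> dict:
    """Estimate KV cache sizing on a unified-memory Mac.

    Illustrative only: the 25% default and ~40% thrashing threshold
    come from the guide above, not from vLLM's internal accounting.
    """
    budget = system_ram_gb * utilization            # default KV cache allocation
    thrash_limit = system_ram_gb * thrash_fraction  # slowdowns likely beyond this
    return {"budget_gb": budget, "thrash_limit_gb": thrash_limit}

print(kv_cache_budget_gb(24.0))
# {'budget_gb': 6.0, 'thrash_limit_gb': 9.6}
```

On a 24 GB machine the default stays well under the thrashing threshold; dropping `gpu_memory_utilization` further (the guide suggests 0.2 when troubleshooting) widens that margin.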

tests/v1/attention/test_mps_attn.py (9 additions, 8 deletions)

```diff
@@ -45,7 +45,10 @@ def create_kv_cache_hnd(
     dtype: torch.dtype,
     device: torch.device,
 ) -> torch.Tensor:
-    """Create KV cache in HND layout: (2, num_blocks, num_kv_heads, block_size, head_size)."""
+    """Create KV cache in HND layout.
+
+    Shape: (2, num_blocks, num_kv_heads, block_size, head_size).
+    """
     return torch.zeros(
         2,
         num_blocks,
@@ -102,7 +105,6 @@ def sdpa_reference(
     for i in range(len(seq_lens)):
         q_len = query_lens[i]
         s_len = seq_lens[i]
-        context_len = s_len - q_len
 
         q = query[q_start : q_start + q_len]  # [q_len, num_heads, head_size]
         # Full key/value includes context + query tokens
@@ -277,9 +279,6 @@ def test_attention_correctness(
     batch_spec = BATCH_SPECS[batch_name]
 
     num_tokens = sum(batch_spec.query_lens)
-    total_context_tokens = sum(
-        s - q for s, q in zip(batch_spec.seq_lens, batch_spec.query_lens)
-    )
 
     # Generate full Q, K, V for reference computation
     # Full K, V = context + query tokens for each sequence
@@ -479,7 +478,9 @@ def test_get_attn_backend_returns_mps(self):
         attention_config = AttentionConfig(backend=AttentionBackendEnum.MPS_ATTN)
         vllm_config = VllmConfig(attention_config=attention_config)
 
-        with set_current_vllm_config(vllm_config):
-            with patch("vllm.platforms.current_platform", MpsPlatform()):
-                backend = get_attn_backend(64, torch.float16, None)
+        with (
+            set_current_vllm_config(vllm_config),
+            patch("vllm.platforms.current_platform", MpsPlatform()),
+        ):
+            backend = get_attn_backend(64, torch.float16, None)
         assert backend.get_name() == "MPS_ATTN"
```
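The last hunk replaces nested `with` blocks with the parenthesized multi-context form added in Python 3.10, which is semantically identical to nesting. A minimal standalone sketch of the pattern, using a hypothetical `tracked` context manager to show enter/exit order:

```python
from contextlib import contextmanager

events = []

@contextmanager
def tracked(name):
    # Record enter/exit order so the combined form can be
    # compared against explicit nesting.
    events.append(f"enter:{name}")
    try:
        yield name
    finally:
        events.append(f"exit:{name}")

# Parenthesized context managers (Python 3.10+): entered left to right,
# exited right to left, exactly like nested `with` statements.
with (
    tracked("config"),
    tracked("platform"),
):
    events.append("body")

print(events)
# ['enter:config', 'enter:platform', 'body', 'exit:platform', 'exit:config']
```

The refactor in the test is purely stylistic: one indentation level fewer, same entry/exit semantics.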

vllm/v1/attention/backends/mps_attn.py (6 additions, 3 deletions)

```diff
@@ -292,7 +292,8 @@ def forward(
             blocks = block_table[i, :num_blocks_needed]
 
             # Gather K,V from paged cache
-            # key_cache[blocks]: [num_blocks_needed, num_kv_heads, block_size, head_size]
+            # key_cache[blocks]:
+            #     [num_blocks_needed, num_kv_heads, block_size, head_size]
             # Transpose to [num_kv_heads, num_blocks_needed, block_size, head_size]
             # then reshape to merge blocks×block_size into the sequence dim.
             k_paged = (
@@ -306,9 +307,11 @@ def forward(
                 .reshape(self.num_kv_heads, -1, self.head_size)[:, :seq_len, :]
             )
 
-            # query slice: [q_len, num_heads, head_size] -> [1, num_heads, q_len, head_size]
+            # query: [q_len, num_heads, head_size]
+            # -> [1, num_heads, q_len, head_size]
             q = query[q_start:q_end].transpose(0, 1).unsqueeze(0)
-            # k,v: [num_kv_heads, seq_len, head_size] -> [1, num_kv_heads, seq_len, head_size]
+            # k,v: [num_kv_heads, seq_len, head_size]
+            # -> [1, num_kv_heads, seq_len, head_size]
             k = k_paged.unsqueeze(0)
             v = v_paged.unsqueeze(0)
 
```