---
layout: post
title: "Zero‑Reload Model Switching with vLLM Sleep Mode"
author: "Embedded LLM"
---

Serve multiple LLMs on a single GPU **without increasing VRAM**, and finish faster, by combining **vLLM Sleep Mode** with a **1‑token warm‑up**.

## TL;DR

- **Problem:** Stage‑based pipelines often need to swap between different models. Reloading models each round burns time and VRAM.
- **Solution:** vLLM **Sleep Mode** offloads weights (and optionally frees them) so you can **wake** a model in ~0.1–0.4 s, run a **1‑token warm‑up**, and then hit steady‑state latency on the first user request.
- **Result:** Across six alternating prompts (two models, three rounds), **Sleep Mode + warm‑up** is **~2.7×** faster than reloading models each time (**69.2 s vs 188.9 s**), with prefix caching **disabled**.

<img align="center" src="/assets/figures/sleep-mode/First-prompt-latency.png" alt="First prompt latency" width="90%" height="90%">

<img align="center" src="/assets/figures/sleep-mode/total-time-ratio.png" alt="Total time ratio" width="90%" height="90%">

Figures 1 & 2 — End-to-end time & first-prompt latency (lower is better).
Two models, 6 alternating prompts; Sleep L1 + 1-token warm-up. Total time: 69.2 s vs 188.9 s (~2.7× faster). First prompt (after wake): 4.12 s → 0.96 s (–77%).

## Why Sleep Mode

**The problem.** In multi‑stage pipelines (e.g., classify → caption → rerank), different steps call different models. On a single GPU, naïve orchestration reloads models from disk each time, repeatedly incurring initialization, allocator priming, and other cold‑start costs. Limited VRAM also prevents keeping multiple models resident.

**Sleep Mode (CUDA‑only).** vLLM’s Sleep Mode keeps the server up while reducing GPU footprint:

- **Level 1** — Offload model **weights to CPU RAM** and discard the **KV cache**.
  - Use when you plan to switch **back to the same model**.
  - Wake is fast because weights are already in host memory.
  - **API:** `POST /sleep?sleep_level=1` then `POST /wake_up`
- **Level 2** — Discard **both** weights and KV cache (free on CPU + GPU).
  - Use when switching to **a different model** or **updating weights**; the next wake will reload from disk.
  - **API:** `POST /sleep?sleep_level=2` then `POST /wake_up`

> Works on multi‑GPU too. If you serve large models with tensor parallelism (TP) or pipeline parallelism (PP), Sleep Mode offloads/frees each partition across devices in the same way. The control endpoints are identical; the speed‑up dynamics remain (wake is cheap, reload is expensive), only the absolute times change.

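If you prefer to drive these endpoints from Python (as the full script later in this post does), they reduce to two tiny helpers. A minimal sketch, assuming a single server launched as in the Quickstart below (localhost:8000, `--enable-sleep-mode`, `VLLM_SERVER_DEV_MODE=1`); the helper names are ours, not part of vLLM:

```python
# Minimal sketch of the control calls described above (helper names are ours).
# Assumes a server started with --enable-sleep-mode and VLLM_SERVER_DEV_MODE=1,
# listening on localhost:8000 as in the Quickstart below.
import requests

BASE = "http://localhost:8000"

def sleep(level: int = 1) -> None:
    # Level 1: weights -> CPU RAM, KV cache discarded (fast wake, same model).
    # Level 2: weights + KV cache freed; the next wake reloads from disk.
    requests.post(f"{BASE}/sleep", params={"sleep_level": level}, timeout=30).raise_for_status()

def wake() -> None:
    requests.post(f"{BASE}/wake_up", timeout=30).raise_for_status()
```
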
## Setup

**Hardware**

- CPU: AMD Ryzen 9 7900X
- GPU: **RTX A4000** (PCIe 4.0 x16)
- RAM: **64 GB** DDR5

**Software**

- vLLM **0.10.0** (older versions may work similarly)
- CUDA (required; Sleep Mode is CUDA‑only)

**Models**

- `Qwen/Qwen3-0.6B` (text, compact reasoning)
- `HuggingFaceTB/SmolVLM2-2.2B-Instruct` (lightweight VLM)

**Common flags**

- `--enable-sleep-mode`
- `--no-enable-prefix-caching` _(prefix caching disabled for fairness)_
- `--compilation-config '{"full_cuda_graph": true}'` _(baseline runs)_
- `--dtype auto` _(bf16 if supported; otherwise fp16)_
- `--trust-remote-code` _(needed by many VLMs, including SmolVLM2)_

> Quantization: none for these runs. If you quantize, the absolute latencies change but the patterns (cold‑start vs steady‑state) remain.

## Method

**Scenario.** Two models, three alternating rounds (A→B→A→B→A→B). One prompt per model per round.

**Policy.** _(Sketched in code below.)_

- **Sleep Mode runs:**
  - Load both models once (steady state), then for each turn:
    - **wake → warm‑up (1 token) → prompt → sleep (L1)**
  - We use **Sleep Level 1** because we switch back to the same models later.
- **No‑sleep baseline:**
  - **Reload the needed model from disk** each time.
  - No explicit warm‑up call: the cold‑start cost is **embedded** into the **first prompt** latency on every round.

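As a concrete picture of the Sleep Mode policy, here is a minimal sketch of the alternating driver loop. It reuses the `cycle()` helper and the URL/model constants from the `two_model_sleep_quickstart.py` script shown later in this post, and it assumes both servers from that script are already up; it illustrates the benchmark shape rather than reproducing the exact harness.

```python
# Sketch of the Sleep Mode benchmark loop: two models, three rounds,
# one prompt per model per round. Reuses cycle(), A_URL/A_MODEL, B_URL/B_MODEL
# from the two_model_sleep_quickstart.py script in the Quickstart below.
from two_model_sleep_quickstart import cycle, A_URL, A_MODEL, B_URL, B_MODEL

ROUNDS = 3
PROMPT_A = "Give me a fun fact about the Moon."
PROMPT_B = "Describe an image pipeline in one line."

for r in range(ROUNDS):
    print(f"\n[round {r + 1}: model A]")
    cycle(A_URL, A_MODEL, PROMPT_A, sleep_level=1)  # wake → warm-up → prompt → sleep (L1)
    print(f"\n[round {r + 1}: model B]")
    cycle(B_URL, B_MODEL, PROMPT_B, sleep_level=1)
```
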
**Controls.**

- Prefix caching **disabled** (`--no-enable-prefix-caching`).
- Same prompts across runs.
- Measured total wall time; TTFT observed but not used to drive control flow.
- CUDA Graph **enabled** in baseline; other ablations (eager mode, CG off) did not remove the cold‑start spike.
- Concurrency: single request at a time, to isolate first‑prompt effects.

## Results

### Canonical end‑to‑end comparison

**Condition:** _Sleep Mode (L1), warm‑up ON, CUDA Graph ON, eager OFF, prefix caching OFF._

**Workload:** _Two models, three rounds, one prompt per turn, single‑threaded._

- **Sleep + warm‑up:** **69.2 s** total
- **No‑sleep:** **188.9 s** total
- **Speed‑up:** **~2.7×**

### Warm‑up removes the first‑prompt spike

**Condition:** _Second model’s first inference after wake; Sleep Mode (L1); prefix caching OFF._

A single **1‑token warm‑up** right after `wake_up` reduces the first‑prompt latency from **4.12 s to 0.96 s** (a **77%** reduction). Subsequent prompts stay at steady state; you pay the warm‑up once per wake.

> Why the big gap vs no‑sleep? In no‑sleep, you reload from disk every round, and the cold‑start cost is repaid repeatedly because there’s no persistent server state. In Sleep Mode, you pay a small wake + warm‑up and keep the process hot.

## What causes the first‑prompt spike?

It’s not (just) token length; it’s general **cold‑start work** concentrated in the first request:

- CUDA runtime/driver initialization paths
- TorchInductor graph specialization and/or JIT compilation
- CUDA Graph capture (if enabled)
- Memory allocator priming and graphable pool setup
- Prefill‑path specialization (e.g., attention mask/layout)

Flipping **CUDA Graph off** or **enforce‑eager on** didn’t eliminate the spike in our tests; both still need allocator setup and prefill specialization. A **1‑token warm‑up** absorbs these costs so user‑visible requests start in steady state.

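If you want to see the spike on your own hardware, timing two identical requests immediately after a wake is enough: the first carries the cold‑start work, the second is steady state. A minimal sketch, assuming the single‑server Quickstart below (`Qwen/Qwen3-0.6B` on port 8000) and the `openai` client:

```python
# Sketch: observe the first-prompt spike by timing two identical requests
# right after /wake_up. Assumes the Quickstart server below (port 8000).
import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def timed_request(label: str) -> None:
    t0 = time.time()
    client.chat.completions.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "Give me a fun fact about the Moon."}],
        max_tokens=32,
        temperature=0.0,
    )
    print(f"{label:<16} {time.time() - t0:.2f}s")

timed_request("first (cold)")     # includes the cold-start work a warm-up would absorb
timed_request("second (steady)")  # steady-state latency
```
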
## Quickstart

**The fastest way to use Sleep Mode today—just what you need, nothing else.**

```bash
# 1) Install
pip install vllm==0.10.0

# 2) Start a server (CUDA only)
export HF_TOKEN=...
export VLLM_SERVER_DEV_MODE=1  # dev endpoints; run behind your proxy
vllm serve Qwen/Qwen3-0.6B \
  --enable-sleep-mode \
  --no-enable-prefix-caching \
  --port 8000
```

**Sleep / Wake**

```bash
# Sleep Level 1 (weights → CPU, KV cache cleared)
curl -X POST 'localhost:8000/sleep?sleep_level=1'

# Wake
curl -X POST 'localhost:8000/wake_up'
```

**Warm‑up (1 token) + prompt**

```bash
# Warm-up: absorbs cold-start; keeps first user request fast
curl localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role":"user","content":"warm-up"}],
  "max_tokens": 1, "temperature": 0, "top_p": 1, "seed": 0
}'

# Real prompt
curl localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role":"user","content":"Give me a fun fact about the Moon."}],
  "max_tokens": 32, "temperature": 0, "top_p": 1, "seed": 0
}'
```

> Multi‑GPU note: Works with TP/PP as well; Sleep Mode offloads/frees each partition the same way. The endpoints and workflow don’t change.

---

A minimal script that launches **two** vLLM servers (Sleep Mode enabled), then runs one full cycle on each model:

- **wake → warm‑up (1 token) → prompt → sleep**

## Notes

- Endpoints for sleeping/waking are **outside** `/v1`: use `POST /sleep?sleep_level=...` and `POST /wake_up`.
- This example uses **Sleep Level 1**. Change to `sleep_level=2` when you won’t switch back soon or want to reclaim CPU RAM.
- Logging prints timings for **wake / warm‑up / prompt / sleep** so you can see the first‑prompt drop.

```python
# two_model_sleep_quickstart.py
# Minimal quickstart for Sleep Mode + Warm‑up in vLLM (two models)

import os, time, subprocess, requests
from contextlib import contextmanager
from openai import OpenAI

A_MODEL = "Qwen/Qwen3-0.6B"
B_MODEL = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
A_PORT, B_PORT = 8001, 8002
A_URL, B_URL = f"http://localhost:{A_PORT}", f"http://localhost:{B_PORT}"

COMMON = [
    "--enable-sleep-mode",
    "--no-enable-prefix-caching",
    "--dtype", "auto",
    "--compilation-config", '{"full_cuda_graph": true}',
]

def run_vllm(model, port, extra_flags=None):
    flags = extra_flags or []
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model, "--port", str(port),
        *COMMON, *flags,
    ]
    env = os.environ.copy()
    env.setdefault("VLLM_SERVER_DEV_MODE", "1")  # exposes /sleep and /wake_up (see Quickstart)
    return subprocess.Popen(cmd, env=env)

def wait_ready(url, timeout=120):
    t0 = time.time()
    while time.time() - t0 < timeout:
        try:
            if requests.get(url + "/health", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(1)
    raise RuntimeError(f"Server at {url} not ready in {timeout}s")

def client(base_url):
    # vLLM OpenAI-compatible endpoint is served under /v1
    return OpenAI(api_key="EMPTY", base_url=base_url + "/v1")

def post(url, path):
    r = requests.post(url + path, timeout=10)
    r.raise_for_status()

@contextmanager
def timed(label):
    t0 = time.time()
    yield
    dt = time.time() - t0
    print(f"{label:<18} {dt:.2f}s")

def warmup_call(url, model):
    # 1-token warm‑up to absorb cold-start
    client(url).chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "warm-up"}],
        max_tokens=1,
        temperature=0.0,
        top_p=1.0,
        extra_body={"seed": 0},
    )

def user_prompt(url, model, text, max_tokens=32):
    resp = client(url).chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
        max_tokens=max_tokens,
        temperature=0.0,
        top_p=1.0,
        extra_body={"seed": 0},
    )
    return resp.choices[0].message.content

def cycle(url, model, text, sleep_level=1):
    with timed("wake"):
        post(url, "/wake_up")

    with timed("warm-up"):
        warmup_call(url, model)

    with timed("prompt"):
        out = user_prompt(url, model, text)
        print("→", out.strip())

    with timed(f"sleep(L{sleep_level})"):
        post(url, f"/sleep?sleep_level={sleep_level}")

if __name__ == "__main__":
    # SmolVLM2 needs trust-remote-code
    a = run_vllm(A_MODEL, A_PORT)
    b = run_vllm(B_MODEL, B_PORT, extra_flags=["--trust-remote-code"])

    try:
        wait_ready(A_URL); wait_ready(B_URL)
        print("\n[A cycle]")
        cycle(A_URL, A_MODEL, "Give me a fun fact about the Moon.", sleep_level=1)

        print("\n[B cycle]")
        cycle(B_URL, B_MODEL, "Describe an image pipeline in one line.", sleep_level=1)

    finally:
        a.terminate(); b.terminate()
        try:
            a.wait(timeout=5)
        except Exception:
            a.kill()
        try:
            b.wait(timeout=5)
        except Exception:
            b.kill()
```

**Run**

```bash
python two_model_sleep_quickstart.py
```

You’ll see logs like:

```text
[A cycle]
wake               0.12s
warm-up            0.96s
prompt             2.55s
→ The Moon’s day is about 29.5 Earth days.
sleep(L1)          0.33s
...
```

## Limits & when _not_ to use it

- **CPU RAM bound.** Level‑1 offloads **weights to CPU**. Reserve roughly `param_count × bytes_per_param` (bf16 ≈ 2 bytes, fp16 ≈ 2 bytes) **plus overhead**; see the sketch after this list.
  - Example: 2.2B params × 2 bytes ≈ **4.4 GB** baseline; expect ~5–6 GB with overheads.
- **Level‑2 reload penalty.** L2 frees CPU+GPU memory; the next **wake** reloads from disk and pays the full cold‑start again. Use L2 only when you won’t switch back soon.
- **CUDA‑only.** Sleep Mode isn’t available on ROCm or CPU‑only backends (as of v0.10.0).
- **Heavy concurrency.** Warm‑up is cheap but still a request—run it once per wake, not per thread. Many concurrent first‑requests can stampede into cold‑start work; serialize the warm‑up or gate the first request.

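To budget host memory before enabling Level 1, the rule of thumb above can be turned into a quick estimate. This is a rough sketch; the 1.3× overhead factor is our assumption, not a measured value:

```python
# Back-of-the-envelope CPU RAM estimate for parking weights under Sleep Level 1.
# The 1.3x overhead factor is an assumption; real usage varies by dtype and allocator.
def l1_cpu_ram_gb(params_billion: float, bytes_per_param: float = 2.0, overhead: float = 1.3) -> float:
    baseline_gb = params_billion * bytes_per_param  # e.g. 2.2B params × 2 bytes ≈ 4.4 GB
    return baseline_gb * overhead

print(f"{l1_cpu_ram_gb(2.2):.1f} GB")  # ≈ 5.7 GB, consistent with the ~5–6 GB estimate above
```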
