Commit 099c046

[Doc] Sleep mode documentation (#22310)
Signed-off-by: iAmir97 <[email protected]>
Co-authored-by: iAmir97 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Hong Hanh <[email protected]>
Co-authored-by: youkaichao <[email protected]>
1 parent af473f0 commit 099c046

File tree

1 file changed: docs/features/sleep_mode.md (+80, -0)
# Sleep Mode

vLLM's Sleep Mode allows you to temporarily release most of the GPU memory used by a model, including the model weights and KV cache, without stopping the server or unloading the container. This is especially useful for RLHF, training, or cost-saving scenarios where GPU resources need to be freed between inference workloads.

Key benefits:

- **Frees GPU memory**: Offloads model weights to CPU RAM and discards the KV cache, releasing up to 90%+ of GPU memory for other tasks.
- **Fast resume**: Quickly wakes the engine and resumes inference without a full model reload.
- **API endpoints**: Controls the sleep/wake_up state via HTTP endpoints or the Python API.
- **Supports distributed workloads**: Works with tensor parallelism, pipeline parallelism, etc.
- **Fine-grained control**: Optionally wakes up only the model weights or the KV cache to avoid OOM during weight updates.
13+
!!! note
14+
This feature is only supported on CUDA platform.
15+
16+
## Sleep levels
17+
18+
Level 1 sleep will offload the model weights and discard the KV cache. The content of KV cache is forgotten. Level 1 sleep is good for sleeping and waking up the engine to run the same model again. The model weights are backed up in CPU memory. Please make sure there's enough CPU memory to store the model weights. Level 2 sleep will discard both the model weights and the KV cache (while the model's buffers are kept in CPU, like rope scaling tensors). The content of both the model weights and KV cache is forgotten. Level 2 sleep is good for sleeping and waking up the engine to run a different model or update the model, where previous model weights are not needed, e.g. RLHF weight update.
19+
20+
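The difference between the two levels can be pictured as a toy state machine that tracks what survives each transition. This is an illustration only, assuming nothing about vLLM internals; `ToyEngineState` and its fields are hypothetical names invented for this sketch:

```python
from dataclasses import dataclass


@dataclass
class ToyEngineState:
    """Toy illustration of what each sleep level keeps (not vLLM internals)."""

    weights_on_gpu: bool = True
    weights_in_cpu_backup: bool = False
    kv_cache_alive: bool = True
    buffers_in_cpu: bool = True  # e.g. rope scaling tensors, kept at both levels

    def sleep(self, level: int) -> None:
        self.kv_cache_alive = False  # the KV cache is discarded at both levels
        self.weights_on_gpu = False
        if level == 1:
            self.weights_in_cpu_backup = True   # level 1: weights backed up in CPU RAM
        elif level == 2:
            self.weights_in_cpu_backup = False  # level 2: weights are discarded too
        else:
            raise ValueError("level must be 1 or 2")


state = ToyEngineState()
state.sleep(level=1)
print(state.weights_in_cpu_backup, state.kv_cache_alive)  # True False

state = ToyEngineState()
state.sleep(level=2)
print(state.weights_in_cpu_backup, state.kv_cache_alive)  # False False
```

Either way, a subsequent wake-up must rebuild whatever was discarded, which is why level 2 suits switching or updating models rather than resuming the same one.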
## Usage

### Offline inference

Enable sleep mode by passing `enable_sleep_mode=True` to the `LLM` class.

```python
from vllm import LLM

llm = LLM("Qwen/Qwen3-0.6B", enable_sleep_mode=True)
```
#### Python API

```python
# Put the engine to sleep (level=1: offload weights to CPU RAM, discard KV cache)
llm.sleep(level=1)

# Wake up the engine (restore weights)
llm.wake_up()
```
#### RLHF weight updates

During RLHF training, vLLM allows you to selectively wake up only the model weights or the KV cache using the `tags` argument of `wake_up()`. This fine-grained control is especially useful when updating model weights: by waking up just the weights (e.g., `llm.wake_up(tags=["weights"])`), you avoid allocating memory for the KV cache until after the weight update is complete. This helps prevent GPU out-of-memory (OOM) errors, particularly with large models, by minimizing peak memory usage during weight synchronization and update operations.

Use `tags=["weights"]` or `tags=["kv_cache"]` to control which resources are restored. **Note** that `is_sleeping` will report `true` until all components are awake.
```python
# Put the engine into deep sleep (level=2)
llm.sleep(level=2)

# ... get the new weights ...

# Wake up only the weights to avoid OOM
llm.wake_up(tags=["weights"])

# ... update the weights ...

# Wake up the KV cache after the weights are updated
llm.wake_up(tags=["kv_cache"])
```
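Since `is_sleeping` reports `true` until every component is awake, a caller that wakes components in stages may want to poll before sending traffic. A minimal sketch of such a helper, which is hypothetical and not part of vLLM (it accepts any zero-argument `is_sleeping` callable, so it works against either the Python API or an HTTP status check):

```python
import time
from typing import Callable


def wait_until_awake(
    is_sleeping: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
) -> bool:
    """Poll `is_sleeping` until it returns False or the timeout expires.

    Returns True if the engine woke up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_sleeping():
            return True
        time.sleep(poll_interval_s)
    return not is_sleeping()


# Example with a stub that "wakes up" on the third poll:
calls = {"n": 0}


def stub_is_sleeping() -> bool:
    calls["n"] += 1
    return calls["n"] < 3


print(wait_until_awake(stub_is_sleeping, timeout_s=5.0, poll_interval_s=0.01))  # True
```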
### Online Serving

To enable sleep mode in a vLLM server, start it with the environment variable `VLLM_SERVER_DEV_MODE=1` and pass `--enable-sleep-mode` to the server.

#### Server in development mode

Setting `VLLM_SERVER_DEV_MODE=1` enables development endpoints; these endpoints should not be exposed to users.
```bash
VLLM_SERVER_DEV_MODE=1 python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-0.6B \
    --enable-sleep-mode \
    --port 8000
```
#### HTTP endpoints

- `POST /sleep?level=1` — Put the model to sleep (`level=1`).
- `POST /wake_up` — Wake up the model. Supports optional `tags` query parameters for partial wake-up (e.g., `?tags=weights`).
- `GET /is_sleeping` — Check whether the model is sleeping.

!!! note
    These endpoints are only available when the server is started with `VLLM_SERVER_DEV_MODE=1`.
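As a sketch of how a client might address these endpoints, the snippet below builds the request URLs with the standard library. It assumes a server at `localhost:8000`; the helper functions are hypothetical, not part of vLLM, and you would send the resulting URLs with `POST` (or `GET` for `/is_sleeping`) using any HTTP client:

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "http://localhost:8000"  # assumed server address


def sleep_url(level: int = 1) -> str:
    # POST to this URL to put the model to sleep at the given level
    return f"{BASE}/sleep?{urlencode({'level': level})}"


def wake_up_url(tags: Optional[list] = None) -> str:
    # Repeat the `tags` query parameter once per tag for partial wake-up
    query = urlencode([("tags", t) for t in (tags or [])])
    return f"{BASE}/wake_up" + (f"?{query}" if query else "")


print(sleep_url(level=2))        # http://localhost:8000/sleep?level=2
print(wake_up_url(["weights"]))  # http://localhost:8000/wake_up?tags=weights
print(wake_up_url())             # http://localhost:8000/wake_up
```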
