To learn more about vLLM sleep mode, please visit the official vLLM blog post: vLLM Sleep Mode
The experiment scripts for sleep mode are in this directory.
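As a rough illustration of the kind of measurement these experiments revolve around, here is a minimal sketch of timing one sleep/wake cycle. It assumes an engine object that exposes `sleep(level=...)` and `wake_up()` methods, as vLLM's offline `LLM` class does when created with `enable_sleep_mode=True`. The `FakeEngine` stand-in and the `time_sleep_wake_cycle` helper are hypothetical names used only for illustration; they are not taken from the scripts in this directory, which may measure things differently.

```python
import time


class FakeEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or vLLM installed."""

    def __init__(self):
        self.sleeping = False
        self.level = None

    def sleep(self, level=1):
        # A real engine would release GPU memory here.
        self.sleeping = True
        self.level = level

    def wake_up(self):
        # A real engine would restore GPU state here.
        self.sleeping = False


def time_sleep_wake_cycle(engine, level=1):
    """Time one sleep call and one wake-up call, in seconds."""
    t0 = time.perf_counter()
    engine.sleep(level=level)
    t1 = time.perf_counter()
    engine.wake_up()
    t2 = time.perf_counter()
    return {"sleep_s": t1 - t0, "wake_s": t2 - t1}
```

With a real vLLM engine in place of `FakeEngine`, the same helper could be called unchanged for level 1. After a level 2 sleep the weights typically have to be reloaded as well, so that step would need to be timed too.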
This test compares the end-to-end performance of sleep mode (enabled) versus nosleep mode (i.e., fully unloading the model).

- Script: online_default.py
- Corresponding Result: Table 1 (Sleep mode vs No sleep mode)

Sleep Mode (Test Group):

    python online_default.py sleep > out_info_sleep.log 2>&1

No Sleep Mode (Control Group):

    python online_default.py nosleep > out_info_nosleep.log 2>&1

This test compares the performance difference between enabling and disabling FP8 quantization while in Sleep Mode.
- Script: online_fp8.py
- Corresponding Result: Table 2 (Sleep mode (without fp8 vs fp8))

No FP8 (Test Group):

    python online_fp8.py > out_info.log 2>&1

FP8 (Control Group):

    python online_fp8.py fp8 > out_info_fp8.log 2>&1

This test compares the performance difference in Sleep Mode with and without a warmup phase.
- Script: online_sleep1_nowarm.py
- Corresponding Result: Table 3 (Sleep mode vs vLLM 0.11.0 no warmup)

With Warmup (Test Group):

    python online_sleep1_nowarm.py sleep1 > out_info_sleep1.log 2>&1

No Warmup (Control Group):

    python online_sleep1_nowarm.py nowarmup > out_info_nowarmup.log 2>&1

This test compares Sleep Mode Level 1 (default; weights are offloaded to CPU memory and the KV cache is discarded) with Level 2 (both weights and KV cache are discarded, so the weights must be reloaded after wake-up).
- Script: online_sleep1_2.py
- Corresponding Result: Table 4 (Sleep mode (without fp8) vs Sleep level 2 wake + reload weights)

Sleep Level 1 (Test Group):

    python online_sleep1_2.py sleep1 > out_info_sleep1.log 2>&1

Sleep Level 2 (Control Group):

    python online_sleep1_2.py sleep2 > out_info_sleep2.log 2>&1

This test compares the end-to-end performance of Sleep Mode Level 2 against nosleep mode (i.e., fully unloading the model).
- Script: online_sleep2_nosleep.py
- Corresponding Result: Table 5 (Sleep level 2 wake + reload weights vs No sleep mode)

Sleep Level 2 (Test Group):

    python online_sleep2_nosleep.py sleep2 > out_info_sleep2.log 2>&1

No Sleep Mode (Control Group):

    python online_sleep2_nosleep.py nosleep > out_info_nosleep.log 2>&1

This test compares the end-to-end performance and correctness of Sleep Mode Level 2 with FP8 KV Cache enabled (kv_cache_dtype="fp8") against the baseline (default/auto KV cache).
- Script: online_sleep2_fp8.py
- Corresponding Result: Table 6 (Sleep level 2 wake + reload weights: FP8 KV Cache vs. Default)

Test Group (FP8 KV Cache):

    python online_sleep2_fp8.py fp8 > online_sleep2_fp8.log 2>&1

Control Group (Default/No FP8):

    python online_sleep2_fp8.py > online_sleep2_nofp8.log 2>&1
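The commands above can also be driven from a small Python wrapper. The following is a minimal sketch, not part of this directory: `EXPERIMENTS`, `build_command`, `find_log_collisions`, and `run_experiment` are hypothetical names, and the table only mirrors the script names, mode arguments, and log file names listed above. Each run redirects stdout and stderr into a single log file, which is what `> file 2>&1` does in the shell.

```python
import subprocess
import sys
from collections import Counter
from pathlib import Path

# (script, mode argument or None, log file), copied from the commands above.
EXPERIMENTS = [
    ("online_default.py", "sleep", "out_info_sleep.log"),
    ("online_default.py", "nosleep", "out_info_nosleep.log"),
    ("online_fp8.py", None, "out_info.log"),
    ("online_fp8.py", "fp8", "out_info_fp8.log"),
    ("online_sleep1_nowarm.py", "sleep1", "out_info_sleep1.log"),
    ("online_sleep1_nowarm.py", "nowarmup", "out_info_nowarmup.log"),
    ("online_sleep1_2.py", "sleep1", "out_info_sleep1.log"),
    ("online_sleep1_2.py", "sleep2", "out_info_sleep2.log"),
    ("online_sleep2_nosleep.py", "sleep2", "out_info_sleep2.log"),
    ("online_sleep2_nosleep.py", "nosleep", "out_info_nosleep.log"),
    ("online_sleep2_fp8.py", "fp8", "online_sleep2_fp8.log"),
    ("online_sleep2_fp8.py", None, "online_sleep2_nofp8.log"),
]


def build_command(script, mode=None):
    """Build the argv list for one run, using the current Python interpreter."""
    cmd = [sys.executable, script]
    if mode is not None:
        cmd.append(mode)
    return cmd


def find_log_collisions(experiments=EXPERIMENTS):
    """Return log file names that more than one run would write to."""
    counts = Counter(log for _, _, log in experiments)
    return sorted(name for name, n in counts.items() if n > 1)


def run_experiment(script, mode, log_name, workdir="."):
    """Run one script with stdout and stderr merged into a single log file."""
    log_path = Path(workdir) / log_name
    with open(log_path, "w") as log:
        result = subprocess.run(
            build_command(script, mode),
            cwd=workdir,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

Note that several experiments share log names (`find_log_collisions()` lists them), so running everything in one directory overwrites earlier logs. Copying or renaming each log after its run, or using a separate `workdir` per table, avoids that.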