Commit f54496a
Fix vLLM slow test OOM by reducing GPU memory utilization and improving cleanup
The vLLM slow tests were failing with OOM errors when running after the accelerate tests. The issue:

1. The vLLM V1 engine requires a specific amount of free GPU memory at startup.
2. After the accelerate tests, only 5.89 GiB was free (out of 14.74 GiB).
3. vLLM with gpu_memory_utilization=0.6 wanted 8.84 GiB.

Fixes:

- Reduce gpu_memory_utilization from 0.6 to 0.35 in the test config (needs 5.16 GiB).
- Add a GPU memory cleanup fixture in conftest.py that runs before/after slow tests.
- Improve AsyncVLLMModel.cleanup() to properly delete the model object.

The gpu_memory_utilization parameter only affects KV cache allocation and does not impact model outputs with temperature=0.0, so this change is safe.
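The arithmetic behind the fix can be checked directly. A quick sketch using the figures quoted in the commit message (14.74 GiB total, 5.89 GiB free after the accelerate tests); vLLM reserves roughly total memory times gpu_memory_utilization at startup:

```python
# Check the memory budget quoted above: the amount vLLM requests at startup
# (total * gpu_memory_utilization) must fit in what is actually free.
total_gib = 14.74  # total GPU memory reported in the failure
free_gib = 5.89    # free memory left after the accelerate tests

for util in (0.6, 0.35):
    required = total_gib * util
    fits = required <= free_gib
    print(f"gpu_memory_utilization={util}: needs {required:.2f} GiB -> fits: {fits}")
# gpu_memory_utilization=0.6: needs 8.84 GiB -> fits: False
# gpu_memory_utilization=0.35: needs 5.16 GiB -> fits: True
```

This reproduces the 8.84 GiB and 5.16 GiB figures from the commit message and shows why only the reduced setting fits in the remaining free memory.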
1 parent e438e2d commit f54496a

File tree

3 files changed (+34 −1 lines changed)


examples/model_configs/vllm_model_config.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ model_parameters:
   tensor_parallel_size: 1
   data_parallel_size: 1
   pipeline_parallel_size: 1
-  gpu_memory_utilization: 0.6
+  gpu_memory_utilization: 0.35
   max_model_length: null
   swap_space: 4
   seed: 42

src/lighteval/models/vllm/vllm_model.py

Lines changed: 2 additions & 0 deletions
@@ -544,6 +544,8 @@ class AsyncVLLMModel(VLLMModel):
     is_async = True

     def cleanup(self):
+        if self.model is not None:
+            del self.model
         gc.collect()
         destroy_distributed_environment()
         torch.cuda.empty_cache()
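The cleanup() change follows a general pattern: drop the last strong reference to the large object before collecting, so its memory can actually be reclaimed. A minimal CPU-only sketch of that pattern (`DummyModel` and `Holder` are hypothetical stand-ins for the vLLM engine and AsyncVLLMModel, not real library classes):

```python
# Sketch of the cleanup pattern: delete the attribute holding the large
# object first, then force a collection. In the real code,
# destroy_distributed_environment() and torch.cuda.empty_cache() follow,
# but those need a GPU; a flag on a dummy class stands in here.
import gc

class DummyModel:
    freed = False
    def __del__(self):
        DummyModel.freed = True  # records that the object was reclaimed

class Holder:
    def __init__(self):
        self.model = DummyModel()

    def cleanup(self):
        # Mirrors the diff above: drop the reference before collecting.
        if self.model is not None:
            del self.model
        gc.collect()

h = Holder()
h.cleanup()
print(DummyModel.freed)  # True
```

Without the `del`, `self.model` keeps the object alive through `gc.collect()` and `torch.cuda.empty_cache()`, so the engine's GPU allocations stay resident.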

tests/conftest.py

Lines changed: 31 additions & 0 deletions
@@ -2,6 +2,8 @@

 # Copyright (c) 2024 The HuggingFace Team

+import gc
+
 import pytest


@@ -21,3 +23,32 @@ def pytest_collection_modifyitems(config, items):
     for item in items:
         if "slow" in item.keywords:
             item.add_marker(skip_slow)
+
+
+@pytest.fixture(autouse=True, scope="function")
+def cleanup_gpu_memory(request):
+    """Cleanup GPU memory before and after each test to prevent OOM errors."""
+    # Cleanup before the test (especially important for tests that run after other GPU-heavy tests)
+    if "slow" in request.keywords:
+        try:
+            import torch
+
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+                torch.cuda.synchronize()
+        except ImportError:
+            pass
+        gc.collect()
+
+    yield
+
+    # Cleanup after the test
+    try:
+        import torch
+
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+            torch.cuda.synchronize()
+    except ImportError:
+        pass
+    gc.collect()
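The fixture above relies on pytest's yield-fixture protocol: code before the `yield` runs as setup, code after it runs as teardown, with the test body in between. That control flow can be sketched with a plain generator, no pytest needed (the event strings here are illustrative, not part of the real fixture):

```python
# Sketch of the setup/teardown ordering a pytest yield-fixture provides,
# driven manually with a generator instead of the pytest machinery.
events = []

def cleanup_fixture(is_slow):
    if is_slow:
        events.append("pre-test cleanup")   # empty_cache/synchronize in the real fixture
    yield
    events.append("post-test cleanup")      # runs after every test, slow or not

gen = cleanup_fixture(is_slow=True)
next(gen)                  # setup phase: runs up to the yield
events.append("test body")
next(gen, None)            # teardown phase: resumes after the yield
print(events)
# ['pre-test cleanup', 'test body', 'post-test cleanup']
```

Note the asymmetry mirrored from the diff: the pre-test cleanup is gated on the `slow` marker, while the post-test cleanup runs unconditionally.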
