
Commit 3f606c5
Fix vLLM CI test by increasing gpu_memory_utilization to 0.4
The CI test was failing with:

    ValueError: To serve at least one request with the model's max seq len (8192), 1.5 GiB KV cache is needed, which is larger than the available KV cache memory (1.42 GiB).

Root cause:
- Tesla T4 GPU (15.36 GB) in the CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB is available for the KV cache
- 1.5 GiB is required for max_seq_len=8192
- Shortfall: 80 MB

Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for the KV cache (sufficient for the 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)
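The figures above can be sanity-checked with a quick back-of-envelope calculation. This is only a sketch using the numbers reported in the commit message; note that it scales the KV budget proportionally with the utilization fraction, which is an approximation (in practice the non-KV memory for weights and activations is a roughly fixed cost, so the real gain from raising the utilization is somewhat larger).

```python
# Numbers taken from the commit message (Tesla T4, max_seq_len=8192).
old_util = 0.35
new_util = 0.4
kv_available_old = 1.42   # GiB available for KV cache at utilization 0.35
kv_required = 1.5         # GiB vLLM says it needs for max_seq_len=8192

# Shortfall at the old setting (~80 MB, as the commit message states).
shortfall_gib = kv_required - kv_available_old

# Approximate new KV budget by scaling with the utilization fraction.
kv_available_new = kv_available_old * new_util / old_util  # ~1.62 GiB

print(f"shortfall: {shortfall_gib * 1024:.0f} MiB")
print(f"approx new KV budget: {kv_available_new:.2f} GiB")
```

With the new setting the approximate budget (~1.62 GiB) clears the 1.5 GiB requirement, which matches the claim in the commit message.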
1 parent f54496a commit 3f606c5

File tree

1 file changed (+1, -1)


examples/model_configs/vllm_model_config.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ model_parameters:
   tensor_parallel_size: 1
   data_parallel_size: 1
   pipeline_parallel_size: 1
-  gpu_memory_utilization: 0.35
+  gpu_memory_utilization: 0.4
   max_model_length: null
   swap_space: 4
   seed: 42
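The one-line change above can be verified by loading the updated config and checking the value that reaches vLLM. A minimal sketch, assuming the YAML fragment shown in the diff and the common PyYAML library; the fragment is inlined here so the snippet is self-contained rather than reading the file from disk.

```python
import yaml

# The relevant fragment of examples/model_configs/vllm_model_config.yaml
# after the fix (reconstructed from the diff above).
config_text = """
model_parameters:
  tensor_parallel_size: 1
  data_parallel_size: 1
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.4
  max_model_length: null
  swap_space: 4
  seed: 42
"""

params = yaml.safe_load(config_text)["model_parameters"]
print(params["gpu_memory_utilization"])
```

Loading the file through `yaml.safe_load` mirrors how such configs are typically consumed, so a check like this can guard against the value regressing to 0.35 in CI.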

0 commit comments