
Commit 3f606c5
Fix vLLM CI test by increasing gpu_memory_utilization to 0.4
The CI test was failing with:

    ValueError: To serve at least one request with the model's max seq len (8192), 1.5 GiB KV cache is needed, which is larger than the available KV cache memory (1.42 GiB).

Root cause:
- Tesla T4 GPU (15.36 GB) in the CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB is available for the KV cache
- 1.5 GiB is required for max_seq_len=8192
- Shortfall: 80 MB

Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for the KV cache (sufficient for the 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)
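The figures above can be sanity-checked with a quick back-of-envelope calculation. This is only a sketch using the numbers reported in the commit message; note that it scales the KV budget proportionally with the utilization fraction, which is an approximation (in practice the non-KV memory for weights and activations is a roughly fixed cost, so the real gain from raising the utilization is somewhat larger).

```python
# Numbers taken from the commit message (Tesla T4, max_seq_len=8192).
old_util = 0.35
new_util = 0.4
kv_available_old = 1.42   # GiB available for KV cache at utilization 0.35
kv_required = 1.5         # GiB vLLM says it needs for max_seq_len=8192

# Shortfall at the old setting (~80 MB, as the commit message states).
shortfall_gib = kv_required - kv_available_old

# Approximate new KV budget by scaling with the utilization fraction.
kv_available_new = kv_available_old * new_util / old_util  # ~1.62 GiB

print(f"shortfall: {shortfall_gib * 1024:.0f} MiB")
print(f"approx new KV budget: {kv_available_new:.2f} GiB")
```

With the new setting the approximate budget (~1.62 GiB) clears the 1.5 GiB requirement, which matches the claim in the commit message.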
1 parent f54496a commit 3f606c5

File tree

1 file changed (+1, -1)


examples/model_configs/vllm_model_config.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ model_parameters:
   tensor_parallel_size: 1
   data_parallel_size: 1
   pipeline_parallel_size: 1
-  gpu_memory_utilization: 0.35
+  gpu_memory_utilization: 0.4
   max_model_length: null
   swap_space: 4
   seed: 42
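The one-line change above can be verified by loading the updated config and checking the value that reaches vLLM. A minimal sketch, assuming the YAML fragment shown in the diff and the common PyYAML library; the fragment is inlined here so the snippet is self-contained rather than reading the file from disk.

```python
import yaml

# The relevant fragment of examples/model_configs/vllm_model_config.yaml
# after the fix (reconstructed from the diff above).
config_text = """
model_parameters:
  tensor_parallel_size: 1
  data_parallel_size: 1
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.4
  max_model_length: null
  swap_space: 4
  seed: 42
"""

params = yaml.safe_load(config_text)["model_parameters"]
print(params["gpu_memory_utilization"])
```

Loading the file through `yaml.safe_load` mirrors how such configs are typically consumed, so a check like this can guard against the value regressing to 0.35 in CI.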

0 commit comments