Commit df50b07

[Feature] Improve logging for error messages
Signed-off-by: Elizabeth Thomas <[email protected]>
1 parent 68b254d commit df50b07

File tree

10 files changed: +1681 −90 lines changed


README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -92,7 +92,7 @@ Find the full list of supported models [here](https://docs.vllm.ai/en/latest/mod
 Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
 
 ```bash
-pip install vllm
+uv pip install vllm
 ```
 
 Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
````

tests/v1/engine/test_initialization_errors.py

Lines changed: 531 additions & 0 deletions
Large diffs are not rendered by default.

vllm/v1/core/kv_cache_utils.py

Lines changed: 36 additions & 17 deletions
```diff
@@ -651,17 +651,30 @@ def check_enough_kv_cache_memory(vllm_config: VllmConfig,
         available_memory: Memory available for KV cache in bytes.
 
     Raises:
-        ValueError: If there is not enough memory available for the KV cache.
+        InsufficientKVCacheMemoryError: If there is not enough memory available for the KV cache.
     """
+    from vllm.v1.engine.initialization_errors import (
+        InsufficientKVCacheMemoryError, get_memory_suggestions
+    )
 
     # No need to check for available memory if the kv_cache_spec is empty
     if not kv_cache_spec:
         return
 
     if available_memory <= 0:
-        raise ValueError("No available memory for the cache blocks. "
-                         "Try increasing `gpu_memory_utilization` when "
-                         "initializing the engine.")
+        suggestions = get_memory_suggestions(
+            required_memory=1024**3,  # 1 GiB as minimum
+            available_memory=available_memory,
+            current_gpu_utilization=vllm_config.cache_config.gpu_memory_utilization,
+            max_model_len=vllm_config.model_config.max_model_len,
+            is_kv_cache=True
+        )
+        raise InsufficientKVCacheMemoryError(
+            required_kv_memory=1024**3,  # 1 GiB as minimum
+            available_kv_memory=available_memory,
+            max_model_len=vllm_config.model_config.max_model_len,
+            suggestions=suggestions
+        )
 
     max_model_len = vllm_config.model_config.max_model_len
     needed_memory = max_memory_usage_bytes(vllm_config, kv_cache_spec.values())
@@ -670,20 +683,26 @@ def check_enough_kv_cache_memory(vllm_config: VllmConfig,
     # Estimate the maximum model length that can fit in the available memory
     estimated_max_len = estimate_max_model_len(vllm_config, kv_cache_spec,
                                                available_memory)
-    estimated_msg = ""
+
+    suggestions = get_memory_suggestions(
+        required_memory=needed_memory,
+        available_memory=available_memory,
+        current_gpu_utilization=vllm_config.cache_config.gpu_memory_utilization,
+        max_model_len=max_model_len,
+        is_kv_cache=True
+    )
+
+    # Add model-specific suggestions
     if estimated_max_len > 0:
-        estimated_msg = (
-            "Based on the available memory, "
-            f"the estimated maximum model length is {estimated_max_len}.")
-
-    raise ValueError(
-        f"To serve at least one request with the models's max seq len "
-        f"({max_model_len}), ({needed_memory/GiB_bytes:.2f} GiB KV "
-        f"cache is needed, which is larger than the available KV cache "
-        f"memory ({available_memory/GiB_bytes:.2f} GiB). "
-        f"{estimated_msg} "
-        f"Try increasing `gpu_memory_utilization` or decreasing "
-        f"`max_model_len` when initializing the engine.")
+        suggestions.insert(0, f"Reduce max_model_len from {max_model_len} to {estimated_max_len} or lower")
+
+    raise InsufficientKVCacheMemoryError(
+        required_kv_memory=needed_memory,
+        available_kv_memory=available_memory,
+        max_model_len=max_model_len,
+        estimated_max_len=estimated_max_len,
+        suggestions=suggestions
+    )
 
 
 def create_kv_cache_group_specs(
```
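
For orientation, here is a minimal sketch of what the `get_memory_suggestions` helper called above could look like. Only the parameter names are taken from the call sites in this diff; the suggestion wording and selection logic are assumptions, and the actual implementation lives in the new `vllm/v1/engine/initialization_errors.py`.

```python
# Illustrative sketch only -- not the actual vLLM implementation.
# Parameter names mirror the call sites above; everything else is assumed.
def get_memory_suggestions(required_memory: int,
                           available_memory: int,
                           current_gpu_utilization: float,
                           max_model_len: int,
                           is_kv_cache: bool = False) -> list[str]:
    """Return human-readable remediation steps for a memory shortfall."""
    suggestions: list[str] = []
    if is_kv_cache:
        suggestions.append(
            f"Reduce max_model_len from {max_model_len} to a smaller value")
    if current_gpu_utilization < 0.95:
        suggestions.append(
            f"Increase gpu_memory_utilization from {current_gpu_utilization:.2f} "
            f"(e.g., to {min(current_gpu_utilization + 0.1, 0.95):.2f})")
    suggestions.append(
        "Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage")
    suggestions.append(
        "Use tensor parallelism to distribute the model across multiple GPUs")
    suggestions.append("Close other GPU processes to free up memory")
    return suggestions
```
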
Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@

# Enhanced Error Handling for vLLM V1 Initialization

This enhancement provides improved error handling and logging for common initialization errors in vLLM V1, making it easier for users to diagnose and resolve issues.

## Overview

The enhanced error handling addresses the most common initialization problems:

1. **Insufficient GPU Memory** - When the model is too large for available GPU memory
2. **Insufficient KV Cache Memory** - When there's not enough memory for the KV cache given the max_model_len
3. **Model Loading Errors** - When model files can't be loaded or are incompatible
4. **CUDA Errors** - When CUDA-related issues occur during initialization

## Key Features

### 1. Detailed Error Messages
Instead of generic error messages, users now get:
- Clear descriptions of what went wrong
- Specific memory requirements vs. available memory
- Estimated maximum model lengths based on available memory
- Context about where the error occurred (model loading, KV cache, etc.)

### 2. Actionable Suggestions
Each error provides specific suggestions like:
- Adjusting `gpu_memory_utilization`
- Reducing `max_model_len`
- Using quantization (GPTQ, AWQ, FP8)
- Enabling tensor parallelism
- Closing other GPU processes

### 3. Enhanced Logging
- Detailed initialization information logged at startup
- Memory usage statistics
- Model configuration details
- Progress indicators for different initialization phases

### 4. Critical Safety Improvements
- **ZeroDivisionError Prevention**: Safely handles edge cases where memory profiling returns zero values, preventing uncaught exceptions during initialization (see the sketch after this list)
- **Input Validation**: All error classes validate input parameters (no negative memory values, positive model lengths)
- **Graceful Error Messaging**: Instead of cryptic crashes, users receive clear explanations of configuration issues
- **Robust Error Recovery**: Handles unusual memory profiling results that could occur with certain models or test configurations
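
To make the first bullet concrete, the kind of guard meant here looks roughly like the following; the function and variable names are purely illustrative, not the actual vLLM profiling code.

```python
# Illustrative guard only -- names are made up for this sketch.
def estimate_max_tokens(available_kv_memory: int, bytes_per_token: int) -> int:
    """Estimate how many tokens fit in the KV cache without dividing by zero."""
    if bytes_per_token <= 0:
        # Memory profiling returned a degenerate value; fail with a clear
        # message instead of letting a ZeroDivisionError escape.
        raise ValueError(
            "Memory profiling reported a non-positive per-token KV cache size; "
            "check the model configuration and GPU memory state.")
    return available_kv_memory // bytes_per_token
```
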
## New Error Classes

### `InsufficientMemoryError`
Raised when there's not enough GPU memory to load the model.

```text
InsufficientMemoryError: Insufficient GPU memory to load the model.
Required: 24.50 GiB
Available: 22.30 GiB
Shortage: 2.20 GiB

Suggestions to resolve this issue:
1. Try increasing gpu_memory_utilization first (safest option)
2. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
3. Consider using quantization (GPTQ, AWQ, FP8) to reduce model memory usage
4. Use tensor parallelism to distribute the model across multiple GPUs
5. Close other GPU processes to free up memory
```

### `InsufficientKVCacheMemoryError`
Raised when there's not enough memory for the KV cache.

```text
InsufficientKVCacheMemoryError: Insufficient memory for KV cache to serve requests.
Required KV cache memory: 8.45 GiB (for max_model_len=4096)
Available KV cache memory: 6.20 GiB
Shortage: 2.25 GiB
Based on available memory, estimated maximum model length: 3000

Suggestions to resolve this issue:
1. Reduce max_model_len from 4096 to 3000 or lower
2. Reduce max_model_len from 4096 to a smaller value
3. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
4. Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage
5. Use tensor parallelism to distribute the model across multiple GPUs
```
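
From the constructor arguments used in `check_enough_kv_cache_memory()` (shown in the `kv_cache_utils.py` diff above), the class plausibly has roughly the following shape. This is a sketch for orientation only: the base class, validation details, and message formatting are assumptions, not the actual definition from `vllm/v1/engine/initialization_errors.py`.

```python
from typing import Optional

GiB = 1024**3

class InsufficientKVCacheMemoryError(RuntimeError):
    """Raised when the KV cache for max_model_len does not fit in memory."""

    def __init__(self,
                 required_kv_memory: int,
                 available_kv_memory: int,
                 max_model_len: int,
                 suggestions: Optional[list[str]] = None,
                 estimated_max_len: Optional[int] = None) -> None:
        # Input validation mentioned under "Critical Safety Improvements".
        if required_kv_memory < 0 or available_kv_memory < 0:
            raise ValueError("Memory values must be non-negative")
        if max_model_len <= 0:
            raise ValueError("max_model_len must be positive")

        lines = [
            "Insufficient memory for KV cache to serve requests.",
            f"Required KV cache memory: {required_kv_memory / GiB:.2f} GiB "
            f"(for max_model_len={max_model_len})",
            f"Available KV cache memory: {available_kv_memory / GiB:.2f} GiB",
            f"Shortage: {(required_kv_memory - available_kv_memory) / GiB:.2f} GiB",
        ]
        if estimated_max_len is not None and estimated_max_len > 0:
            lines.append("Based on available memory, estimated maximum "
                         f"model length: {estimated_max_len}")
        if suggestions:
            lines.append("")
            lines.append("Suggestions to resolve this issue:")
            lines.extend(f"{i}. {s}" for i, s in enumerate(suggestions, 1))
        super().__init__("\n".join(lines))
```
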
### `ModelLoadingError`
Raised when model loading fails for various reasons.

```text
ModelLoadingError: Failed to load model 'meta-llama/Llama-3.1-8B' during initialization.
Error details: CUDA out of memory. Tried to allocate 2.50 GiB

Suggestions to resolve this issue:
1. The model is too large for available GPU memory
2. Consider using a smaller model or quantization
3. Try tensor parallelism to distribute the model across multiple GPUs
4. Reduce gpu_memory_utilization to leave more memory for CUDA operations
```

## Implementation Details

### Files Modified/Added

1. **`vllm/v1/engine/initialization_errors.py`** (NEW)
   - Contains the new error classes and utility functions
   - Provides suggestion generation based on error context
   - Includes detailed logging functions

2. **`vllm/v1/engine/core.py`** (ENHANCED)
   - Enhanced `_initialize_kv_caches()` method with better error handling
   - Detailed logging of initialization progress
   - Proper exception handling with enhanced error messages

3. **`vllm/v1/core/kv_cache_utils.py`** (ENHANCED)
   - Updated `check_enough_kv_cache_memory()` to use new error classes
   - Better error messages with specific suggestions

4. **`vllm/v1/worker/gpu_worker.py`** (ENHANCED)
   - Enhanced memory checking in `init_device()`
   - Better error handling in `load_model()` and `determine_available_memory()`
   - More detailed memory profiling error handling

5. **`vllm/v1/engine/llm_engine.py`** (ENHANCED)
   - Enhanced `__init__()` method with comprehensive error handling
   - Better error messages for tokenizer and processor initialization

### Error Handling Strategy

The enhancement follows a layered approach:

1. **Low-level functions** (workers, memory profiling) catch specific errors and provide context
2. **Mid-level functions** (core engine, KV cache utils) add domain-specific suggestions
3. **High-level functions** (LLM engine) provide user-friendly error aggregation

Each layer adds value while preserving the original error context through exception chaining.
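
A hedged illustration of this layering (all names below are placeholders standing in for the real worker and engine code, not the actual vLLM internals): the low level fails, the mid level wraps the failure with context, and `raise ... from ...` keeps the original exception attached for the layers above.

```python
# Placeholder names only -- this is not the actual vLLM worker/engine code.
class ModelLoadingError(RuntimeError):
    """Stand-in for the real class in vllm.v1.engine.initialization_errors."""

def _load_weights(model_name: str) -> None:
    # Stand-in for the low-level loading step that can fail.
    raise MemoryError(f"CUDA out of memory while loading {model_name}")

def load_model_with_context(model_name: str) -> None:
    try:
        _load_weights(model_name)
    except Exception as err:
        # Mid level: add domain-specific context, while `from err` preserves
        # the original exception as __cause__ so the high-level caller (and
        # the user) still sees the root cause.
        raise ModelLoadingError(
            f"Failed to load model '{model_name}' during initialization.\n"
            f"Error details: {err}") from err
```
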
## Usage Examples

### Basic Usage

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_memory_utilization=0.95,
        max_model_len=8192
    )
except Exception as e:
    print(f"Initialization failed: {e}")
    # Error message will include specific suggestions
```

### Advanced Error Handling

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import (
    InsufficientMemoryError,
    InsufficientKVCacheMemoryError,
    ModelLoadingError
)

try:
    llm = LLM(model="large-model", gpu_memory_utilization=0.9)
except InsufficientMemoryError as e:
    print(f"Memory issue: {e}")
    # Handle memory-specific errors
except InsufficientKVCacheMemoryError as e:
    print(f"KV cache issue: {e}")
    # Handle KV cache-specific errors
except ModelLoadingError as e:
    print(f"Model loading issue: {e}")
    # Handle model loading errors
```

## Testing

Run the demo script to see the enhanced error handling in action:

```bash
python enhanced_error_demo.py
```

This script intentionally triggers various error conditions to demonstrate the improved error messages and suggestions.

## Benefits

1. **Faster Debugging** - Users can quickly understand what went wrong
2. **Self-Service Resolution** - Clear suggestions help users fix issues independently
3. **Better Support Experience** - More detailed error reports improve support quality
4. **Reduced Trial-and-Error** - Specific suggestions reduce the need for guesswork

## Backward Compatibility

The enhancement is fully backward compatible:
- Existing error handling code continues to work
- New error classes inherit from standard Python exceptions
- Original error messages are preserved in the error chain (see the sketch below)
- No breaking changes to existing APIs
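
For example, a pre-existing call site that only knows about generic exceptions keeps working, and the wrapped low-level error stays reachable through the chain (assuming, as stated above, that the new classes derive from standard Python exceptions):

```python
from vllm import LLM

try:
    llm = LLM(model="large-model", gpu_memory_utilization=0.9)
except Exception as exc:  # old-style broad handler still catches the new errors
    print(f"Initialization failed: {exc}")
    if exc.__cause__ is not None:
        # The original low-level error is preserved in the exception chain.
        print(f"Root cause: {exc.__cause__!r}")
```
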
## Future Enhancements

Potential areas for further improvement:
1. Add error handling for distributed setup issues
2. Enhanced logging for multimodal model initialization
3. Better error messages for quantization setup
4. Integration with monitoring/telemetry systems
