Commit da69cf6

[Feature] Improve logging for error messages
Signed-off-by: Elizabeth Thomas <[email protected]>
1 parent 68b254d commit da69cf6

10 files changed (+1985, -92 lines)

ENHANCED_ERROR_HANDLING_README.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@

# Enhanced Error Handling for vLLM V1 Initialization

This enhancement provides improved error handling and logging for common initialization errors in vLLM V1, making it easier for users to diagnose and resolve issues.

## Overview

The enhanced error handling addresses the most common initialization problems:

1. **Insufficient GPU Memory** - When the model is too large for available GPU memory
2. **Insufficient KV Cache Memory** - When there's not enough memory for the KV cache given the configured `max_model_len`
3. **Model Loading Errors** - When model files can't be loaded or are incompatible
4. **CUDA Errors** - When CUDA-related issues occur during initialization

## Key Features

### 1. Detailed Error Messages

Instead of generic error messages, users now get:

- Clear descriptions of what went wrong
- Specific memory requirements vs. available memory
- Estimated maximum model lengths based on available memory
- Context about where the error occurred (model loading, KV cache, etc.)

### 2. Actionable Suggestions

Each error provides specific suggestions, such as:

- Adjusting `gpu_memory_utilization`
- Reducing `max_model_len`
- Using quantization (GPTQ, AWQ, FP8)
- Enabling tensor parallelism
- Closing other GPU processes

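All of these knobs map onto `LLM` constructor arguments. A minimal sketch of applying them, with illustrative values rather than recommendations:

```python
from vllm import LLM

# Illustrative values only; tune them to your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim
    max_model_len=4096,            # smaller values shrink the KV cache
    tensor_parallel_size=2,        # shard the model across two GPUs
    # quantization="awq",          # needs an AWQ-quantized checkpoint
)
```
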
### 3. Enhanced Logging

- Detailed initialization information logged at startup
- Memory usage statistics
- Model configuration details
- Progress indicators for different initialization phases

## New Error Classes

### `InsufficientMemoryError`

Raised when there's not enough GPU memory to load the model.

```text
InsufficientMemoryError: Insufficient GPU memory to load the model.
Required: 24.50 GiB
Available: 22.30 GiB
Shortage: 2.20 GiB

Suggestions to resolve this issue:
1. Try increasing gpu_memory_utilization first (safest option)
2. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
3. Consider using quantization (GPTQ, AWQ, FP8) to reduce model memory usage
4. Use tensor parallelism to distribute the model across multiple GPUs
5. Close other GPU processes to free up memory
```

### `InsufficientKVCacheMemoryError`

Raised when there's not enough memory for the KV cache.

```text
InsufficientKVCacheMemoryError: Insufficient memory for KV cache to serve requests.
Required KV cache memory: 8.45 GiB (for max_model_len=4096)
Available KV cache memory: 6.20 GiB
Shortage: 2.25 GiB
Based on available memory, estimated maximum model length: 3000

Suggestions to resolve this issue:
1. Reduce max_model_len from 4096 to 3000 or lower
2. Reduce max_model_len from 4096 to a smaller value
3. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
4. Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage
5. Use tensor parallelism to distribute the model across multiple GPUs
```

### `ModelLoadingError`

Raised when model loading fails, for example because model files are missing or incompatible, or the load itself runs out of memory.

```text
ModelLoadingError: Failed to load model 'meta-llama/Llama-3.1-8B' during initialization.
Error details: CUDA out of memory. Tried to allocate 2.50 GiB

Suggestions to resolve this issue:
1. The model is too large for available GPU memory
2. Consider using a smaller model or quantization
3. Try tensor parallelism to distribute the model across multiple GPUs
4. Reduce gpu_memory_utilization to leave more memory for CUDA operations
```

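For orientation, here is a hypothetical sketch of how an error class of this shape could be built. The constructor signature and formatting are assumptions inferred from the sample output above, not the actual `initialization_errors.py` code:

```python
class InsufficientMemoryError(RuntimeError):
    """Hypothetical sketch mirroring the sample message format above."""

    def __init__(self, required_gib: float, available_gib: float,
                 suggestions: list[str]) -> None:
        lines = [
            "Insufficient GPU memory to load the model.",
            f"Required: {required_gib:.2f} GiB",
            f"Available: {available_gib:.2f} GiB",
            f"Shortage: {required_gib - available_gib:.2f} GiB",
            "",
            "Suggestions to resolve this issue:",
        ]
        # Number the suggestions 1..n, as in the sample output.
        lines += [f"{i}. {s}" for i, s in enumerate(suggestions, 1)]
        super().__init__("\n".join(lines))
```
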
## Implementation Details

### Files Modified/Added

1. **`vllm/v1/engine/initialization_errors.py`** (NEW)
   - Contains the new error classes and utility functions
   - Provides suggestion generation based on error context
   - Includes detailed logging functions

2. **`vllm/v1/engine/core.py`** (ENHANCED)
   - Enhanced `_initialize_kv_caches()` method with better error handling
   - Detailed logging of initialization progress
   - Proper exception handling with enhanced error messages

3. **`vllm/v1/core/kv_cache_utils.py`** (ENHANCED)
   - Updated `check_enough_kv_cache_memory()` to use the new error classes (see the sketch after this list)
   - Better error messages with specific suggestions

4. **`vllm/v1/worker/gpu_worker.py`** (ENHANCED)
   - Enhanced memory checking in `init_device()`
   - Better error handling in `load_model()` and `determine_available_memory()`
   - More detailed memory profiling error handling

5. **`vllm/v1/engine/llm_engine.py`** (ENHANCED)
   - Enhanced `__init__()` method with comprehensive error handling
   - Better error messages for tokenizer and processor initialization

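To make the `kv_cache_utils.py` change concrete, here is a hypothetical sketch of what a check like `check_enough_kv_cache_memory()` could do. The signature and the linear length estimate are assumptions, not the actual vLLM code:

```python
class InsufficientKVCacheMemoryError(RuntimeError):
    pass

def check_enough_kv_cache_memory(required_gib: float, available_gib: float,
                                 max_model_len: int) -> None:
    """Hypothetical sketch: raise if the KV cache does not fit."""
    if available_gib >= required_gib:
        return
    # KV cache memory grows roughly linearly with context length, so scale
    # max_model_len by the memory ratio to estimate a length that would fit.
    estimated_len = int(max_model_len * available_gib / required_gib)
    raise InsufficientKVCacheMemoryError(
        "Insufficient memory for KV cache to serve requests.\n"
        f"Required KV cache memory: {required_gib:.2f} GiB "
        f"(for max_model_len={max_model_len})\n"
        f"Available KV cache memory: {available_gib:.2f} GiB\n"
        f"Shortage: {required_gib - available_gib:.2f} GiB\n"
        f"Based on available memory, estimated maximum model length: "
        f"{estimated_len}")
```

With the numbers from the sample message above (8.45 GiB required, 6.20 GiB available, max_model_len=4096), this estimate comes out to roughly 3000, matching the sample.
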
### Error Handling Strategy

The enhancement follows a layered approach:

1. **Low-level functions** (workers, memory profiling) catch specific errors and provide context
2. **Mid-level functions** (core engine, KV cache utils) add domain-specific suggestions
3. **High-level functions** (LLM engine) provide user-friendly error aggregation

Each layer adds value while preserving the original error context through exception chaining, as sketched below.

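A minimal sketch of that chaining pattern; the function names and messages are illustrative, not the actual vLLM call sites:

```python
class ModelLoadingError(RuntimeError):
    """Stands in for the real class in initialization_errors.py."""

def low_level_load() -> None:
    # A low-level failure, e.g. the allocator reporting out of memory.
    raise MemoryError("CUDA out of memory. Tried to allocate 2.50 GiB")

def mid_level_load(model_name: str) -> None:
    try:
        low_level_load()
    except MemoryError as e:
        # `raise ... from e` stores the original error in __cause__, so the
        # full low-level context survives up to the user-facing layer.
        raise ModelLoadingError(
            f"Failed to load model '{model_name}' during initialization.\n"
            f"Error details: {e}") from e

try:
    mid_level_load("meta-llama/Llama-3.1-8B")
except ModelLoadingError as err:
    print(err)
    print("Caused by:", repr(err.__cause__))
```
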
## Usage Examples

### Basic Usage

```python
import os

# Select the vLLM V1 engine before importing vLLM.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_memory_utilization=0.95,
        max_model_len=8192,
    )
except Exception as e:
    # The error message will include specific suggestions.
    print(f"Initialization failed: {e}")
```

### Advanced Error Handling

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import (
    InsufficientMemoryError,
    InsufficientKVCacheMemoryError,
    ModelLoadingError,
)

try:
    llm = LLM(model="large-model", gpu_memory_utilization=0.9)
except InsufficientMemoryError as e:
    # Handle memory-specific errors.
    print(f"Memory issue: {e}")
except InsufficientKVCacheMemoryError as e:
    # Handle KV cache-specific errors.
    print(f"KV cache issue: {e}")
except ModelLoadingError as e:
    # Handle model loading errors.
    print(f"Model loading issue: {e}")
```

## Testing

Run the demo script to see the enhanced error handling in action:

```bash
python enhanced_error_demo.py
```

This script intentionally triggers various error conditions to demonstrate the improved error messages and suggestions.

## Benefits

1. **Faster Debugging** - Users can quickly understand what went wrong
2. **Self-Service Resolution** - Clear suggestions help users fix issues independently
3. **Better Support Experience** - More detailed error reports improve support quality
4. **Reduced Trial-and-Error** - Specific suggestions reduce the need for guesswork

## Backward Compatibility

The enhancement is fully backward compatible:

- Existing error handling code continues to work
- New error classes inherit from standard Python exceptions
- Original error messages are preserved in the error chain
- No breaking changes to existing APIs

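The inheritance point is what keeps old handlers working: a generic `except Exception` still catches the new, more specific errors. A tiny sketch (the `RuntimeError` base here is an assumption):

```python
# Assumption: the new classes subclass a standard exception such as RuntimeError.
class InsufficientMemoryError(RuntimeError):
    pass

def init_engine() -> None:
    raise InsufficientMemoryError("Insufficient GPU memory to load the model.")

# Pre-existing code written before these classes existed keeps working:
try:
    init_engine()
except Exception as e:
    print(f"Initialization failed: {e}")
```
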
## Future Enhancements

Potential areas for further improvement:

1. Add error handling for distributed setup issues
2. Enhanced logging for multimodal model initialization
3. Better error messages for quantization setup
4. Integration with monitoring/telemetry systems
