Commit d1b9f49

[Feature] Improve logging for error messages

Signed-off-by: Elizabeth Thomas <[email protected]>
1 parent 68b254d commit d1b9f49

10 files changed: +2018 −92 lines

ENHANCED_ERROR_HANDLING_README.md

Lines changed: 204 additions & 0 deletions
# Enhanced Error Handling for vLLM V1 Initialization

This enhancement provides improved error handling and logging for common initialization errors in vLLM V1, making it easier for users to diagnose and resolve issues.

## Overview

The enhanced error handling addresses the most common initialization problems:

1. **Insufficient GPU Memory** - When the model is too large for the available GPU memory
2. **Insufficient KV Cache Memory** - When there's not enough memory for the KV cache given the `max_model_len`
3. **Model Loading Errors** - When model files can't be loaded or are incompatible
4. **CUDA Errors** - When CUDA-related issues occur during initialization
## Key Features

### 1. Detailed Error Messages

Instead of generic error messages, users now get:

- Clear descriptions of what went wrong
- Specific memory requirements vs. available memory
- Estimated maximum model lengths based on available memory (see the sketch after this list)
- Context about where the error occurred (model loading, KV cache, etc.)
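
The maximum-model-length estimate follows from the per-token KV cache footprint. The actual formula lives in the modified `kv_cache_utils.py` and is not reproduced here; the sketch below is an illustrative approximation, and the helper name and its arguments are assumptions:

```python
def estimate_max_model_len(available_bytes: int,
                           num_layers: int,
                           num_kv_heads: int,
                           head_dim: int,
                           dtype_bytes: int = 2) -> int:
    """Illustrative approximation: per token, each layer stores one key and
    one value vector per KV head, so the KV cache costs
    2 * num_layers * num_kv_heads * head_dim * dtype_bytes bytes per token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return available_bytes // bytes_per_token

# E.g., a hypothetical 32-layer model with 8 KV heads, head_dim 128, fp16,
# and 6.20 GiB left for the KV cache:
print(estimate_max_model_len(int(6.20 * 1024**3), 32, 8, 128))  # ≈ 50789
```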

### 2. Actionable Suggestions

Each error provides specific suggestions, most of which map to an `LLM` constructor argument (see the sketch after this list):

- Adjusting `gpu_memory_utilization`
- Reducing `max_model_len`
- Using quantization (GPTQ, AWQ, FP8)
- Enabling tensor parallelism
- Closing other GPU processes
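
As a rough illustration, here is how those knobs appear on vLLM's `LLM` constructor; the model name and values are placeholders, not recommendations:

```python
from vllm import LLM

# Hypothetical values; tune for your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # placeholder model
    gpu_memory_utilization=0.90,      # fraction of GPU memory vLLM may use
    max_model_len=4096,               # cap on sequence length (shrinks the KV cache)
    quantization="fp8",               # or "gptq" / "awq", if the checkpoint supports it
    tensor_parallel_size=2,           # shard the model across 2 GPUs
)
```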

### 3. Enhanced Logging

- Detailed initialization information logged at startup
- Memory usage statistics
- Model configuration details
- Progress indicators for the different initialization phases

## New Error Classes

### `InsufficientMemoryError`

Raised when there's not enough GPU memory to load the model.

```python
InsufficientMemoryError: Insufficient GPU memory to load the model.
Required: 24.50 GiB
Available: 22.30 GiB
Shortage: 2.20 GiB

Suggestions to resolve this issue:
1. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90) - usually the safest first step
2. Consider using quantization (GPTQ, AWQ, FP8) to reduce model memory usage
3. Use tensor parallelism to distribute the model across multiple GPUs
4. Close other GPU processes to free up memory
```

### `InsufficientKVCacheMemoryError`

Raised when there's not enough memory for the KV cache.

```python
InsufficientKVCacheMemoryError: Insufficient memory for KV cache to serve requests.
Required KV cache memory: 8.45 GiB (for max_model_len=4096)
Available KV cache memory: 6.20 GiB
Shortage: 2.25 GiB
Based on available memory, estimated maximum model length: 3000

Suggestions to resolve this issue:
1. Reduce max_model_len from 4096 to 3000 or lower
2. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
3. Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage
4. Use tensor parallelism to distribute the model across multiple GPUs
```

### `ModelLoadingError`

Raised when model loading fails for various reasons.

```python
ModelLoadingError: Failed to load model 'meta-llama/Llama-3.1-8B' during initialization.
Error details: CUDA out of memory. Tried to allocate 2.50 GiB

Suggestions to resolve this issue:
1. The model is too large for available GPU memory
2. Consider using a smaller model or quantization
3. Try tensor parallelism to distribute the model across multiple GPUs
4. Reduce gpu_memory_utilization to leave more memory for CUDA operations
```
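
The definitions of these classes live in the new `initialization_errors.py` (listed under Implementation Details below). As a rough sketch of the shape such a class might take — the base class, field names, and formatting are assumptions, not the committed code:

```python
class InsufficientMemoryError(RuntimeError):
    """Sketch only: not necessarily the committed implementation."""

    def __init__(self, required_gib: float, available_gib: float,
                 suggestions: list[str]):
        shortage = required_gib - available_gib
        lines = [
            "Insufficient GPU memory to load the model.",
            f"Required: {required_gib:.2f} GiB",
            f"Available: {available_gib:.2f} GiB",
            f"Shortage: {shortage:.2f} GiB",
            "",
            "Suggestions to resolve this issue:",
        ]
        # Number the context-specific suggestions, as in the examples above.
        lines += [f"{i}. {s}" for i, s in enumerate(suggestions, 1)]
        super().__init__("\n".join(lines))
```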

## Implementation Details

### Files Modified/Added

1. **`vllm/v1/engine/initialization_errors.py`** (NEW)
   - Contains the new error classes and utility functions
   - Provides suggestion generation based on error context
   - Includes detailed logging functions

2. **`vllm/v1/engine/core.py`** (ENHANCED)
   - Enhanced `_initialize_kv_caches()` method with better error handling
   - Detailed logging of initialization progress
   - Proper exception handling with enhanced error messages

3. **`vllm/v1/core/kv_cache_utils.py`** (ENHANCED)
   - Updated `check_enough_kv_cache_memory()` to use the new error classes (see the sketch after this list)
   - Better error messages with specific suggestions

4. **`vllm/v1/worker/gpu_worker.py`** (ENHANCED)
   - Enhanced memory checking in `init_device()`
   - Better error handling in `load_model()` and `determine_available_memory()`
   - More detailed memory profiling error handling

5. **`vllm/v1/engine/llm_engine.py`** (ENHANCED)
   - Enhanced `__init__()` method with comprehensive error handling
   - Better error messages for tokenizer and processor initialization
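
For illustration, the updated check might raise the new error along these lines. This is a hedged sketch, not the committed diff; the argument names and signature are assumptions:

```python
from vllm.v1.engine.initialization_errors import InsufficientKVCacheMemoryError

def check_enough_kv_cache_memory(required_bytes: int,
                                 available_bytes: int,
                                 max_model_len: int) -> None:
    """Sketch: raise InsufficientKVCacheMemoryError with actionable context."""
    if required_bytes <= available_bytes:
        return
    GiB = 1024**3
    raise InsufficientKVCacheMemoryError(
        f"Insufficient memory for KV cache to serve requests.\n"
        f"Required KV cache memory: {required_bytes / GiB:.2f} GiB "
        f"(for max_model_len={max_model_len})\n"
        f"Available KV cache memory: {available_bytes / GiB:.2f} GiB\n"
        f"Shortage: {(required_bytes - available_bytes) / GiB:.2f} GiB"
    )
```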

### Error Handling Strategy

The enhancement follows a layered approach:

1. **Low-level functions** (workers, memory profiling) catch specific errors and provide context
2. **Mid-level functions** (core engine, KV cache utils) add domain-specific suggestions
3. **High-level functions** (LLM engine) provide user-friendly error aggregation

Each layer adds value while preserving the original error context through exception chaining.
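
In Python, that chaining is the standard `raise ... from ...` pattern. A minimal sketch of how a mid-level layer might wrap a low-level failure — the loader function here is a stand-in, not vLLM code:

```python
import torch
from vllm.v1.engine.initialization_errors import ModelLoadingError

def _load_weights(model_name: str) -> None:
    # Stand-in for the real low-level loader, which would move weights to GPU.
    raise torch.cuda.OutOfMemoryError("CUDA out of memory. Tried to allocate 2.50 GiB")

def load_model_with_context(model_name: str) -> None:
    """Illustrative mid-level wrapper: re-raise with suggestions attached."""
    try:
        _load_weights(model_name)
    except torch.cuda.OutOfMemoryError as e:
        raise ModelLoadingError(
            f"Failed to load model '{model_name}' during initialization.\n"
            f"Error details: {e}"
        ) from e  # keeps the original error reachable as __cause__
```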

## Usage Examples

### Basic Usage

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_memory_utilization=0.95,
        max_model_len=8192
    )
except Exception as e:
    print(f"Initialization failed: {e}")
    # The error message will include specific suggestions
```

### Advanced Error Handling

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import (
    InsufficientMemoryError,
    InsufficientKVCacheMemoryError,
    ModelLoadingError
)

try:
    llm = LLM(model="large-model", gpu_memory_utilization=0.9)
except InsufficientMemoryError as e:
    print(f"Memory issue: {e}")
    # Handle memory-specific errors
except InsufficientKVCacheMemoryError as e:
    print(f"KV cache issue: {e}")
    # Handle KV cache-specific errors
except ModelLoadingError as e:
    print(f"Model loading issue: {e}")
    # Handle model loading errors
```

## Testing

Run the demo script to see the enhanced error handling in action:

```bash
python enhanced_error_demo.py
```

This script intentionally triggers various error conditions to demonstrate the improved error messages and suggestions.
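
For example, one way a demo can force the KV cache check to fail is to request a context length far beyond what fits. A sketch of such a trigger — the demo's actual contents are not reproduced here, and the model name is a placeholder:

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import InsufficientKVCacheMemoryError

try:
    # Deliberately oversized context to provoke the KV cache memory check.
    LLM(model="meta-llama/Llama-3.1-8B", max_model_len=1_000_000)
except InsufficientKVCacheMemoryError as e:
    print(f"Triggered as expected:\n{e}")
```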

## Benefits

1. **Faster Debugging** - Users can quickly understand what went wrong
2. **Self-Service Resolution** - Clear suggestions help users fix issues independently
3. **Better Support Experience** - More detailed error reports improve support quality
4. **Reduced Trial-and-Error** - Specific suggestions reduce the need for guesswork

## Backward Compatibility

The enhancement is fully backward compatible:

- Existing error handling code continues to work
- The new error classes inherit from standard Python exceptions
- Original error messages are preserved in the error chain
- No breaking changes to existing APIs
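
Concretely, a pre-existing broad handler still catches the new errors, and the original low-level exception remains reachable via `__cause__`. A small sketch, assuming (as the list above states) that the classes subclass a standard exception:

```python
from vllm import LLM

try:
    llm = LLM(model="large-model")  # may raise one of the new error classes
except Exception as e:              # legacy handlers keep working
    print(f"Initialization failed: {e}")
    if e.__cause__ is not None:
        # The original error is preserved by exception chaining.
        print(f"Root cause: {e.__cause__!r}")
```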

## Future Enhancements

Potential areas for further improvement:

1. Error handling for distributed setup issues
2. Enhanced logging for multimodal model initialization
3. Better error messages for quantization setup
4. Integration with monitoring/telemetry systems
