Commit d1b9f49

[Feature] Improve logging for error messages

Signed-off-by: Elizabeth Thomas <[email protected]>
1 parent 68b254d commit d1b9f49

10 files changed: +2018 −92 lines

ENHANCED_ERROR_HANDLING_README.md

Lines changed: 204 additions & 0 deletions
# Enhanced Error Handling for vLLM V1 Initialization

This enhancement provides improved error handling and logging for common initialization errors in vLLM V1, making it easier for users to diagnose and resolve issues.

## Overview

The enhanced error handling addresses the most common initialization problems:

1. **Insufficient GPU Memory** - When the model is too large for the available GPU memory
2. **Insufficient KV Cache Memory** - When there's not enough memory for the KV cache given the `max_model_len`
3. **Model Loading Errors** - When model files can't be loaded or are incompatible
4. **CUDA Errors** - When CUDA-related issues occur during initialization
## Key Features

### 1. Detailed Error Messages

Instead of generic error messages, users now get:

- Clear descriptions of what went wrong
- Specific memory requirements vs. available memory
- Estimated maximum model lengths based on available memory (see the sketch after this list)
- Context about where the error occurred (model loading, KV cache, etc.)
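
The maximum-model-length estimate follows from the per-token KV cache footprint. The actual formula lives in the modified `kv_cache_utils.py` and is not reproduced here; the sketch below is an illustrative approximation, and the helper name and its arguments are assumptions:

```python
def estimate_max_model_len(available_bytes: int,
                           num_layers: int,
                           num_kv_heads: int,
                           head_dim: int,
                           dtype_bytes: int = 2) -> int:
    """Illustrative approximation: per token, each layer stores one key and
    one value vector per KV head, so the KV cache costs
    2 * num_layers * num_kv_heads * head_dim * dtype_bytes bytes per token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return available_bytes // bytes_per_token

# E.g., a hypothetical 32-layer model with 8 KV heads, head_dim 128, fp16,
# and 6.20 GiB left for the KV cache:
print(estimate_max_model_len(int(6.20 * 1024**3), 32, 8, 128))  # ≈ 50789
```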

### 2. Actionable Suggestions

Each error provides specific suggestions, most of which map to an `LLM` constructor argument (see the sketch after this list):

- Adjusting `gpu_memory_utilization`
- Reducing `max_model_len`
- Using quantization (GPTQ, AWQ, FP8)
- Enabling tensor parallelism
- Closing other GPU processes
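
As a rough illustration, here is how those knobs appear on vLLM's `LLM` constructor; the model name and values are placeholders, not recommendations:

```python
from vllm import LLM

# Hypothetical values; tune for your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # placeholder model
    gpu_memory_utilization=0.90,      # fraction of GPU memory vLLM may use
    max_model_len=4096,               # cap on sequence length (shrinks the KV cache)
    quantization="fp8",               # or "gptq" / "awq", if the checkpoint supports it
    tensor_parallel_size=2,           # shard the model across 2 GPUs
)
```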

### 3. Enhanced Logging

- Detailed initialization information logged at startup
- Memory usage statistics
- Model configuration details
- Progress indicators for the different initialization phases

## New Error Classes

### `InsufficientMemoryError`

Raised when there's not enough GPU memory to load the model.

```python
InsufficientMemoryError: Insufficient GPU memory to load the model.
Required: 24.50 GiB
Available: 22.30 GiB
Shortage: 2.20 GiB

Suggestions to resolve this issue:
1. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90) - usually the safest first step
2. Consider using quantization (GPTQ, AWQ, FP8) to reduce model memory usage
3. Use tensor parallelism to distribute the model across multiple GPUs
4. Close other GPU processes to free up memory
```

### `InsufficientKVCacheMemoryError`

Raised when there's not enough memory for the KV cache.

```python
InsufficientKVCacheMemoryError: Insufficient memory for KV cache to serve requests.
Required KV cache memory: 8.45 GiB (for max_model_len=4096)
Available KV cache memory: 6.20 GiB
Shortage: 2.25 GiB
Based on available memory, estimated maximum model length: 3000

Suggestions to resolve this issue:
1. Reduce max_model_len from 4096 to 3000 or lower
2. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
3. Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage
4. Use tensor parallelism to distribute the model across multiple GPUs
```

### `ModelLoadingError`

Raised when model loading fails for various reasons.

```python
ModelLoadingError: Failed to load model 'meta-llama/Llama-3.1-8B' during initialization.
Error details: CUDA out of memory. Tried to allocate 2.50 GiB

Suggestions to resolve this issue:
1. The model is too large for available GPU memory
2. Consider using a smaller model or quantization
3. Try tensor parallelism to distribute the model across multiple GPUs
4. Reduce gpu_memory_utilization to leave more memory for CUDA operations
```
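
The definitions of these classes live in the new `initialization_errors.py` (listed under Implementation Details below). As a rough sketch of the shape such a class might take — the base class, field names, and formatting are assumptions, not the committed code:

```python
class InsufficientMemoryError(RuntimeError):
    """Sketch only: not necessarily the committed implementation."""

    def __init__(self, required_gib: float, available_gib: float,
                 suggestions: list[str]):
        shortage = required_gib - available_gib
        lines = [
            "Insufficient GPU memory to load the model.",
            f"Required: {required_gib:.2f} GiB",
            f"Available: {available_gib:.2f} GiB",
            f"Shortage: {shortage:.2f} GiB",
            "",
            "Suggestions to resolve this issue:",
        ]
        # Number the context-specific suggestions, as in the examples above.
        lines += [f"{i}. {s}" for i, s in enumerate(suggestions, 1)]
        super().__init__("\n".join(lines))
```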

## Implementation Details

### Files Modified/Added

1. **`vllm/v1/engine/initialization_errors.py`** (NEW)
   - Contains the new error classes and utility functions
   - Provides suggestion generation based on error context
   - Includes detailed logging functions

2. **`vllm/v1/engine/core.py`** (ENHANCED)
   - Enhanced `_initialize_kv_caches()` method with better error handling
   - Detailed logging of initialization progress
   - Proper exception handling with enhanced error messages

3. **`vllm/v1/core/kv_cache_utils.py`** (ENHANCED)
   - Updated `check_enough_kv_cache_memory()` to use the new error classes (see the sketch after this list)
   - Better error messages with specific suggestions

4. **`vllm/v1/worker/gpu_worker.py`** (ENHANCED)
   - Enhanced memory checking in `init_device()`
   - Better error handling in `load_model()` and `determine_available_memory()`
   - More detailed memory profiling error handling

5. **`vllm/v1/engine/llm_engine.py`** (ENHANCED)
   - Enhanced `__init__()` method with comprehensive error handling
   - Better error messages for tokenizer and processor initialization
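
For illustration, the updated check might raise the new error along these lines. This is a hedged sketch, not the committed diff; the argument names and signature are assumptions:

```python
from vllm.v1.engine.initialization_errors import InsufficientKVCacheMemoryError

def check_enough_kv_cache_memory(required_bytes: int,
                                 available_bytes: int,
                                 max_model_len: int) -> None:
    """Sketch: raise InsufficientKVCacheMemoryError with actionable context."""
    if required_bytes <= available_bytes:
        return
    GiB = 1024**3
    raise InsufficientKVCacheMemoryError(
        f"Insufficient memory for KV cache to serve requests.\n"
        f"Required KV cache memory: {required_bytes / GiB:.2f} GiB "
        f"(for max_model_len={max_model_len})\n"
        f"Available KV cache memory: {available_bytes / GiB:.2f} GiB\n"
        f"Shortage: {(required_bytes - available_bytes) / GiB:.2f} GiB"
    )
```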

### Error Handling Strategy

The enhancement follows a layered approach:

1. **Low-level functions** (workers, memory profiling) catch specific errors and provide context
2. **Mid-level functions** (core engine, KV cache utils) add domain-specific suggestions
3. **High-level functions** (LLM engine) provide user-friendly error aggregation

Each layer adds value while preserving the original error context through exception chaining.
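
In Python, that chaining is the standard `raise ... from ...` pattern. A minimal sketch of how a mid-level layer might wrap a low-level failure — the loader function here is a stand-in, not vLLM code:

```python
import torch
from vllm.v1.engine.initialization_errors import ModelLoadingError

def _load_weights(model_name: str) -> None:
    # Stand-in for the real low-level loader, which would move weights to GPU.
    raise torch.cuda.OutOfMemoryError("CUDA out of memory. Tried to allocate 2.50 GiB")

def load_model_with_context(model_name: str) -> None:
    """Illustrative mid-level wrapper: re-raise with suggestions attached."""
    try:
        _load_weights(model_name)
    except torch.cuda.OutOfMemoryError as e:
        raise ModelLoadingError(
            f"Failed to load model '{model_name}' during initialization.\n"
            f"Error details: {e}"
        ) from e  # keeps the original error reachable as __cause__
```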

## Usage Examples

### Basic Usage

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_memory_utilization=0.95,
        max_model_len=8192
    )
except Exception as e:
    print(f"Initialization failed: {e}")
    # The error message will include specific suggestions
```

### Advanced Error Handling

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import (
    InsufficientMemoryError,
    InsufficientKVCacheMemoryError,
    ModelLoadingError
)

try:
    llm = LLM(model="large-model", gpu_memory_utilization=0.9)
except InsufficientMemoryError as e:
    print(f"Memory issue: {e}")
    # Handle memory-specific errors
except InsufficientKVCacheMemoryError as e:
    print(f"KV cache issue: {e}")
    # Handle KV cache-specific errors
except ModelLoadingError as e:
    print(f"Model loading issue: {e}")
    # Handle model loading errors
```

## Testing

Run the demo script to see the enhanced error handling in action:

```bash
python enhanced_error_demo.py
```

This script intentionally triggers various error conditions to demonstrate the improved error messages and suggestions.
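
For example, one way a demo can force the KV cache check to fail is to request a context length far beyond what fits. A sketch of such a trigger — the demo's actual contents are not reproduced here, and the model name is a placeholder:

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import InsufficientKVCacheMemoryError

try:
    # Deliberately oversized context to provoke the KV cache memory check.
    LLM(model="meta-llama/Llama-3.1-8B", max_model_len=1_000_000)
except InsufficientKVCacheMemoryError as e:
    print(f"Triggered as expected:\n{e}")
```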

## Benefits

1. **Faster Debugging** - Users can quickly understand what went wrong
2. **Self-Service Resolution** - Clear suggestions help users fix issues independently
3. **Better Support Experience** - More detailed error reports improve support quality
4. **Reduced Trial-and-Error** - Specific suggestions reduce the need for guesswork

## Backward Compatibility

The enhancement is fully backward compatible:

- Existing error handling code continues to work
- The new error classes inherit from standard Python exceptions
- Original error messages are preserved in the error chain
- No breaking changes to existing APIs
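
Concretely, a pre-existing broad handler still catches the new errors, and the original low-level exception remains reachable via `__cause__`. A small sketch, assuming (as the list above states) that the classes subclass a standard exception:

```python
from vllm import LLM

try:
    llm = LLM(model="large-model")  # may raise one of the new error classes
except Exception as e:              # legacy handlers keep working
    print(f"Initialization failed: {e}")
    if e.__cause__ is not None:
        # The original error is preserved by exception chaining.
        print(f"Root cause: {e.__cause__!r}")
```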

## Future Enhancements

Potential areas for further improvement:

1. Error handling for distributed setup issues
2. Enhanced logging for multimodal model initialization
3. Better error messages for quantization setup
4. Integration with monitoring/telemetry systems
