[Feature] Improve logging for error messages #22599

209 changes: 209 additions & 0 deletions ENHANCED_ERROR_HANDLING_README.md
@@ -0,0 +1,209 @@
# Enhanced Error Handling for vLLM V1 Initialization

This enhancement provides improved error handling and logging for common initialization errors in vLLM V1, making it easier for users to diagnose and resolve issues.

## Overview

The enhanced error handling addresses the most common initialization problems:

1. **Insufficient GPU Memory** - When the model is too large for available GPU memory
2. **Insufficient KV Cache Memory** - When there's not enough memory for the KV cache given the configured `max_model_len`
3. **Model Loading Errors** - When model files can't be loaded or are incompatible
4. **CUDA Errors** - When CUDA-related issues occur during initialization

## Key Features

### 1. Detailed Error Messages

Instead of generic error messages, users now get:

- Clear descriptions of what went wrong
- Specific memory requirements vs. available memory
- Estimated maximum model lengths based on available memory
- Context about where the error occurred (model loading, KV cache, etc.)

### 2. Actionable Suggestions

Each error provides specific suggestions like:

- Adjusting `gpu_memory_utilization`
- Reducing `max_model_len`
- Using quantization (GPTQ, AWQ, FP8)
- Enabling tensor parallelism
- Closing other GPU processes
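
All of these correspond to standard `LLM` constructor arguments. As a rough illustration (the model name and values below are placeholders, not tuned recommendations), several of the suggestions applied together look like this:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # placeholder model name
    gpu_memory_utilization=0.90,      # give vLLM a larger share of GPU memory
    max_model_len=3000,               # cap context length to shrink the KV cache
    tensor_parallel_size=2,           # split weights across two GPUs
    quantization="awq",               # only valid with AWQ-quantized weights
)
```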

### 3. Enhanced Logging

- Detailed initialization information logged at startup
- Memory usage statistics
- Model configuration details
- Progress indicators for different initialization phases
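
A minimal sketch of the kind of startup logging described above, assuming a hypothetical helper (`log_initialization_details` is illustrative, not the PR's actual function or log format):

```python
import logging

logger = logging.getLogger("vllm")

def log_initialization_details(model: str, total_gib: float,
                               free_gib: float, max_model_len: int) -> None:
    # Illustrative only: logs the categories of information listed above.
    logger.info("Initializing model %s", model)
    logger.info("GPU memory: %.2f GiB total, %.2f GiB free", total_gib, free_gib)
    logger.info("Model configuration: max_model_len=%d", max_model_len)
    logger.info("Starting phase: model loading")
```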

## New Error Classes

### `InsufficientMemoryError`

Raised when there's not enough GPU memory to load the model.

```text
InsufficientMemoryError: Insufficient GPU memory to load the model.
Required: 24.50 GiB
Available: 22.30 GiB
Shortage: 2.20 GiB

Suggestions to resolve this issue:
1. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
2. Consider using quantization (GPTQ, AWQ, FP8) to reduce model memory usage
3. Use tensor parallelism to distribute the model across multiple GPUs
4. Close other GPU processes to free up memory
```
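
A minimal sketch of how such a class could be structured (the base class and field names are assumptions; the PR's `initialization_errors.py` may differ):

```python
class InsufficientMemoryError(RuntimeError):
    """Raised when the model does not fit in available GPU memory."""

    def __init__(self, required_gib: float, available_gib: float,
                 suggestions: list[str]) -> None:
        shortage = required_gib - available_gib
        lines = [
            "Insufficient GPU memory to load the model.",
            f"Required: {required_gib:.2f} GiB",
            f"Available: {available_gib:.2f} GiB",
            f"Shortage: {shortage:.2f} GiB",
            "",
            "Suggestions to resolve this issue:",
        ]
        lines += [f"{i}. {s}" for i, s in enumerate(suggestions, 1)]
        super().__init__("\n".join(lines))
```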

### `InsufficientKVCacheMemoryError`

Raised when there's not enough memory for the KV cache.

```text
InsufficientKVCacheMemoryError: Insufficient memory for KV cache to serve requests.
Required KV cache memory: 8.45 GiB (for max_model_len=4096)
Available KV cache memory: 6.20 GiB
Shortage: 2.25 GiB
Based on available memory, estimated maximum model length: 3000

Suggestions to resolve this issue:
1. Reduce max_model_len from 4096 to 3000 or lower
2. Increase gpu_memory_utilization from 0.80 (e.g., to 0.90)
3. Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage
4. Use tensor parallelism to distribute the model across multiple GPUs
```
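
The estimated maximum model length follows from the fact that KV cache memory scales roughly linearly with context length, so the estimate is approximately `available / required * max_model_len`. A back-of-the-envelope check of the numbers above (the round-down policy is an assumption):

```python
# available / required * current max_model_len
estimated = 6.20 / 8.45 * 4096       # ≈ 3005 tokens
print(int(estimated // 100) * 100)   # -> 3000, matching the message
```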

### `ModelLoadingError`

Raised when model loading fails for various reasons.

```text
ModelLoadingError: Failed to load model 'meta-llama/Llama-3.1-8B' during initialization.
Error details: CUDA out of memory. Tried to allocate 2.50 GiB

Suggestions to resolve this issue:
1. The model is too large for available GPU memory
2. Consider using a smaller model or quantization
3. Try tensor parallelism to distribute the model across multiple GPUs
4. Reduce gpu_memory_utilization to leave more memory for CUDA operations
```

## Implementation Details

### Files Modified/Added

1. **`vllm/v1/engine/initialization_errors.py`** (NEW)
- Contains the new error classes and utility functions
- Provides suggestion generation based on error context (sketched after this list)
- Includes detailed logging functions

2. **`vllm/v1/engine/core.py`** (ENHANCED)
- Enhanced `_initialize_kv_caches()` method with better error handling
- Detailed logging of initialization progress
- Proper exception handling with enhanced error messages

3. **`vllm/v1/core/kv_cache_utils.py`** (ENHANCED)
- Updated `check_enough_kv_cache_memory()` to use new error classes
- Better error messages with specific suggestions

4. **`vllm/v1/worker/gpu_worker.py`** (ENHANCED)
- Enhanced memory checking in `init_device()`
- Better error handling in `load_model()` and `determine_available_memory()`
- More detailed memory profiling error handling

5. **`vllm/v1/engine/llm_engine.py`** (ENHANCED)
- Enhanced `__init__()` method with comprehensive error handling
- Better error messages for tokenizer and processor initialization
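
The suggestion generation mentioned for `initialization_errors.py` might look roughly like the following sketch (function name and signature are hypothetical):

```python
from typing import List, Optional

def generate_memory_suggestions(
    gpu_memory_utilization: float,
    max_model_len: Optional[int] = None,
    estimated_max_len: Optional[int] = None,
) -> List[str]:
    """Hypothetical sketch of context-aware suggestion generation."""
    suggestions = []
    if max_model_len is not None and estimated_max_len is not None:
        suggestions.append(
            f"Reduce max_model_len from {max_model_len} "
            f"to {estimated_max_len} or lower")
    if gpu_memory_utilization < 0.95:
        suggestions.append(
            f"Increase gpu_memory_utilization from {gpu_memory_utilization:.2f} "
            "(e.g., to 0.90)")
    suggestions.append(
        "Consider using quantization (GPTQ, AWQ, FP8) to reduce memory usage")
    suggestions.append(
        "Use tensor parallelism to distribute the model across multiple GPUs")
    return suggestions
```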

### Error Handling Strategy

The enhancement follows a layered approach:

1. **Low-level functions** (workers, memory profiling) catch specific errors and provide context
2. **Mid-level functions** (core engine, KV cache utils) add domain-specific suggestions
3. **High-level functions** (LLM engine) provide user-friendly error aggregation

Each layer adds value while preserving the original error context through exception chaining.
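
In code terms, the layering relies on standard `raise ... from ...` chaining. A simplified schematic (not the actual vLLM code; `_load_weights` is a placeholder for the real loading step):

```python
class ModelLoadingError(RuntimeError):
    """Domain-specific wrapper (simplified from the PR's version)."""

def _load_weights(model: str) -> None:
    # Stand-in for the low-level step that fails with a raw error.
    raise RuntimeError("CUDA out of memory. Tried to allocate 2.50 GiB")

def load_model(model: str) -> None:
    try:
        _load_weights(model)  # low level: raw failure
    except RuntimeError as e:
        # Mid level: wrap in a domain error with context; `from e`
        # preserves the original traceback for the high level.
        raise ModelLoadingError(
            f"Failed to load model '{model}' during initialization.\n"
            f"Error details: {e}") from e
```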

## Usage Examples

### Basic Usage

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_memory_utilization=0.95,
        max_model_len=8192,
    )
except Exception as e:
    print(f"Initialization failed: {e}")
    # Error message will include specific suggestions
```

### Advanced Error Handling

```python
from vllm import LLM
from vllm.v1.engine.initialization_errors import (
    InsufficientMemoryError,
    InsufficientKVCacheMemoryError,
    ModelLoadingError,
)

try:
    llm = LLM(model="large-model", gpu_memory_utilization=0.9)
except InsufficientMemoryError as e:
    print(f"Memory issue: {e}")
    # Handle memory-specific errors
except InsufficientKVCacheMemoryError as e:
    print(f"KV cache issue: {e}")
    # Handle KV cache-specific errors
except ModelLoadingError as e:
    print(f"Model loading issue: {e}")
    # Handle model loading errors
```

## Testing

Run the demo script to see the enhanced error handling in action:

```bash
python enhanced_error_demo.py
```

This script intentionally triggers various error conditions to demonstrate the improved error messages and suggestions.
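
To provoke one of these errors by hand, independent of the demo script, an intentionally undersized memory budget is one plausible trigger (model name is a placeholder):

```python
from vllm import LLM

# Deliberately starve the engine of memory to surface the enhanced error.
llm = LLM(model="meta-llama/Llama-3.1-8B", gpu_memory_utilization=0.05)
```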

## Benefits

1. **Faster Debugging** - Users can quickly understand what went wrong
2. **Self-Service Resolution** - Clear suggestions help users fix issues independently
3. **Better Support Experience** - More detailed error reports improve support quality
4. **Reduced Trial-and-Error** - Specific suggestions reduce the need for guesswork

## Backward Compatibility

The enhancement is fully backward compatible:

- Existing error handling code continues to work
- New error classes inherit from standard Python exceptions
- Original error messages are preserved in the error chain
- No breaking changes to existing APIs

## Future Enhancements

Potential areas for further improvement:

1. Add error handling for distributed setup issues
2. Enhanced logging for multimodal model initialization
3. Better error messages for quantization setup
4. Integration with monitoring/telemetry systems