Error Handling and System Robustness

Date: 2023-12-08
Topic: Fine-grained exception handling and circuit breakers

Background

The system was exposing internal error details to users and failing ungracefully when the VL model service had issues.

The Problems

Information Exposure: Stack traces and system paths in error responses
Broad Exception Catching: Single try-except for entire operations
No Resilience: Model failures caused cascading problems

Example of exposed error:

Error code: 400 - {'detail': '[address=0.0.0.0:28264, pid=8336] Model not found, uid: qwen2-vl-instruct-0'}

Error Classification

Create a hierarchy of errors with appropriate handling:

class BaseError(Exception):
    def __init__(self, message: str, code: int = 400):
        super().__init__(message)  # so str(error) and tracebacks carry the message
        self.message = message
        self.code = code

class BusinessError(BaseError):
    """User-caused errors (bad input)"""
    pass

class SystemError(BaseError):
    """Infrastructure errors (model down).

    Note: this name shadows Python's built-in SystemError in this module.
    """
    pass

class ValidationError(BusinessError):
    pass

class ModelError(SystemError):
    pass
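
One way to use the hierarchy is to replace a single broad try-except with narrow ones that translate low-level failures into these classes. The sketch below is only an illustration: `call_model` and `describe_image` are hypothetical names that do not appear in the original code, the 503 code is an assumption, and the classes are repeated in compact form so the block runs on its own.

```python
import asyncio


class BaseError(Exception):
    def __init__(self, message: str, code: int = 400):
        super().__init__(message)
        self.message = message
        self.code = code


class BusinessError(BaseError): ...
class SystemError(BaseError): ...  # same name as in the notes; shadows the builtin
class ValidationError(BusinessError): ...
class ModelError(SystemError): ...


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the VL model client; it always fails here."""
    raise ConnectionError("[address=0.0.0.0:28264] connection refused")


async def describe_image(prompt: str) -> str:
    # User-caused problem: reject the input before touching the model.
    if not prompt.strip():
        raise ValidationError("Prompt must not be empty")
    # Infrastructure problem: catch only the failures expected from this call
    # and keep the original exception as __cause__ for the internal logs.
    try:
        return await call_model(prompt)
    except (ConnectionError, TimeoutError) as exc:
        raise ModelError("Model service call failed", code=503) from exc


async def main() -> None:
    for prompt in ["", "what is in this picture?"]:
        try:
            await describe_image(prompt)
        except BaseError as err:
            print(type(err).__name__, err.code, err.message)


asyncio.run(main())
```

The internal address stays on the chained `ConnectionError`, so it can be logged without ever being part of the message that reaches the user.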

Unified Error Handler

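import logging

logger = logging.getLogger(__name__)

async def handle_error(error: Exception) -> dict:
    # User-caused errors: safe to return the specific message
    if isinstance(error, BusinessError):
        return {"code": error.code, "message": error.message}

    # Infrastructure errors: log details, return a generic message
    if isinstance(error, SystemError):
        logger.error("System error: %s", error.message)
        return {"code": 500, "message": "Service temporarily unavailable"}

    # Unknown errors: log details with traceback, return generic message.
    # exc_info=error also works outside an except block, unlike traceback.format_exc()
    logger.error("Unknown error: %s", error, exc_info=error)
    return {"code": 500, "message": "System exception"}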
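
To check what the handler does with the kind of internal detail shown in the exposed error earlier, here is a small self-contained sketch. It repeats compact versions of the classes and the handler; the logger name and the three sample errors are made up for the demonstration.

```python
import asyncio
import logging

logger = logging.getLogger("error_demo")  # hypothetical logger name


class BaseError(Exception):
    def __init__(self, message: str, code: int = 400):
        super().__init__(message)
        self.message = message
        self.code = code


class BusinessError(BaseError): ...
class SystemError(BaseError): ...  # same name as in the notes; shadows the builtin
class ModelError(SystemError): ...


async def handle_error(error: Exception) -> dict:
    if isinstance(error, BusinessError):
        return {"code": error.code, "message": error.message}
    if isinstance(error, SystemError):
        logger.error("System error: %s", error.message)
        return {"code": 500, "message": "Service temporarily unavailable"}
    logger.error("Unknown error: %s", error, exc_info=error)
    return {"code": 500, "message": "System exception"}


async def main():
    errors = [
        BusinessError("Unsupported type: text/plain"),
        ModelError("[address=0.0.0.0:28264, pid=8336] Model not found"),
        KeyError("unexpected"),
    ]
    # Each error goes through the same handler; only the response is returned.
    return [await handle_error(e) for e in errors]


responses = asyncio.run(main())
for response in responses:
    print(response)
```

Only the business error keeps its own message. The address and pid from the model error appear in the log output but not in any response.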

Circuit Breaker Pattern

Protect against cascading failures from the model service:

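import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.failure_count = 0
        self.last_failure = 0.0  # timestamp of the most recent failure

    # record_failure() and reset() are not shown in these notes. Presumably
    # record_failure() updates failure_count and last_failure and opens the
    # circuit at the threshold, and reset() returns it to CLOSED.

    async def execute(self, func, *args, **kwargs):
        if self.state == "OPEN":
            # After the recovery timeout, let one probe call through
            if time.time() - self.last_failure > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError()

        try:
            result = await func(*args, **kwargs)
            if self.state == "HALF_OPEN":
                self.reset()
            return result
        except Exception:
            self.record_failure()
            raise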
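
The state transitions are easier to follow in a complete run. The following is a minimal sketch, not the project's actual implementation: the bodies of `record_failure` and `reset`, the `CircuitOpenError` class, the choice of `time.monotonic()` over `time.time()`, and the `call_model` stand-in are all my assumptions about how the missing pieces could look.

```python
import asyncio
import time


class CircuitOpenError(Exception):
    """Assumed exception type: the call was rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure = 0.0

    def reset(self):  # assumed behavior: close the circuit and clear the count
        self.state = "CLOSED"
        self.failure_count = 0

    def record_failure(self):  # assumed behavior
        self.failure_count += 1
        self.last_failure = time.monotonic()
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    async def execute(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError()
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        if self.state == "HALF_OPEN":
            self.reset()
        return result


service_up = False  # toggled to simulate the model service recovering


async def call_model():
    """Hypothetical downstream call."""
    if not service_up:
        raise ConnectionError("model service unreachable")
    return "ok"


async def main():
    global service_up
    breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=0.2)
    states = []
    for _ in range(3):
        try:
            await breaker.execute(call_model)
        except (ConnectionError, CircuitOpenError) as exc:
            states.append((type(exc).__name__, breaker.state))
    service_up = True
    await asyncio.sleep(0.3)  # wait out the recovery timeout
    result = await breaker.execute(call_model)  # probe call in HALF_OPEN
    states.append((result, breaker.state))
    return states


states = asyncio.run(main())
print(states)
```

With a threshold of 2, the second failure opens the circuit, the third call is rejected without reaching the service, and the successful probe after the timeout closes the circuit again.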

File Validation Pipeline

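# Raises ValidationError (from the hierarchy above) rather than ValueError,
# so handle_error returns the message instead of a generic 500.
async def validate_file(file: UploadFile):  # UploadFile: presumably FastAPI's upload type
    # 1. Existence
    if not file:
        raise ValidationError("No file uploaded")

    # 2. Type (SUPPORTED_TYPES: allowed MIME types, defined elsewhere)
    if file.content_type not in SUPPORTED_TYPES:
        raise ValidationError(f"Unsupported type: {file.content_type}")

    # 3. Size (check_file_size is defined elsewhere)
    await check_file_size(file)

    # 4. Content
    content = await file.read()
    if not content:
        raise ValidationError("Empty file")

    # Rewind so later consumers can read the file again
    file.file.seek(0)
    return content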
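
The notes do not show `check_file_size`, so here is one possible shape for it, purely as a sketch. The 10 MiB limit, the chunk size, and the `FakeUpload` test double are my assumptions; `FakeUpload` only mimics the async `read` and `seek` methods of an upload object so the example runs without a web framework, and `ValidationError` is a stand-in for the class from the hierarchy above.

```python
import asyncio
import io

MAX_FILE_SIZE = 10 * 1024 * 1024  # assumed limit (10 MiB); not from the original
CHUNK_SIZE = 64 * 1024


class ValidationError(Exception):
    """Stand-in for the ValidationError class in the error hierarchy."""


class FakeUpload:
    """Hypothetical test double with async read/seek, like an upload object."""

    def __init__(self, data: bytes):
        self.file = io.BytesIO(data)

    async def read(self, size: int = -1) -> bytes:
        return self.file.read(size)

    async def seek(self, offset: int) -> None:
        self.file.seek(offset)


async def check_file_size(file, max_size: int = MAX_FILE_SIZE) -> int:
    """Read in chunks so an oversized upload is rejected before it is fully read."""
    total = 0
    while chunk := await file.read(CHUNK_SIZE):
        total += len(chunk)
        if total > max_size:
            raise ValidationError(f"File exceeds {max_size} bytes")
    await file.seek(0)  # rewind so the content check can read from the start
    return total


size = asyncio.run(check_file_size(FakeUpload(b"x" * 1000)))
print(size)
```

Counting bytes while reading avoids trusting a client-supplied length and stops early once the limit is crossed.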

Today's Reflection

Error handling is often an afterthought, but today showed why it matters. Users seeing internal stack traces is both a security issue and a poor experience.

The key principle: log everything internally, expose nothing sensitive externally. System errors should be logged with full details for debugging, while users see only a generic message. Business errors can return their specific message, since it describes something the user can fix.

The circuit breaker was new to me. It's a pattern from distributed systems: when a downstream service is failing, stop calling it for a while to let it recover. This prevents the entire system from getting stuck waiting for a dead service.

Further Learning

Circuit breaker implementations (pybreaker)
Structured logging with correlation IDs
Retry patterns with exponential backoff
Health check endpoints
