Error Handling and System Robustness

Date: 2023-12-08
Topic: Fine-grained exception handling and circuit breakers


Background

The system was exposing internal error details to users and failing ungracefully when the VL model service had issues.


The Problems

  1. Information Exposure: Stack traces and system paths in error responses
  2. Broad Exception Catching: Single try-except for entire operations
  3. No Resilience: Model failures caused cascading problems

Example of exposed error:

Error code: 400 - {'detail': '[address=0.0.0.0:28264, pid=8336] Model not found, uid: qwen2-vl-instruct-0'}

Error Classification

Create an exception hierarchy that separates user-caused errors from infrastructure errors, so each kind can be handled differently:

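class BaseError(Exception):
    """Root of the application's error hierarchy."""

    def __init__(self, message: str, code: int = 400):
        super().__init__(message)  # keeps str(error) and tracebacks meaningful
        self.message = message
        self.code = code

class BusinessError(BaseError):
    """User-caused errors (bad input)"""
    pass

# Note: this name shadows Python's built-in SystemError within this module
class SystemError(BaseError):
    """Infrastructure errors (model down)"""
    pass

class ValidationError(BusinessError):
    pass

class ModelError(SystemError):
    pass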

Unified Error Handler

import logging

logger = logging.getLogger(__name__)

async def handle_error(error: Exception) -> dict:
    if isinstance(error, BusinessError):
        # Safe to show: the message describes the user's own input
        return {"code": error.code, "message": error.message}

    if isinstance(error, SystemError):
        # Log the detail internally, return a generic message
        logger.error(f"System error: {error.message}")
        return {"code": 500, "message": "Service temporarily unavailable"}

    # Unknown errors: log details with the traceback, return generic message
    logger.error(f"Unknown error: {error}", exc_info=error)
    return {"code": 500, "message": "System exception"}

Circuit Breaker Pattern

Protect against cascading failures from the model service:

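import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.failure_count = 0
        self.last_failure = 0.0  # monotonic timestamp of the latest failure

    def reset(self):
        self.state = "CLOSED"
        self.failure_count = 0

    def record_failure(self):
        # Assumed behaviour: open once failure_threshold failures accumulate,
        # or immediately if the trial call in HALF_OPEN fails
        self.failure_count += 1
        self.last_failure = time.monotonic()
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    async def execute(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure > self.recovery_timeout:
                self.state = "HALF_OPEN"  # let one trial call through
            else:
                raise CircuitOpenError()

        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        if self.state == "HALF_OPEN":
            self.reset()
        return result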

File Validation Pipeline

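async def validate_file(file: UploadFile) -> bytes:
    # Raise ValidationError (a BusinessError) so the unified handler
    # returns the message to the user instead of a generic 500
    # 1. Existence
    if not file:
        raise ValidationError("No file uploaded")

    # 2. Type
    if file.content_type not in SUPPORTED_TYPES:
        raise ValidationError(f"Unsupported type: {file.content_type}")

    # 3. Size
    await check_file_size(file)

    # 4. Content
    content = await file.read()
    if not content:
        raise ValidationError("Empty file")

    # Rewind so later readers of the upload start from the beginning
    file.file.seek(0)
    return content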
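
Step 3 delegates to check_file_size, which is not shown in these notes. A minimal sketch of what it might look like, assuming the upload object exposes its underlying binary stream as .file (as in the pipeline above) and assuming a hypothetical MAX_FILE_SIZE limit of 10 MiB. It raises ValueError to stay self-contained; in the service it would raise whichever user-facing error class the rest of the pipeline uses.

```python
import asyncio
import io
import os
from types import SimpleNamespace

MAX_FILE_SIZE = 10 * 1024 * 1024  # hypothetical 10 MiB limit


async def check_file_size(file, max_bytes: int = MAX_FILE_SIZE) -> int:
    """Raise ValueError if the upload exceeds max_bytes; return its size."""
    stream = file.file              # underlying binary file object
    position = stream.tell()        # remember where the caller was
    stream.seek(0, os.SEEK_END)     # jump to the end to measure the size
    size = stream.tell()
    stream.seek(position)           # restore the original position
    if size > max_bytes:
        raise ValueError(f"File too large: {size} bytes (limit {max_bytes})")
    return size


# Stand-in for an upload object: only the .file attribute is needed here
fake_upload = SimpleNamespace(file=io.BytesIO(b"x" * 2048))
print(asyncio.run(check_file_size(fake_upload, max_bytes=4096)))  # 2048
```

Measuring with seek and tell avoids reading the whole file into memory just to reject it, and restoring the position keeps the later file.read() in step 4 unaffected.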

Today's Reflection

Error handling is often an afterthought, but today showed why it matters. Users seeing internal stack traces is both a security issue and a poor experience.

The key principle: log everything internally, expose nothing sensitive externally. System errors should be logged with full details for debugging, but users should see generic "something went wrong" messages.
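
One way to make that principle practical is to attach a short reference ID to both the log entry and the response, so a user report can be matched to the full internal detail. The following is only a sketch: to_public_error, the error_id field, and the logger name are hypothetical and not part of the handler described earlier.

```python
import logging
import uuid

logger = logging.getLogger("error_demo")  # hypothetical logger name


def to_public_error(error: Exception) -> dict:
    """Log full details internally; return only a generic message and an ID."""
    error_id = uuid.uuid4().hex[:12]  # short reference the user can quote
    # Full detail (message and any traceback) stays in the server log
    logger.error("error_id=%s detail=%s", error_id, error, exc_info=error)
    return {
        "code": 500,
        "message": "Service temporarily unavailable",
        "error_id": error_id,
    }


# The leaked message from the example earlier in these notes
raw = RuntimeError(
    "[address=0.0.0.0:28264, pid=8336] Model not found, uid: qwen2-vl-instruct-0"
)
response = to_public_error(raw)
print(response["message"])  # Service temporarily unavailable
```

The address, pid, and model uid appear only in the log line; the response carries nothing but the generic message and the ID.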

The circuit breaker was new to me. It's a pattern from distributed systems - when a downstream service is failing, stop trying to call it for a while to let it recover. This prevents your entire system from getting stuck waiting for a dead service.
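
To see the state transitions without waiting on a real timeout, here is a simplified, synchronous sketch with an injectable clock. SimpleBreaker, call_model, and the fake clock are illustrative names made up for this demo, and the threshold of 2 is chosen only to keep the trace short.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class SimpleBreaker:
    """Synchronous, simplified breaker with an injectable clock for testing."""

    def __init__(self, clock, failure_threshold=2, recovery_timeout=30):
        self.clock = clock
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError()   # fail fast, do not touch the service
            self.state = "HALF_OPEN"       # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.state = "CLOSED"              # any success closes the circuit
        self.failure_count = 0
        return result


now = [0.0]                      # fake clock so the demo needs no real waiting
breaker = SimpleBreaker(clock=lambda: now[0])
healthy = [False]                # toggles the fake model service on and off


def call_model():
    if not healthy[0]:
        raise ConnectionError("model service down")
    return "ok"


states = []
for _ in range(3):
    try:
        breaker.call(call_model)
    except (ConnectionError, CircuitOpenError) as exc:
        states.append((breaker.state, type(exc).__name__))

now[0] += 31                     # recovery timeout passes
healthy[0] = True                # and the service has come back
states.append((breaker.state, breaker.call(call_model)))
print(states)
```

The third call fails with CircuitOpenError without ever reaching call_model, which is the point of the pattern: callers get an immediate error instead of waiting on a dead service.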


Further Learning

  • Circuit breaker implementations (pybreaker)
  • Structured logging with correlation IDs
  • Retry patterns with exponential backoff
  • Health check endpoints