Date: 2023-12-08
Topic: Fine-grained exception handling and circuit breakers
The system was exposing internal error details to users and failing ungracefully when the VL model service had issues.
- Information Exposure: Stack traces and system paths in error responses
- Broad Exception Catching: Single try-except for entire operations
- No Resilience: Model failures caused cascading problems
Example of exposed error:
Error code: 400 - {'detail': '[address=0.0.0.0:28264, pid=8336] Model not found, uid: qwen2-vl-instruct-0'}
Create a hierarchy of errors with appropriate handling:
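
    import logging

    logger = logging.getLogger(__name__)


    class BaseError(Exception):
        def __init__(self, message: str, code: int = 400):
            super().__init__(message)  # keeps str(error) and tracebacks informative
            self.message = message
            self.code = code


    class BusinessError(BaseError):
        """User-caused errors (bad input)"""


    # Note: this name shadows Python's built-in SystemError inside this module.
    class SystemError(BaseError):
        """Infrastructure errors (model down)"""


    class ValidationError(BusinessError):
        pass


    class ModelError(SystemError):
        pass


    async def handle_error(error: Exception) -> dict:
        # Business errors: the message was written for the user, so return it
        if isinstance(error, BusinessError):
            return {"code": error.code, "message": error.message}
        # System errors: log details, return a generic message
        if isinstance(error, SystemError):
            logger.error(f"System error: {error.message}")
            return {"code": 500, "message": "Service temporarily unavailable"}
        # Unknown errors: log details, return generic message.
        # exc_info=error attaches the traceback even outside an except block,
        # where traceback.format_exc() would have nothing to format.
        logger.error(f"Unknown error: {error}", exc_info=error)
        return {"code": 500, "message": "System exception"}

Protect against cascading failures from the model service: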
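
    import time


    class CircuitOpenError(Exception):
        """Raised when a call is rejected because the circuit is open."""


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30):
            self.failure_threshold = failure_threshold
            self.recovery_timeout = recovery_timeout  # seconds
            self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
            self.failure_count = 0
            self.last_failure = 0.0  # time of the most recent failure

        def record_failure(self):
            self.failure_count += 1
            self.last_failure = time.time()
            # A failed trial call in HALF_OPEN reopens the circuit immediately
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"

        def reset(self):
            self.state = "CLOSED"
            self.failure_count = 0

        async def execute(self, func, *args, **kwargs):
            if self.state == "OPEN":
                if time.time() - self.last_failure > self.recovery_timeout:
                    self.state = "HALF_OPEN"  # let one trial call through
                else:
                    raise CircuitOpenError()
            try:
                result = await func(*args, **kwargs)
                if self.state == "HALF_OPEN":
                    self.reset()
                return result
            except Exception:
                self.record_failure()
                raise

Validate uploads in stages (existence, type, size, content):

    from fastapi import UploadFile  # assumption: UploadFile comes from FastAPI

    # SUPPORTED_TYPES and check_file_size are defined elsewhere


    async def validate_file(file: UploadFile):
        # Raise ValidationError rather than ValueError so that handle_error
        # treats these as user-caused and returns the message to the user.
        # 1. Existence
        if not file:
            raise ValidationError("No file uploaded")
        # 2. Type
        if file.content_type not in SUPPORTED_TYPES:
            raise ValidationError(f"Unsupported type: {file.content_type}")
        # 3. Size
        await check_file_size(file)
        # 4. Content
        content = await file.read()
        if not content:
            raise ValidationError("Empty file")
        file.file.seek(0)  # rewind so later readers see the whole file
        return content

Error handling is often an afterthought, but today showed why it matters. Users seeing internal stack traces is both a security issue and a poor experience.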
The key principle: log everything internally, expose nothing sensitive externally. System errors should be logged with full details for debugging, but users should see generic "something went wrong" messages.
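A minimal sketch of that principle, using the leaked model error from earlier as input. The function name `to_public_response`, the logger name, and the short `ref` ID are assumptions for illustration, not part of the original code. The `ref` is one possible way to let a user report a failure that can then be matched to the internal log entry.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")  # hypothetical logger name


def to_public_response(error: Exception) -> dict:
    """Log full details internally; return only a generic message and a reference ID."""
    reference = uuid.uuid4().hex[:8]  # hypothetical short ID to correlate with the log
    # Internal side: full message plus traceback (if the exception carries one)
    logger.error("ref=%s unhandled error: %r", reference, error, exc_info=error)
    # External side: nothing derived from the exception itself
    return {
        "code": 500,
        "message": "Service temporarily unavailable",
        "ref": reference,
    }


# The raw error text contains an address, a pid and a model uid
raw = RuntimeError(
    "[address=0.0.0.0:28264, pid=8336] Model not found, uid: qwen2-vl-instruct-0"
)
response = to_public_response(raw)
print(response["code"], response["message"])
```

The response is built only from constants and the generated ID, so no part of the exception text can reach the user by accident.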
The circuit breaker was new to me. It's a pattern from distributed systems: when a downstream service is failing, stop calling it for a while to give it time to recover. This keeps the rest of the system from getting stuck waiting on a dead service.
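To make the state transitions concrete, here is a small self-contained walkthrough. It is only a sketch: `DemoBreaker`, `failing_model` and `healthy_model` are hypothetical names, and the threshold and timeout are shrunk so the demo finishes in a fraction of a second.

```python
import asyncio
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class DemoBreaker:
    """Compact breaker for illustration only (hypothetical, not production code)."""

    def __init__(self, failure_threshold=2, recovery_timeout=0.05):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure = 0.0

    async def execute(self, func):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure > self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one trial call
            else:
                raise CircuitOpenError("model service circuit is open")
        try:
            result = await func()
        except Exception:
            self.failure_count += 1
            self.last_failure = time.monotonic()
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and clears the failure count
        self.state, self.failure_count = "CLOSED", 0
        return result


async def failing_model():
    raise ConnectionError("model service down")


async def healthy_model():
    return "ok"


async def demo():
    breaker = DemoBreaker()
    states = []
    for _ in range(2):  # two failures reach the threshold
        try:
            await breaker.execute(failing_model)
        except ConnectionError:
            pass
    states.append(breaker.state)
    try:
        await breaker.execute(healthy_model)  # rejected without calling the service
    except CircuitOpenError:
        states.append("REJECTED")
    await asyncio.sleep(0.1)  # wait past recovery_timeout
    result = await breaker.execute(healthy_model)  # trial call succeeds
    states.append(breaker.state)
    return states, result


states, result = asyncio.run(demo())
print(states, result)
```

The rejected call in the middle is the point of the pattern: while the circuit is open, callers fail immediately instead of waiting on a service that is known to be down.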
- Circuit breaker implementations (pybreaker)
- Structured logging with correlation IDs
- Retry patterns with exponential backoff
- Health check endpoints