Problem Statement
ClearML Serving currently lacks standard health check endpoints (/health, /readiness, /liveness, /metrics), making it difficult to properly monitor and orchestrate in production environments. The current Docker healthcheck only verifies that the web server is responding by checking the /docs endpoint, which doesn't indicate if models are actually loaded and ready to serve requests.
Current Limitations
- No service health verification - Can't determine if the service is truly ready to handle traffic
- No model readiness checks - No way to verify if models are properly loaded
- No standardized monitoring - Difficult to integrate with standard monitoring tools
- Limited orchestration support - Kubernetes, ECS, etc. can't properly manage the service lifecycle
Proposed Solution
Add the following standard endpoints:
1. GET /health
- Purpose: Basic service health check
- Response: 200 OK with service status, version, and timestamp
- Example:
{ "status": "healthy", "service": "clearml-serving", "version": "1.5.0", "timestamp": 1729700000.0 }
2. GET /readiness
- Purpose: Verify service is ready to accept traffic
- Checks:
  - ModelRequestProcessor is initialized
  - At least one model is loaded
  - GPU is accessible (if GPU support is enabled)
- Response: 200 OK if ready, 503 Service Unavailable if not
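One way to combine the readiness checks listed above into a single status code is sketched below. This is only an illustration: `evaluate_readiness` and the check names are hypothetical, and the lambdas stand in for whatever probes ModelRequestProcessor would actually expose.

```python
from typing import Callable, Dict, Tuple


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each named check and map the combined result to an HTTP status code."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as failed instead of crashing the probe.
            results[name] = False
    ready = all(results.values())
    status_code = 200 if ready else 503
    body = {"status": "ready" if ready else "not_ready", "checks": results}
    return status_code, body


# Hypothetical stand-ins for the real probes (assumptions, not ClearML APIs).
loaded_models = ["model_a"]
example_checks = {
    "processor_initialized": lambda: True,
    "model_loaded": lambda: len(loaded_models) > 0,
}
status_code, body = evaluate_readiness(example_checks)
```

Returning the per-check results in the body keeps the 503 case debuggable: an operator can see which check failed without reading the service logs.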
3. GET /liveness
- Purpose: Simple check if service is running
- Response: 200 OK with minimal overhead
4. GET /metrics (Optional)
- Purpose: Service metrics in Prometheus format
- Includes:
  - Uptime
  - Request counts
  - Model loading status
  - GPU metrics (if available)
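As a rough sketch of what the /metrics body could look like, the function below renders uptime, request count, and per-model loading status in the Prometheus text exposition format. The metric names and the `render_metrics` helper are hypothetical, and the sketch assumes model names contain no characters that need escaping in label values.

```python
import time
from typing import Dict, Optional

START_TIME = time.time()  # recorded once at process start


def render_metrics(
    request_count: int,
    models_loaded: Dict[str, bool],
    now: Optional[float] = None,
) -> str:
    """Render a few gauges and a counter in Prometheus text exposition format."""
    now = time.time() if now is None else now
    lines = [
        "# HELP clearml_serving_uptime_seconds Seconds since the process started.",
        "# TYPE clearml_serving_uptime_seconds gauge",
        f"clearml_serving_uptime_seconds {now - START_TIME:.3f}",
        "# HELP clearml_serving_requests_total Total requests handled.",
        "# TYPE clearml_serving_requests_total counter",
        f"clearml_serving_requests_total {request_count}",
        "# HELP clearml_serving_model_loaded 1 if the model is loaded, 0 otherwise.",
        "# TYPE clearml_serving_model_loaded gauge",
    ]
    # One sample per model, sorted so the output is stable between scrapes.
    for name, loaded in sorted(models_loaded.items()):
        lines.append(f'clearml_serving_model_loaded{{model="{name}"}} {int(loaded)}')
    return "\n".join(lines) + "\n"
```

In a real service a Prometheus client library would normally handle escaping and content types; the hand-rolled version here only shows the shape of the output.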
Benefits
- Better Monitoring: Standard endpoints for monitoring tools
- Improved Reliability: Better container orchestration
- Easier Debugging: Clear service state visibility
- Standard Compliance: Follows cloud-native best practices
Implementation Details
- Add endpoints to clearml_serving/serving/main.py
- Extend ModelRequestProcessor with status methods
- Update Docker healthcheck to use /health
- Add documentation
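For the Docker healthcheck item, a small stdlib-only probe script is one option. The sketch below is an assumption about how it might look: `is_healthy` and `probe_exit_code` are hypothetical names, and the URL passed in (for example `http://localhost:8080/health`) depends on the port the container actually serves on.

```python
import http.client
import json
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers 200 with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            payload = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection errors, timeouts, non-2xx responses, and bad JSON all mean "unhealthy".
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def probe_exit_code(url: str, timeout: float = 2.0) -> int:
    """Map the probe result to the exit codes a Docker HEALTHCHECK expects."""
    return 0 if is_healthy(url, timeout=timeout) else 1
```

A wrapper script would call `sys.exit(probe_exit_code(...))`; Docker treats exit code 0 as healthy and 1 as unhealthy.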
Example Implementation
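# In clearml_serving/serving/main.py
import time

from fastapi import HTTPException

# `app`, `processor`, and `__version__` are assumed to be defined elsewhere in main.py.
# `get_loaded_models()` is the proposed status method, not an existing API.

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "service": "clearml-serving",
        "version": __version__,
        "timestamp": time.time()
    }

@app.get("/readiness")
async def readiness_check():
    # Not ready until the processor exists and reports at least one loaded model.
    if not processor or not processor.get_loaded_models():
        raise HTTPException(status_code=503, detail="Service not ready")
    return {"status": "ready"}

Note: This issue was created based on production experience with ClearML Serving at Sunrise Robotics. We're happy to contribute this feature if the maintainers agree with the approach.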