[Feature Request] Add Standard Health Check Endpoints to ClearML Serving #94

@Edition-X

Problem Statement

ClearML Serving currently lacks standard health check endpoints (/health, /readiness, /liveness, /metrics), which makes the service difficult to monitor and orchestrate in production. The current Docker healthcheck only verifies that the web server responds on the /docs endpoint; it does not indicate whether models are actually loaded and ready to serve requests.

Current Limitations

  1. No service health verification - Can't determine if the service is truly ready to handle traffic
  2. No model readiness checks - No way to verify if models are properly loaded
  3. No standardized monitoring - Difficult to integrate with standard monitoring tools
  4. Limited orchestration support - Kubernetes, ECS, etc. can't properly manage the service lifecycle

Proposed Solution

Add the following standard endpoints:

1. GET /health

  • Purpose: Basic service health check
  • Response: 200 OK with service status, version, and timestamp
  • Example:
    {
      "status": "healthy",
      "service": "clearml-serving",
      "version": "1.5.0",
      "timestamp": 1729700000.0
    }

2. GET /readiness

  • Purpose: Verify service is ready to accept traffic
  • Checks:
    • ModelRequestProcessor is initialized
    • At least one model is loaded
    • GPU is accessible (if GPU support enabled)
  • Response: 200 OK if ready, 503 Service Unavailable if not
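
The readiness checks above could be combined roughly as follows. This is a framework-agnostic sketch, not the proposed implementation: `evaluate_readiness` and the check names are hypothetical, and the lambdas stand in for real probes of ModelRequestProcessor and the GPU.

```python
from typing import Callable, Dict, Tuple


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every named check and map the outcome to an HTTP status code.

    Returns (200, body) when all checks pass and (503, body) otherwise.
    A check that raises is counted as failed instead of crashing the probe.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ready = all(results.values())
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return (200 if ready else 503), body


# Hypothetical wiring: each lambda is a placeholder for a real probe.
status_code, body = evaluate_readiness({
    "processor_initialized": lambda: True,
    "model_loaded": lambda: False,  # e.g. no endpoint has finished loading yet
})
print(status_code, body)
```

Reporting each check by name in the body keeps a 503 response useful for debugging, since it shows which precondition failed.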

3. GET /liveness

  • Purpose: Simple check if service is running
  • Response: 200 OK with minimal overhead

4. GET /metrics (Optional)

  • Purpose: Service metrics in Prometheus format
  • Includes:
    • Uptime
    • Request counts
    • Model loading status
    • GPU metrics (if available)
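
For illustration, the Prometheus text format for such metrics might be produced like this. The metric names (`clearml_serving_uptime_seconds`, `clearml_serving_model_loaded`) and the `render_prometheus` helper are assumptions made for this sketch; a real implementation would more likely rely on an existing client library such as prometheus_client.

```python
from typing import Dict, List, Tuple

# One sample is (labels, value); one metric is (name, type, help text, samples).
Sample = Tuple[Dict[str, str], float]
Metric = Tuple[str, str, str, List[Sample]]


def render_prometheus(metrics: List[Metric]) -> str:
    """Render metrics in the Prometheus text exposition format.

    Simplification: label values are not escaped, so they must not contain
    backslashes, double quotes, or newlines.
    """
    lines = []
    for name, metric_type, help_text, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only; the metric names are hypothetical.
text = render_prometheus([
    ("clearml_serving_uptime_seconds", "gauge", "Seconds since service start", [({}, 12.5)]),
    ("clearml_serving_model_loaded", "gauge", "1 if the model is loaded", [({"endpoint": "demo"}, 1)]),
])
print(text)
```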

Benefits

  1. Better Monitoring: Standard endpoints for monitoring tools
  2. Improved Reliability: Better container orchestration
  3. Easier Debugging: Clear service state visibility
  4. Standard Compliance: Follows cloud-native best practices

Implementation Details

  1. Add endpoints to clearml_serving/serving/main.py
  2. Extend ModelRequestProcessor with status methods
  3. Update Docker healthcheck to use /health
  4. Add documentation
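
As a sketch of step 3, a container health check could call a small probe script instead of requesting /docs. The names `probe_health` and `main` are hypothetical, and the URL (including port 8080) is an assumption about the deployment, not something taken from the ClearML Serving images.

```python
import http.client
import json
import urllib.error
import urllib.request


def probe_health(url: str, timeout: float = 2.0) -> bool:
    """Return True only if `url` answers 200 with a JSON body whose status is "healthy"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            payload = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection errors, timeouts, non-2xx responses, and invalid JSON all count as unhealthy
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def main(url: str = "http://127.0.0.1:8080/health") -> int:
    """Exit code for a container health check: 0 = healthy, 1 = unhealthy."""
    return 0 if probe_health(url) else 1
```

A health check command would then run something like `raise SystemExit(main())`, so that the process exit code reflects the probe result.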

Example Implementation

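# In clearml_serving/serving/main.py
# Sketch only: `app`, `processor`, and `__version__` are assumed to be defined
# in this module already. get_loaded_models() is a proposed status method
# (Implementation Details, step 2) and may not exist yet.
import time

from fastapi import HTTPException


@app.get("/health")
async def health_check():
    # Cheap check: confirms only that the web server can respond
    return {
        "status": "healthy",
        "service": "clearml-serving",
        "version": __version__,
        "timestamp": time.time(),  # Unix epoch seconds
    }


@app.get("/readiness")
async def readiness_check():
    # Not ready until the processor exists and reports at least one loaded model
    if not processor or not processor.get_loaded_models():
        raise HTTPException(status_code=503, detail="Service not ready")
    return {"status": "ready"}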

Note: This issue was created based on production experience with ClearML Serving at Sunrise Robotics. We're happy to contribute this feature if the maintainers agree with the approach.
