[Feature Request] Add Standard Health Check Endpoints to ClearML Serving #94

## Problem Statement
ClearML Serving currently lacks standard health check endpoints (`/health`, `/readiness`, `/liveness`, `/metrics`), which makes the service difficult to monitor and orchestrate in production environments. The current Docker healthcheck only verifies that the web server is responding by requesting the `/docs` endpoint; it does not indicate whether models are actually loaded and ready to serve requests.

## Current Limitations
1. **No service health verification** - Can't determine if the service is truly ready to handle traffic
2. **No model readiness checks** - No way to verify if models are properly loaded
3. **No standardized monitoring** - Difficult to integrate with standard monitoring tools
4. **Limited orchestration support** - Kubernetes, ECS, etc. can't properly manage the service lifecycle

## Proposed Solution
Add the following standard endpoints:

### 1. `GET /health`
- **Purpose**: Basic service health check
- **Response**: 200 OK with service status, version, and timestamp (Unix epoch seconds)
- **Example**:
  ```json
  {
    "status": "healthy",
    "service": "clearml-serving",
    "version": "1.5.0",
    "timestamp": 1729700000.0
  }
  ```

### 2. `GET /readiness`
- **Purpose**: Verify service is ready to accept traffic
- **Checks**:
  - ModelRequestProcessor is initialized
  - At least one model is loaded
  - GPU is accessible (if GPU support enabled)
- **Response**: 200 OK if ready, 503 Service Unavailable if not
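
The checks above could be combined roughly as follows. This is a framework-agnostic sketch, not the ClearML Serving implementation: `ReadinessReport`, `evaluate_readiness`, and the check names are hypothetical, and the lambdas stand in for real probes of the processor, the loaded models, and the GPU.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class ReadinessReport:
    """Outcome of all readiness checks, keyed by check name (hypothetical helper)."""
    checks: Dict[str, bool] = field(default_factory=dict)

    @property
    def ready(self) -> bool:
        # Ready only if at least one check ran and every check passed.
        return bool(self.checks) and all(self.checks.values())

    def to_response(self) -> Tuple[int, dict]:
        # 200 when ready, 503 Service Unavailable otherwise.
        status_code = 200 if self.ready else 503
        status = "ready" if self.ready else "not_ready"
        return status_code, {"status": status, "checks": self.checks}


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> ReadinessReport:
    """Run each check; a check that raises is recorded as failed."""
    report = ReadinessReport()
    for name, check in checks.items():
        try:
            report.checks[name] = bool(check())
        except Exception:
            report.checks[name] = False
    return report


# Placeholder wiring: real checks would query the processor, models, and GPU.
loaded_models = ["example_model"]
status_code, body = evaluate_readiness({
    "processor_initialized": lambda: True,
    "model_loaded": lambda: len(loaded_models) > 0,
}).to_response()
print(status_code, body)
```

Returning the per-check results in the body keeps a 503 response useful for debugging, since it shows which condition failed.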

### 3. `GET /liveness`
- **Purpose**: Confirm that the service process is running, without checking models or other dependencies
- **Response**: 200 OK with minimal overhead

### 4. `GET /metrics` (Optional)
- **Purpose**: Service metrics in Prometheus format
- **Includes**:
  - Uptime
  - Request counts
  - Model loading status
  - GPU metrics (if available)
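
As a rough illustration of what the `/metrics` body might look like, the sketch below renders uptime, a request counter, and per-model loading status in the Prometheus text exposition format using only the standard library. The metric names, `render_metrics`, and `escape_label` are assumptions made for this example; GPU metrics are left out.

```python
import time
from typing import Dict, Optional

START_TIME = time.time()  # captured once at import, used for uptime


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(
    request_count: int,
    models_loaded: Dict[str, bool],
    now: Optional[float] = None,
) -> str:
    """Build a Prometheus text-format payload (metric names are hypothetical)."""
    now = time.time() if now is None else now
    lines = [
        "# HELP clearml_serving_uptime_seconds Seconds since the process started.",
        "# TYPE clearml_serving_uptime_seconds gauge",
        f"clearml_serving_uptime_seconds {now - START_TIME:.3f}",
        "# HELP clearml_serving_requests_total Total requests handled.",
        "# TYPE clearml_serving_requests_total counter",
        f"clearml_serving_requests_total {request_count}",
        "# HELP clearml_serving_model_loaded 1 if the model is loaded, 0 otherwise.",
        "# TYPE clearml_serving_model_loaded gauge",
    ]
    # One sample per model, sorted so the output is stable between scrapes.
    for name, loaded in sorted(models_loaded.items()):
        label = escape_label(name)
        lines.append(f'clearml_serving_model_loaded{{model="{label}"}} {int(loaded)}')
    return "\n".join(lines) + "\n"


print(render_metrics(request_count=0, models_loaded={"example_model": True}))
```

A real implementation would more likely rely on an existing Prometheus client library than on hand-built strings; the sketch only shows the shape of the output.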

## Benefits
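1. **Better Monitoring**: Monitoring tools can poll standard endpoints instead of relying on `/docs`
2. **Improved Reliability**: Orchestrators can route traffic only to ready instances and restart unresponsive ones
3. **Easier Debugging**: Service and model state is visible through a single request
4. **Standard Compliance**: Follows common cloud-native health check conventions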

## Implementation Details
1. Add endpoints to `clearml_serving/serving/main.py`
2. Extend `ModelRequestProcessor` with status methods
3. Update Docker healthcheck to use `/health`
4. Add documentation
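
For step 3, one option is a small probe script that the Docker healthcheck runs instead of requesting `/docs`. The sketch below is an assumption about how that could look, using only the standard library; `probe`, `main`, and the default URL (including port 8080) are hypothetical and depend on the deployment.

```python
import urllib.error
import urllib.request

# Assumed default; the real host and port depend on how the container is run.
DEFAULT_URL = "http://localhost:8080/health"


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status such as 503.
        return False


def main(argv: list) -> int:
    """Exit-code style result: 0 when healthy, 1 otherwise."""
    target = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if probe(target) else 1
```

Ending the script with `sys.exit(main(sys.argv))` would give the healthcheck command the exit code it expects: 0 for healthy, 1 for unhealthy.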

## Example Implementation
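```python
# In clearml_serving/serving/main.py
import time

from fastapi import HTTPException

# `app`, `processor`, and `__version__` are assumed to be defined already
# in main.py; `get_loaded_models()` is the proposed status method on
# ModelRequestProcessor and does not exist yet.


@app.get("/health")
async def health_check():
    # Basic status payload; returns 200 whenever the web server responds.
    return {
        "status": "healthy",
        "service": "clearml-serving",
        "version": __version__,
        "timestamp": time.time(),
    }


@app.get("/readiness")
async def readiness_check():
    # Not ready until the processor exists and reports at least one model.
    if not processor or not processor.get_loaded_models():
        raise HTTPException(status_code=503, detail="Service not ready")
    return {"status": "ready"}
```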


*Note: This issue was created based on production experience with ClearML Serving at Sunrise Robotics. We're happy to contribute this feature if the maintainers agree with the approach.*

