@Edition-X

Description

Adds health monitoring endpoints (/health, /readiness, /liveness, and /metrics) to ClearML Serving inference containers.

Motivation

Container orchestration platforms (Kubernetes, Docker Swarm, etc.) rely on standardized health check endpoints to:

  • Determine when containers are ready to receive traffic
  • Detect and restart unhealthy containers
  • Monitor service health and performance

Changes

  • New Endpoints:

    • GET /health - Basic health check with service metadata
    • GET /readiness - Readiness probe (checks model loading, GPU availability)
    • GET /liveness - Lightweight liveness probe for orchestration
    • GET /metrics - Detailed service metrics (uptime, requests, GPU usage, loaded models)
  • Model Request Processor:

    • Added request counting and tracking
    • Added methods to query loaded endpoints and service state
    • Enables metrics endpoint to provide meaningful data
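
As a rough illustration of the request tracking described above, the following is a minimal sketch of a thread-safe counter that also records the time of the last request. The `RequestTracker` class and its field names are hypothetical and are not taken from the actual ModelRequestProcessor changes; it only shows the kind of state a metrics endpoint could read.

```python
import threading
import time
from typing import Optional


class RequestTracker:
    """Hypothetical helper: counts requests and remembers when the last one arrived."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._started_at = time.time()
        self._request_count = 0
        self._last_request_at: Optional[float] = None

    def record_request(self) -> None:
        # Called once per inference request; the lock keeps the
        # counter consistent when requests are handled concurrently.
        with self._lock:
            self._request_count += 1
            self._last_request_at = time.time()

    def snapshot(self) -> dict:
        # Returns a copy of the current state, suitable for a metrics response.
        with self._lock:
            return {
                "uptime_seconds": time.time() - self._started_at,
                "request_count": self._request_count,
                "last_request_at": self._last_request_at,
            }
```

Keeping the state behind a single `snapshot()` call means the metrics handler never reads a count and a timestamp that belong to different moments.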

Benefits

  • ✅ Standard cloud-native health check patterns
  • ✅ Better Kubernetes integration
  • ✅ Docker healthcheck support
  • ✅ Improved observability
  • ✅ No breaking changes to existing functionality

Testing

  • Deployed and tested in a production environment
  • Integrated with Docker healthchecks and autoheal
  • Verified all endpoints return correct status codes and data
  • Tested GPU availability detection and model loading checks
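
To make the readiness checks listed above concrete, here is a minimal sketch of how a readiness decision could combine model loading and GPU availability. The function name, the response fields, and the 200/503 status codes are assumptions based on common probe conventions; the PR does not state which codes the endpoint returns.

```python
from typing import Dict, List, Tuple


def evaluate_readiness(
    loaded_endpoints: List[str],
    gpu_required: bool,
    gpu_available: bool,
) -> Tuple[int, Dict[str, object]]:
    """Hypothetical readiness decision: returns (HTTP status code, response body)."""
    checks = {
        # At least one model endpoint must be loaded.
        "models_loaded": len(loaded_endpoints) > 0,
        # A missing GPU only matters when the service needs one.
        "gpu_ok": gpu_available or not gpu_required,
    }
    ready = all(checks.values())
    # Assumed convention: 200 when ready, 503 when not.
    status_code = 200 if ready else 503
    body = {"status": "ready" if ready else "not_ready", "checks": checks}
    return status_code, body
```

Reporting each check separately in the body makes it easier to see from a probe failure whether the model or the GPU is the cause.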

Example Usage

# Health check
curl http://localhost:8080/health

# Readiness check
curl http://localhost:8080/readiness

# Liveness check
curl http://localhost:8080/liveness

# Get metrics
curl http://localhost:8080/metrics
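
The same checks can be scripted, for example from a custom healthcheck command. The sketch below uses only the Python standard library; the `probe` helper is hypothetical, and treating any 2xx response as healthy is an assumption rather than documented behavior of these endpoints.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def probe(base_url: str, path: str, timeout: float = 2.0) -> Tuple[bool, Optional[int]]:
    """Hypothetical helper: returns (healthy, HTTP status or None if unreachable)."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300, response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return False, exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False, None


if __name__ == "__main__":
    # Base URL taken from the curl examples above.
    for endpoint in ("/health", "/readiness", "/liveness", "/metrics"):
        healthy, status = probe("http://localhost:8080", endpoint)
        print(endpoint, "healthy" if healthy else "unhealthy", status)
```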

Closes Issue 94

- Added /health, /readiness, /liveness and /metrics endpoints for monitoring service status
- Implemented request tracking in ModelRequestProcessor to count requests and record last prediction time
- Added service instance ID and startup time tracking for monitoring
- Added GPU memory metrics collection using pynvml when available
- Enhanced readiness check to verify model loading status and GPU availability
- Added detailed metrics endpoint providing
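
For reference, GPU memory collection with pynvml "when available" could look roughly like the sketch below. The function name and the returned field names are hypothetical, not taken from the commit; only the pynvml calls themselves (`nvmlInit`, `nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetMemoryInfo`, `nvmlShutdown`) are part of the library.

```python
from typing import Dict, List, Optional


def gpu_memory_metrics() -> Optional[List[Dict[str, int]]]:
    """Hypothetical helper: per-GPU memory in bytes, or None if NVML is unavailable."""
    try:
        import pynvml  # optional dependency
    except ImportError:
        return None

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No NVIDIA driver or NVML library on this host.
        return None

    try:
        metrics = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics.append(
                {
                    "index": index,
                    "total_bytes": int(memory.total),
                    "used_bytes": int(memory.used),
                    "free_bytes": int(memory.free),
                }
            )
        return metrics
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()
```

Returning `None` instead of raising lets a metrics endpoint omit GPU data on CPU-only hosts without failing the whole response.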

Development

Successfully merging this pull request may close these issues.

[Feature Request] Add Standard Health Check Endpoints to ClearML Serving
