feat: Add health check endpoints for container orchestration #95

Edition-X · 2025-10-29T18:19:12Z

Description

Adds four health monitoring endpoints (/health, /readiness, /liveness and /metrics) to ClearML Serving inference containers.

Motivation

Modern container orchestration platforms (Kubernetes, Docker Swarm, etc.) require standardized health check endpoints to:

- Determine when containers are ready to receive traffic
- Detect and restart unhealthy containers
- Monitor service health and performance

Changes

New Endpoints:
- GET /health - Basic health check with service metadata
- GET /readiness - Readiness probe (checks model loading, GPU availability)
- GET /liveness - Lightweight liveness probe for orchestration
- GET /metrics - Detailed service metrics (uptime, requests, GPU usage, loaded models)

Model Request Processor:
- Added request counting and tracking
- Added methods to query loaded endpoints and service state
- Together, these let the /metrics endpoint report meaningful data
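
A minimal sketch of the request tracking listed above, assuming a lock-protected counter is enough. `RequestTracker` and its method names are hypothetical and are not taken from `ModelRequestProcessor`; the sketch only illustrates the kind of state a metrics endpoint could read.

```python
import threading
import time
import uuid


class RequestTracker:
    """Hypothetical helper: counts requests and remembers the last prediction time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.instance_id = uuid.uuid4().hex  # illustrative service instance ID
        self.started_at = time.time()
        self.request_count = 0
        self.last_prediction_at = None  # stays None until the first request

    def record_request(self) -> None:
        # Called once per inference request; the lock keeps the count exact
        # when several worker threads serve requests concurrently.
        with self._lock:
            self.request_count += 1
            self.last_prediction_at = time.time()

    def snapshot(self) -> dict:
        # Returns a plain dict that an endpoint handler could serialize to JSON.
        with self._lock:
            return {
                "instance_id": self.instance_id,
                "uptime_seconds": time.time() - self.started_at,
                "request_count": self.request_count,
                "last_prediction_at": self.last_prediction_at,
            }


tracker = RequestTracker()
tracker.record_request()
print(tracker.snapshot()["request_count"])
```

Keeping the counters behind a single `snapshot()` call means a handler reads a consistent view, rather than a count and a timestamp taken at different moments.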

Benefits

✅ Standard cloud-native health check patterns
✅ Better Kubernetes integration
✅ Docker healthcheck support
✅ Improved observability
✅ No breaking changes to existing functionality

Testing

- Deployed and tested in a production environment
- Integrated with Docker healthchecks and autoheal
- Verified that all endpoints return the correct status codes and data
- Tested GPU availability detection and model loading checks

Example Usage

# Health check
curl http://localhost:8080/health

# Readiness check
curl http://localhost:8080/readiness

# Get metrics
curl http://localhost:8080/metrics
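
Where curl is not available inside the image, the same check can be done from Python. The following is a sketch using only the standard library; `probe` and `exit_code` are hypothetical helper names, and it assumes a healthy /health endpoint answers with a 2xx status.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers connection refused, timeouts, non-2xx responses (HTTPError),
        # malformed HTTP replies and malformed URLs.
        return False


def exit_code(base_url: str) -> int:
    """Map the /health result to a process exit code: 0 healthy, 1 unhealthy."""
    return 0 if probe(base_url + "/health") else 1
```

A container healthcheck command could then finish with `sys.exit(exit_code("http://localhost:8080"))`, using the same base URL as the curl examples.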

Closes Issue 94

- Added /health, /readiness, /liveness and /metrics endpoints for monitoring service status
- Implemented request tracking in ModelRequestProcessor to count requests and record last prediction time
- Added service instance ID and startup time tracking for monitoring
- Added GPU memory metrics collection using pynvml when available
- Enhanced readiness check to verify model loading status and GPU availability
- Added detailed metrics endpoint providing
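
The "pynvml when available" behaviour mentioned in the commit message can be sketched roughly as follows. This is an assumption about the shape of the logic, not the code in this PR: `gpu_memory_metrics` is a hypothetical name, and the sketch simply returns an empty list when pynvml or an NVIDIA driver is missing.

```python
def gpu_memory_metrics() -> list:
    """Return per-GPU memory usage in bytes, or [] when no GPU data is available."""
    try:
        import pynvml  # optional dependency; typically absent on CPU-only hosts
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except Exception:
        # pynvml is installed but no usable NVIDIA driver or device was found.
        return []
    try:
        metrics = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics.append(
                {
                    "index": index,
                    "total": memory.total,
                    "used": memory.used,
                    "free": memory.free,
                }
            )
        return metrics
    except Exception:
        # Any NVML query failure degrades to "no GPU data" instead of an error.
        return []
    finally:
        pynvml.nvmlShutdown()


print(gpu_memory_metrics())
```

Returning an empty list rather than raising keeps a metrics handler usable on CPU-only deployments.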


Edition-X added 6 commits October 29, 2025 12:43

fix: f-string syntax for Python 3.10 compatibility

cb3c923

fix: make vllm imports and OpenAI endpoints optional for production compatibility

575cdae

fix indent

173622c

docs: add health check endpoints documentation

4877e7e

add back how it was

c330bcb
