# KubeAI Autoscaler
KubeAI Autoscaler is a Kubernetes-native solution for dynamically scaling AI inference workloads based on real-time performance metrics such as latency, GPU utilization, and request throughput. Unlike traditional autoscalers that rely on CPU/memory metrics, this project introduces AI-specific scaling logic to optimize resource usage and improve inference performance in cloud-native environments.
## Problem Statement

Current Kubernetes autoscaling solutions (Horizontal Pod Autoscaler, KEDA) are designed for general-purpose workloads and do not account for GPU-intensive AI inference jobs. AI workloads often require:
- GPU-aware scaling for cost efficiency
- Latency-based scaling to maintain SLA for inference
- Custom metrics like model response time and queue depth
Without these capabilities, organizations face:
- Over-provisioning of expensive GPU resources
- Poor inference performance under variable load
- Lack of observability for AI-specific metrics
## Goals

- Provide a Kubernetes controller that scales AI inference pods based on custom metrics
- Integrate with Prometheus for metric collection and KEDA for event-driven scaling
- Support GPU-aware scheduling using Kubernetes device plugins
- Offer CRDs for defining AI autoscaling policies (e.g., latency thresholds, GPU utilization targets)
## Key Features

- Custom Metrics Adapter for AI workloads (latency, GPU usage, request queue depth)
- Dynamic scaling logic based on AI-specific SLAs
- Integration with the CNCF ecosystem:
  - Prometheus for metrics
  - KEDA for event-driven scaling
  - ArgoCD for GitOps-based deployment
- Extensible architecture for future ML pipeline integration
## Roadmap

### Phase 1: MVP

- Implement a CRD for autoscaling policies (`AIInferenceAutoscalerPolicy`)
- Basic controller logic for scaling based on latency and GPU utilization
- Prometheus integration to collect GPU and latency metrics
- Basic scaling logic for inference pods
### Phase 2

- Predictive scaling using AI models
- KEDA integration for event-driven scaling
- Advanced GPU-aware scheduling optimizations
### Phase 3

- Multi-cluster support
- Service mesh integration for secure metric collection
- Observability dashboards for AI workloads
## Governance

- License: Apache 2.0
- Governance model: maintainers plus community contributors
- Initial contributors: pavan4devops@gmail.com
- Open for CNCF community participation
## Why CNCF?

- Aligns with the CNCF's mission to advance cloud-native technologies
- Bridges the gap between AI workloads and Kubernetes-native scaling
- Early-stage, but addresses a growing need in AI/ML infrastructure
## Resources

- GitHub: github.com/pmady/kubeai-autoscaler
- Documentation: installation guide, CRD specs, examples
## Architecture

### Custom Resource: `AIInferenceAutoscalerPolicy`

Defines the scaling rules:

- Latency threshold
- GPU utilization target
- Min/max replicas
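A policy object with those fields might look like the following sketch; the API group, version, and field names here are illustrative assumptions, not a published schema:

```yaml
# Hypothetical example -- group/version and field names are assumptions
apiVersion: kubeai.example.com/v1alpha1
kind: AIInferenceAutoscalerPolicy
metadata:
  name: llm-inference-policy
spec:
  targetRef:
    kind: Deployment
    name: llm-inference
  latencyThresholdMs: 200      # scale up when inference latency exceeds this
  gpuUtilizationTarget: 70     # percent
  minReplicas: 2
  maxReplicas: 10
```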
### Controller

- Watches `AIInferenceAutoscalerPolicy` objects and the exposed metrics
- Applies scaling decisions to the target Deployment or StatefulSet
### Metrics Pipeline

- Prometheus scrapes:
  - GPU metrics (via the NVIDIA DCGM exporter or similar)
  - Latency metrics (from the inference service)
- A custom metrics adapter exposes these metrics to the Kubernetes API
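As a sketch, a Prometheus scrape job for the NVIDIA DCGM exporter might look like this (the job and service names are assumptions; the exporter's default metrics port is 9400):

```yaml
# Illustrative scrape job for GPU metrics
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints of the dcgm-exporter service (name is an assumption)
      - source_labels: [__meta_kubernetes_service_name]
        regex: dcgm-exporter
        action: keep
```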
### Scaling Algorithm

The controller calculates the desired replica count based on:

- Latency SLA
- GPU utilization
- Request queue depth
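The calculation above can be sketched HPA-style: derive a scale ratio per metric, let the most-loaded metric drive the decision, and clamp the result to the policy's replica bounds. This is an illustrative formula under assumed names, not the project's actual algorithm:

```python
import math

def desired_replicas(current: int,
                     latency_ms: float, latency_sla_ms: float,
                     gpu_util: float, gpu_target: float,
                     queue_depth: float, queue_target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA-style calculation: scale each metric by its ratio to the
    target and take the maximum, so any single overloaded metric
    is enough to trigger a scale-up."""
    ratios = [
        latency_ms / latency_sla_ms,  # >1 means the latency SLA is breached
        gpu_util / gpu_target,        # >1 means GPUs are over-utilized
        queue_depth / queue_target,   # >1 means requests are piling up
    ]
    desired = math.ceil(current * max(ratios))
    # Clamp to the policy's replica bounds
    return max(min_replicas, min(max_replicas, desired))
```

Taking the maximum ratio mirrors how the built-in Kubernetes HPA combines multiple metrics: each metric proposes a replica count and the largest proposal wins.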
### Ecosystem Integrations

- KEDA for event-driven scaling (optional)
- ArgoCD for GitOps deployment
- Device plugin for GPU-aware scheduling
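For the optional KEDA integration, scaling on request throughput could be expressed with a Prometheus-triggered `ScaledObject`; the target name, server address, and metric name below are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference            # target Deployment (assumed name)
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(rate(inference_requests_total[1m]))  # metric name is an assumption
        threshold: "100"
```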
### Data Flow

```
[User] --> [CRD: AIInferenceAutoscalerPolicy] --> [Controller]
                                                       |
                                                       v
                                     [Kubernetes API: Scale Deployment]
                                                       ^
                                                       |
[Prometheus] --> [Custom Metrics Adapter] -----> [Controller]
[GPU Device Plugin] --> [Kubernetes Scheduler]
```
### Component Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                        KubeAI Autoscaler                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │    CRDs     │    │ Controller  │    │   Metrics Adapter   │  │
│  │             │───▶│             │◀───│                     │  │
│  │  AIScaler   │    │ Reconciler  │    │  GPU/Latency/Queue  │  │
│  │  AIPolicy   │    │             │    │                     │  │
│  └─────────────┘    └──────┬──────┘    └──────────┬──────────┘  │
│                            │                      │             │
│                            ▼                      │             │
│                     ┌─────────────┐               │             │
│                     │ Kubernetes  │               │             │
│                     │ Deployments │               │             │
│                     └─────────────┘               │             │
└───────────────────────────────────────────────────┼─────────────┘
                                                    │
                               ┌────────────────────┘
                               │
                               ▼
                      ┌─────────────────┐
                      │   Prometheus    │
                      │    (Metrics)    │
                      └─────────────────┘
```
## Related CNCF Projects

| Project | Relationship |
|---|---|
| KEDA | Event-driven scaling integration |
| Prometheus | Metrics collection |
| NVIDIA GPU Operator | GPU device plugin |
| KServe | AI inference serving |
## Contact

- Maintainer: Pavan Madduri
- Email: pavan4devops@gmail.com
- GitHub: @pmady