A Kubernetes-native solution for dynamically scaling AI inference workloads based on real-time performance metrics.
KubeAI Autoscaler bridges the gap between AI workloads and cloud-native autoscaling by introducing AI-specific scaling logic based on:
- GPU Utilization - Scale based on GPU compute usage
- Latency SLA - Maintain response time targets (P99/P95)
- Request Queue Depth - Scale based on pending requests
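As an illustration of how these metrics can drive replica counts, the sketch below applies an HPA-style ratio rule (desired = ceil(current × observed/target)) per metric and takes the maximum. The type and function names here are illustrative, not the project's real API:

```go
// Sketch of per-metric scaling math, assuming an HPA-style ratio rule.
package main

import (
	"fmt"
	"math"
)

// MetricTarget pairs an observed value with its configured target,
// e.g. observed P99 latency vs. targetP99Ms, or GPU% vs. targetPercentage.
type MetricTarget struct {
	Observed float64
	Target   float64
}

// DesiredReplicas applies desired = ceil(current * observed/target)
// independently per metric and takes the maximum, clamped to [min, max].
func DesiredReplicas(current, min, max int, metrics []MetricTarget) int {
	desired := min
	for _, m := range metrics {
		d := int(math.Ceil(float64(current) * m.Observed / m.Target))
		if d > desired {
			desired = d
		}
	}
	if desired > max {
		desired = max
	}
	return desired
}

func main() {
	// 4 replicas, P99 at 750ms vs a 500ms target, GPU at 60% vs an 80% target:
	// latency dominates, so ceil(4 * 750/500) = 6 replicas.
	n := DesiredReplicas(4, 2, 10, []MetricTarget{
		{Observed: 750, Target: 500}, // latency (ms)
		{Observed: 60, Target: 80},   // GPU utilization (%)
	})
	fmt.Println(n) // 6
}
```

Taking the maximum across metrics means the most pressured resource wins, so a latency SLA breach can trigger a scale-up even while GPUs are underutilized.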
Traditional Kubernetes autoscalers scale on CPU/memory (HPA) or generic event sources (KEDA) and are not optimized for GPU-heavy AI inference workloads.
| Feature | HPA | KEDA | KubeAI Autoscaler |
|---|---|---|---|
| CPU/Memory Scaling | ✅ | ✅ | ✅ |
| GPU-Aware Scaling | ❌ | ❌ | ✅ |
| Latency-Based Scaling | ❌ | ❌ | ✅ |
| AI-Specific Metrics | ❌ | ❌ | ✅ |
| Queue Depth Scaling | ❌ | ✅ | ✅ |
- Custom Resource Definitions (CRDs) - Define AI autoscaling policies declaratively
- Prometheus Integration - Collect GPU and latency metrics
- Dynamic Scaling Logic - AI-specific scaling algorithms
- Extensible Architecture - Support for custom metrics and scaling strategies
- CNCF Ecosystem Integration - Works with Prometheus, KEDA, ArgoCD
┌─────────────────────────────────────────────────────────────────┐
│ KubeAI Autoscaler │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ CRDs │ │ Controller │ │ Metrics Adapter │ │
│ │ │───▶│ │◀───│ │ │
│ │ AIPolicy │ │ Reconciler │ │ GPU/Latency/Queue │ │
│ └─────────────┘ └──────┬──────┘ └──────────┬──────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ Kubernetes │ │ │
│ │ Deployments │ │ │
│ └─────────────┘ │ │
└───────────────────────────────────────────────────┼─────────────┘
│
▼
┌─────────────────┐
│ Prometheus │
└─────────────────┘
See Architecture Documentation for details.
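The control flow in the diagram can be sketched as a single reconcile step: read the policy, pull the latest metrics from the adapter, compute the desired replica count, and patch the Deployment only when it changes. All types below are illustrative stand-ins, not the project's real API:

```go
// Minimal sketch of the controller's reconcile step, with stand-in types.
package main

import (
	"fmt"
	"math"
)

// Policy mirrors the policy spec fields used for scaling decisions.
type Policy struct {
	MinReplicas, MaxReplicas int
	TargetP99Ms              float64
	TargetGPUPct             float64
}

// Metrics is what the metrics adapter reports for the target Deployment.
type Metrics struct {
	P99Ms  float64
	GPUPct float64
}

// Deployment is a stand-in for the scaled workload.
type Deployment struct{ Replicas int }

// Reconcile computes the desired replica count from the policy and the
// latest metrics, and updates the Deployment only when the count changes.
// It reports whether an update was made.
func Reconcile(p Policy, m Metrics, d *Deployment) bool {
	ratio := math.Max(m.P99Ms/p.TargetP99Ms, m.GPUPct/p.TargetGPUPct)
	desired := int(math.Ceil(float64(d.Replicas) * ratio))
	if desired < p.MinReplicas {
		desired = p.MinReplicas
	}
	if desired > p.MaxReplicas {
		desired = p.MaxReplicas
	}
	if desired == d.Replicas {
		return false // nothing to do; avoids needless API churn
	}
	d.Replicas = desired
	return true
}

func main() {
	p := Policy{MinReplicas: 2, MaxReplicas: 10, TargetP99Ms: 500, TargetGPUPct: 80}
	d := &Deployment{Replicas: 3}
	changed := Reconcile(p, Metrics{P99Ms: 900, GPUPct: 70}, d)
	fmt.Println(changed, d.Replicas) // latency ratio 1.8 -> ceil(3*1.8) = 6
}
```

Only writing when the replica count changes keeps the reconcile loop idempotent, which is the usual contract for Kubernetes controllers.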
- Kubernetes cluster (v1.24+)
- Prometheus installed and scraping GPU metrics (e.g., from the NVIDIA DCGM exporter)
- NVIDIA GPU device plugin (for GPU workloads)
- kubectl configured
```bash
# Install CRDs
kubectl apply -f crds/

# Install controller
kubectl apply -f controller/
```

- Create an AIInferenceAutoscalerPolicy:

```yaml
apiVersion: kubeai.io/v1alpha1
kind: AIInferenceAutoscalerPolicy
metadata:
  name: llm-inference-policy
  namespace: ai-workloads
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    latency:
      enabled: true
      targetP99Ms: 500
    gpuUtilization:
      enabled: true
      targetPercentage: 80
```

- Apply the policy:

```bash
kubectl apply -f examples/basic-policy.yaml
```

- Check status:

```bash
kubectl get aiap -n ai-workloads
```

| Example | Description |
|---|---|
| basic-policy.yaml | Basic autoscaling with latency and GPU metrics |
| gpu-focused-policy.yaml | GPU-intensive workload scaling |
| latency-sla-policy.yaml | Strict latency SLA enforcement |
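As a sketch of what the latency-SLA example might contain, the policy below disables GPU-based scaling and tightens the P99 target. Field names follow the Quick Start example; the specific values and metadata here are illustrative, not the contents of the actual `latency-sla-policy.yaml`:

```yaml
apiVersion: kubeai.io/v1alpha1
kind: AIInferenceAutoscalerPolicy
metadata:
  name: latency-sla-policy
  namespace: ai-workloads
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    latency:
      enabled: true
      targetP99Ms: 200   # tighter SLA than the basic example
    gpuUtilization:
      enabled: false     # scale on latency alone
```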
- CRD for autoscaling policy
- Basic controller logic
- Prometheus integration
- Predictive scaling using AI models
- KEDA integration
- Advanced GPU scheduling
- Multi-cluster support
- Service mesh integration
- Observability dashboards
We welcome contributions! Please see our Contributing Guide for details.
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions and share ideas
Apache License 2.0 - see LICENSE for details.
- KEDA - Event-driven autoscaling
- Prometheus - Metrics and monitoring
- NVIDIA GPU Operator - GPU management
- KServe - AI inference serving