A Kubernetes-native solution for dynamically scaling AI inference workloads based on real-time performance metrics.
KubeAI Autoscaler bridges the gap between AI workloads and cloud-native autoscaling by introducing AI-specific scaling logic based on:
- GPU Utilization - Scale based on GPU compute usage
- Latency SLA - Maintain response time targets (P99/P95)
- Request Queue Depth - Scale based on pending requests
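As an illustration of how these metrics can drive replica counts, the sketch below applies an HPA-style ratio rule (desired = ceil(current × observed/target)) per metric and takes the maximum. The type and function names here are illustrative, not the project's real API:

```go
// Sketch of per-metric scaling math, assuming an HPA-style ratio rule.
package main

import (
	"fmt"
	"math"
)

// MetricTarget pairs an observed value with its configured target,
// e.g. observed P99 latency vs. targetP99Ms, or GPU% vs. targetPercentage.
type MetricTarget struct {
	Observed float64
	Target   float64
}

// DesiredReplicas applies desired = ceil(current * observed/target)
// independently per metric and takes the maximum, clamped to [min, max].
func DesiredReplicas(current, min, max int, metrics []MetricTarget) int {
	desired := min
	for _, m := range metrics {
		d := int(math.Ceil(float64(current) * m.Observed / m.Target))
		if d > desired {
			desired = d
		}
	}
	if desired > max {
		desired = max
	}
	return desired
}

func main() {
	// 4 replicas, P99 at 750ms vs a 500ms target, GPU at 60% vs an 80% target:
	// latency dominates, so ceil(4 * 750/500) = 6 replicas.
	n := DesiredReplicas(4, 2, 10, []MetricTarget{
		{Observed: 750, Target: 500}, // latency (ms)
		{Observed: 60, Target: 80},   // GPU utilization (%)
	})
	fmt.Println(n) // 6
}
```

Taking the maximum across metrics means the most pressured resource wins, so a latency SLA breach can trigger a scale-up even while GPUs are underutilized.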
Traditional Kubernetes autoscalers scale on CPU/memory (HPA) or generic event sources (KEDA) and are not optimized for GPU-heavy AI inference workloads.
| Feature | HPA | KEDA | KubeAI Autoscaler |
|---|---|---|---|
| CPU/Memory Scaling | ✅ | ✅ | ✅ |
| GPU-Aware Scaling | ❌ | ❌ | ✅ |
| Latency-Based Scaling | ❌ | ❌ | ✅ |
| AI-Specific Metrics | ❌ | ❌ | ✅ |
| Queue Depth Scaling | ❌ | ✅ | ✅ |
- Custom Resource Definitions (CRDs) - Define AI autoscaling policies declaratively
- Prometheus Integration - Collect GPU and latency metrics
- Dynamic Scaling Logic - AI-specific scaling algorithms
- Extensible Architecture - Support for custom metrics and scaling strategies
- CNCF Ecosystem Integration - Works with Prometheus, KEDA, ArgoCD
┌─────────────────────────────────────────────────────────────────┐
│ KubeAI Autoscaler │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ CRDs │ │ Controller │ │ Metrics Adapter │ │
│ │ │───▶│ │◀───│ │ │
│ │ AIPolicy │ │ Reconciler │ │ GPU/Latency/Queue │ │
│ └─────────────┘ └──────┬──────┘ └──────────┬──────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ Kubernetes │ │ │
│ │ Deployments │ │ │
│ └─────────────┘ │ │
└───────────────────────────────────────────────────┼─────────────┘
│
▼
┌─────────────────┐
│ Prometheus │
└─────────────────┘
See Architecture Documentation for details.
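The control flow in the diagram can be sketched as a single reconcile step: read the policy, pull the latest metrics from the adapter, compute the desired replica count, and patch the Deployment only when it changes. All types below are illustrative stand-ins, not the project's real API:

```go
// Minimal sketch of the controller's reconcile step, with stand-in types.
package main

import (
	"fmt"
	"math"
)

// Policy mirrors the policy spec fields used for scaling decisions.
type Policy struct {
	MinReplicas, MaxReplicas int
	TargetP99Ms              float64
	TargetGPUPct             float64
}

// Metrics is what the metrics adapter reports for the target Deployment.
type Metrics struct {
	P99Ms  float64
	GPUPct float64
}

// Deployment is a stand-in for the scaled workload.
type Deployment struct{ Replicas int }

// Reconcile computes the desired replica count from the policy and the
// latest metrics, and updates the Deployment only when the count changes.
// It reports whether an update was made.
func Reconcile(p Policy, m Metrics, d *Deployment) bool {
	ratio := math.Max(m.P99Ms/p.TargetP99Ms, m.GPUPct/p.TargetGPUPct)
	desired := int(math.Ceil(float64(d.Replicas) * ratio))
	if desired < p.MinReplicas {
		desired = p.MinReplicas
	}
	if desired > p.MaxReplicas {
		desired = p.MaxReplicas
	}
	if desired == d.Replicas {
		return false // nothing to do; avoids needless API churn
	}
	d.Replicas = desired
	return true
}

func main() {
	p := Policy{MinReplicas: 2, MaxReplicas: 10, TargetP99Ms: 500, TargetGPUPct: 80}
	d := &Deployment{Replicas: 3}
	changed := Reconcile(p, Metrics{P99Ms: 900, GPUPct: 70}, d)
	fmt.Println(changed, d.Replicas) // latency ratio 1.8 -> ceil(3*1.8) = 6
}
```

Only writing when the replica count changes keeps the reconcile loop idempotent, which is the usual contract for Kubernetes controllers.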
- Kubernetes cluster (v1.24+)
- Prometheus installed and scraping GPU metrics (e.g., from the NVIDIA DCGM exporter)
- NVIDIA GPU device plugin (for GPU workloads)
- kubectl configured
```bash
# Install CRDs
kubectl apply -f crds/

# Install controller
kubectl apply -f controller/
```

- Create an AIInferenceAutoscalerPolicy:

```yaml
apiVersion: kubeai.io/v1alpha1
kind: AIInferenceAutoscalerPolicy
metadata:
  name: llm-inference-policy
  namespace: ai-workloads
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    latency:
      enabled: true
      targetP99Ms: 500
    gpuUtilization:
      enabled: true
      targetPercentage: 80
```

- Apply the policy:

```bash
kubectl apply -f examples/basic-policy.yaml
```

- Check status:

```bash
kubectl get aiap -n ai-workloads
```

| Example | Description |
|---|---|
| basic-policy.yaml | Basic autoscaling with latency and GPU metrics |
| gpu-focused-policy.yaml | GPU-intensive workload scaling |
| latency-sla-policy.yaml | Strict latency SLA enforcement |
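As a sketch of what the latency-SLA example might contain, the policy below disables GPU-based scaling and tightens the P99 target. Field names follow the Quick Start example; the specific values and metadata here are illustrative, not the contents of the actual `latency-sla-policy.yaml`:

```yaml
apiVersion: kubeai.io/v1alpha1
kind: AIInferenceAutoscalerPolicy
metadata:
  name: latency-sla-policy
  namespace: ai-workloads
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    latency:
      enabled: true
      targetP99Ms: 200   # tighter SLA than the basic example
    gpuUtilization:
      enabled: false     # scale on latency alone
```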
- CRD for autoscaling policy
- Basic controller logic
- Prometheus integration
- Predictive scaling using AI models
- KEDA integration
- Advanced GPU scheduling
- Multi-cluster support
- Service mesh integration
- Observability dashboards
We welcome contributions! Please see our Contributing Guide for details.
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions and share ideas
Apache License 2.0 - see LICENSE for details.
- KEDA - Event-driven autoscaling
- Prometheus - Metrics and monitoring
- NVIDIA GPU Operator - GPU management
- KServe - AI inference serving