Current Version: 0.5.0 Last Updated: March 2026 Status: ✅ Phase 2 in Progress - Metal Agent, Memory Management, Observability, Benchmarking
Make GPU-accelerated LLM inference on Kubernetes dead simple. Deploy production-ready inference services in minutes, not days.
Target Users:
- Developers building AI-powered applications
- Platform teams running internal LLM services
- Organizations needing air-gapped/edge deployments
- Anyone wanting OpenAI-compatible APIs on their own infrastructure
Core Platform:
- ✅ Kubernetes-native CRDs (
Model,InferenceService) - ✅ Automatic model download from HuggingFace/HTTP
- ✅ OpenAI-compatible
/v1/chat/completionsAPI - ✅ Multi-replica deployment support
- ✅ Full CLI tool (
llmkube deploy/list/status/delete/queue/cache/benchmark/inspect/license) - ✅ Helm chart for easy installation
- ✅ Model catalog with pre-configured models and one-command deployment
- ✅ Air-gapped mode with
file://and local path model sources
GPU Acceleration:
- ✅ NVIDIA GPU support (T4, L4, A100)
- ✅ 17x performance improvement (64 tok/s GPU vs 4.6 tok/s CPU)
- ✅ Automatic GPU layer offloading
- ✅ Single-node multi-GPU sharding (up to 8 GPUs, layer-based splitting)
- ✅ Cost optimization (spot instances, auto-scale to 0)
Apple Silicon (Metal) Support:
- ✅ Metal Agent daemon for native macOS llama-server processes
- ✅ Pre-flight memory validation with GGUF-based estimation
- ✅ Per-model
memoryBudgetandmemoryFractionCRD fields - ✅ Continuous health monitoring of managed processes
- ✅ Agent metrics (memory budget, process health, active count)
- ✅ Homebrew auto-detection for llama-server binary
- ✅ Heterogeneous clusters (mix NVIDIA and Apple Silicon nodes)
GPU Scheduling & Queue Management:
- ✅ GPU contention visibility (
WaitingForGPUphase) - ✅ Queue position tracking for pending services
- ✅ Priority classes (critical/high/normal/low/batch)
- ✅
llmkube queuecommand to view waiting services - ✅ Detailed scheduling status and messages
Observability:
- ✅ 10 custom Prometheus metrics (downloads, phases, queues, reconciliation)
- ✅ OpenTelemetry tracing with Tempo export (gRPC port 4317)
- ✅ GPU metrics via DCGM exporter (utilization, temperature, power, memory)
- ✅ PodMonitor for llama.cpp inference metrics scraping
- ✅ Pre-built Grafana dashboards
- ✅ SLO alerts for GPU health and service availability
Model Management:
- ✅ PVC-based model caching with SHA256 cache keys
- ✅
llmkube cachecommand (list, clear, preload, inspect) - ✅ Native GGUF parser with metadata extraction
- ✅
llmkube inspectcommand for GGUF file introspection - ✅ License compliance scanning (
llmkube license) - ✅ Init container customization with custom CA certificate support
Benchmarking:
- ✅
llmkube benchmarkwith load testing, stress testing, and sweep modes - ✅ Concurrency, context size, and token count sweeps
- ✅ Per-request latency and throughput capture (p50, p95, p99)
- ✅ GPU monitoring during benchmarks
Deployment Options:
- ✅ Minikube/Kind (local development)
- ✅ GKE with GPU support (Terraform)
- ✅ EKS with GPU support (Terraform)
- ✅ AKS with GPU support (Terraform)
- ✅ Network policies in Helm chart
Focus: Make it easier to use, support more use cases
Completed:
- ✅ Model Catalog - Pre-configured popular models with
llmkube deploy llama-3-8b --gpu - ✅
llmkube benchmark- Comprehensive performance testing (load, stress, sweeps) - ✅ Health checks and readiness probes - Three-probe pattern for llama.cpp pods
- ✅ Persistent model storage - PVC-based caching (v0.4.1)
- ✅ GPU queue management - Priority classes and queue tracking (v0.4.9)
- ✅ Multi-GPU - Single-node sharding for 13B+ models (~65 tok/s on 8B with 2x RTX 5060 Ti)
Remaining:
llmkube init- Interactive setup wizardllmkube chat <model>- Test inference from CLI- Horizontal Pod Autoscaling (HPA) support
Focus: Support more deployment scenarios
Planned:
- K3s Compatibility - Lightweight Kubernetes for edge
- ARM64 Support - Raspberry Pi, NVIDIA Jetson
- Cost Tracking - Per-deployment cost attribution
- Additional Model Formats - SafeTensors, HuggingFace native
Already shipped early:
Air-gapped Mode✅ Shipped in v0.4.x (file:// and local path model sources)
Focus: Larger models, better performance
Planned:
- Distributed Inference - Shard large models (70B+) across nodes via llama.cpp RPC backend
- Advanced Auto-scaling - Queue depth, latency-based scaling
- AMD GPU Support - ROCm backend for AMD GPUs
- Intel GPU Support - oneAPI integration
- Performance Optimizations - KV cache sharing, batching improvements
Focus: Build sustainable open-source project
Planned:
- v1.0 Release - Production-hardened, stable APIs
- Operator Hub - Official Kubernetes Operator listing
- ArgoCD/Flux Templates - GitOps-ready deployments
- Terraform Provider - Infrastructure-as-code support
- VS Code Extension - YAML validation, snippets
- Community Program - Contributor guides, good first issues
| Milestone | Target | Status |
|---|---|---|
| Single GPU (3B model) | >60 tok/s | ✅ Achieved (64 tok/s on L4) |
| Multi-GPU (13B model) | >40 tok/s | ✅ Achieved (~44 tok/s on 2x RTX 5060 Ti) |
| Multi-GPU (8B model) | >60 tok/s | ✅ Achieved (~65 tok/s on 2x RTX 5060 Ti) |
| Multi-node (70B model) | <500ms P99 latency | 🎯 Planned |
| Cost Efficiency | <$0.01 per 1K tokens | 🎯 Planned |
| Model Load Time | <30s for any model | 🎯 Planned |
| Metric | Current | Q2 2026 Goal | Q4 2026 Goal |
|---|---|---|---|
| GitHub Stars | 25 | 200 | 1,000 |
| Contributors | 4 | 10 | 25 |
| Forks | 4 | 15 | 50 |
| Production Deployments | 1 | 10 | 25 |
| Models Supported | Any GGUF | 20+ pre-configured | 50+ pre-configured |
We're actively looking for contributors! Here's how you can help:
Open an issue with:
- What you expected to happen
- What actually happened
- Steps to reproduce
- Your environment (K8s version, cloud provider, etc.)
- Explain your use case
- Describe the proposed solution
- Share why others might find it useful
Good First Issues:
- Documentation improvements
- Example applications (chatbot UI, RAG system, etc.)
- Additional model configurations
- Testing on different K8s platforms
Advanced Contributions:
- Distributed inference via llama.cpp RPC (70B+ models)
- Additional GPU vendors (AMD, Intel)
- Model format support (SafeTensors)
- Horizontal Pod Autoscaling (HPA)
- Performance optimizations
Before starting: Comment on the issue or open a discussion to avoid duplicate work.
- Tutorials and guides
- Deployment examples
- Use case write-ups
- Troubleshooting tips
We ship frequently with semantic versioning:
- Patch releases (0.2.x): Bug fixes, minor improvements - Monthly
- Minor releases (0.x.0): New features, backward compatible - Quarterly
- Major releases (x.0.0): Breaking changes, major milestones - Yearly
Past releases:
- v0.4.0 - November 2025 (multi-GPU support) ✅
- v0.4.1 - November 2025 (persistent model cache) ✅
- v0.4.9 - December 2025 (GPU scheduling & priority classes) ✅
- v0.4.13 - February 2026 (GGUF parser, init container customization, NetworkPolicy) ✅
- v0.5.0 - March 2026 (Metal agent, memory validation, health monitoring, benchmarking) ✅ Current
Upcoming releases:
- v0.6.0 - Q2 2026 (edge deployment, K3s, HPA)
- v1.0.0 - Q4 2026 (stable, production-ready)
What guides our development:
- Ease of Use - If it takes more than 5 minutes to deploy, we're doing it wrong
- Show, Don't Tell - Working examples over lengthy docs
- Performance Matters - GPU acceleration should be automatic and obvious
- Production-Ready - Observability and reliability from day one
- Community-First - Build what users actually need, not what we think they need
- Keep it Simple - Avoid over-engineering until there's proven demand
Your feedback shapes our roadmap! Tell us:
- What features would make LLMKube more useful for you?
- What's blocking you from using it in production?
- What models/use cases should we prioritize?
- What's confusing or hard to use?
Ways to share:
- 💬 GitHub Discussions
- 🐛 GitHub Issues
- ⭐ Star the repo if you find it useful!
Apache 2.0 - See LICENSE
Last Updated: March 2026 Next Review: June 2026
This roadmap is a living document. Priorities may shift based on community feedback and real-world usage.