Workload-Aware, Topology-Optimized Scheduler for Heterogeneous Kubernetes Clusters
KubeNexus optimizes last-mile placement across CPU and GPU fleets. It adapts placement strategy (pack vs spread) based on workload intent and preserves accelerator topology (NVLink/NUMA) to prevent fragmentation and tenant interference.
Works standalone or layered with Kueue for admission and fairness.
⚠️ Beta Status: Ready for testing in dev/staging. Production use should be carefully evaluated.
KubeNexus provides topology- and fragmentation-aware placement for multi-tenant GPU fleets under mixed workload intents.
The differentiator: Even after Kueue constrains nodes via FlavorFungibility, critical placement decisions remain:
- Heterogeneity within allowed nodes: GPU contiguity (8 free vs 6+2 fragmented), NVSwitch island locality, NUMA/NIC paths
- Workload-intent heterogeneity: Training (pack + preserve islands) vs Inference (spread + stability) vs Batch (opportunistic)
- Fragmentation prevention: Node affinity can't express "prefer nodes where 8 GPUs are contiguous AND preserve future 8-GPU placements"
Layered architecture:
- Standalone: Workload-aware placement + topology/interference control using native Kubernetes primitives (PriorityClasses, ResourceQuotas, namespaces)
- With Kueue: Kueue handles admission/quotas/flavors; KubeNexus optimizes node-level and topology-aware placement within admitted intent
Modern AI/ML infrastructure requires:
- Multiple Teams (Gold/Silver/Bronze tiers)
- Multiple Workload Types (Training/Inference/Service/Batch)
- Multiple Hardware Tiers (H100/A100/L40 GPUs)
Economic Waste: Bronze teams land on expensive H100s. Gold teams find no H100 capacity. Training jobs spread across zones. Service workloads bin-pack on one node. $960k/year wasted on $2.4M GPU infrastructure through poor placement.
Manual Complexity: Multiple scheduler profiles, complex pod specs, per-team configuration.
Automatic 3-Axis Placement:
- ✅ WHO (Tenant Tier): Gold→H100, Silver→A100, Bronze→L40
- ✅ WHAT (Workload Type): Training→bin pack, Service→spread
- ✅ WHERE (Hardware): NUMA, NVSwitch, GPU topology optimization
One scheduler. Zero manual configuration.
```yaml
# Every team needs custom pod specs
spec:
  nodeSelector:
    gpu-type: h100                     # Manual per-team
  schedulerName: training-scheduler    # Multiple profiles
```

```yaml
# Just use namespace + scheduler name
metadata:
  namespace: gold-team                 # Auto-detects tier
spec:
  schedulerName: kubenexus-scheduler
  # Automatically: Gold→H100, Training→bin-pack, NUMA-aligned
```

TenantHardware + VRAMScheduler route teams to appropriate GPU tiers and match VRAM requirements.
```yaml
# Gold tenant with 70B model (80GB VRAM)
metadata:
  namespace: gold-team
  labels:
    vram.scheduling.kubenexus.io/required: "80Gi"
spec:
  schedulerName: kubenexus-scheduler
# Result: Routes to H100-80GB, filters out A100-40GB
```

Value: $960k/year savings on $2.4M infrastructure through optimal placement.
📖 Details
Native K8s: Pick ONE strategy (spread OR bin-pack) for ALL pods.
KubeNexus: Adapts per workload automatically.
```yaml
# Training → Bin pack (GPU locality)
workload.kubenexus.io/type: training

# Service → Spread (high availability)
workload.kubenexus.io/type: service
```

Value: Optimal placement without multiple scheduler profiles.
📖 Details
All-or-nothing scheduling with cross-plugin awareness.
```yaml
metadata:
  labels:
    pod-group.scheduling.sigs.k8s.io/name: distributed-training
    pod-group.scheduling.sigs.k8s.io/min-available: "64"
# Gang of 64 GPUs schedules atomically or waits
# Works with: ResourceReservation, BackfillScoring, WorkloadAware
```

Value: Prevents partial gang placement and deadlock.
📖 Kubeflow Integration | Spark Integration | Details
2-3x faster GPU training through CPU/Memory/GPU topology alignment.
```yaml
annotations:
  numa.scheduling.kubenexus.io/policy: "single-numa"
  numa.scheduling.kubenexus.io/resources: "cpu,memory,nvidia.com/gpu"
```

Policies: `single-numa`, `restricted`, `best-effort`, `none`
📖 NUMA Guide | Quick Reference
Keeps distributed training within NVSwitch/NVLink domains (score 100) rather than crossing Ethernet (score 50).
📖 Details
Standalone capabilities (no admission controller needed):
- Tenant-aware placement: Gold→premium GPUs, Bronze→economy GPUs
- Fragmentation prevention: Blocks interference (Bronze jobs don't fragment Gold's 8-GPU pools)
- Preemption hierarchy: Gold can preempt Silver/Bronze
- Starvation prevention: Age-based priority boost after 60s
- Backfill placement: Bronze uses idle Gold capacity (preempted when Gold returns)
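As an illustrative sketch, the standalone capabilities above are driven by the namespace tier label from the Quick Start plus the per-pod workload-type label; the image name here is a placeholder, and any additional backfill/priority annotations are not shown:

```yaml
# Bronze batch pod: tier comes from the namespace label
# (scheduling.kubenexus.io/tier=bronze), intent from the workload label.
apiVersion: v1
kind: Pod
metadata:
  namespace: bronze-team
  labels:
    workload.kubenexus.io/type: batch  # opportunistic: may backfill idle Gold capacity
spec:
  schedulerName: kubenexus-scheduler
  containers:
    - name: batch-job
      image: my-batch-image            # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 1
```

With this spec, the pod can be preempted from borrowed Gold capacity when Gold workloads return, per the backfill behavior described above.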
Example: Fragmentation interference AFTER admission

```yaml
# Scenario: Both tenants admitted by Kueue (quotas satisfied ✅)
# Tenant A: 100x 1-GPU inference pods
# Tenant B: 10x 8-GPU training jobs

# Without KubeNexus (after Kueue admits both):
# - Default scheduler places A's 1-GPU pods randomly
# - Result: Every node has 1-2 free GPUs, none has 8 contiguous
# - B's jobs fail to schedule despite passing quota check

# With KubeNexus (after Kueue admits both):
# - A's 1-GPU pods: Fill nodes completely OR leave clean 8-GPU islands
# - B's 8-GPU jobs: Find preserved contiguous GPU sets
# - Both tenants place successfully
```

Multi-tenancy at the placement-quality layer: Kueue ensures fair admission; KubeNexus prevents topology fragmentation that breaks feasibility.
With Kueue integration (adds admission control):
- Quotas & fairness: ResourceQuotas, cohort borrowing, weighted fair share
- Queue management: Prevents cluster flooding, prioritizes admission
- Kueue FlavorFungibility: Kueue admits, KubeNexus optimizes node placement within flavor
```yaml
# Kueue admits pod (quota check) → KubeNexus schedules (topology optimization)
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  annotations:
    scheduling.kubenexus.io/tier: "gold"
```

📖 Details | Architecture
```bash
# 1. Install CRDs
kubectl apply -f config/crd-workload.yaml
kubectl apply -f config/crd-resourcereservation.yaml

# 2. Deploy KubeNexus Scheduler
kubectl apply -f deploy/kubenexus-scheduler.yaml

# 3. Label namespaces with tenant tiers
kubectl label namespace gold-team scheduling.kubenexus.io/tier=gold
kubectl label namespace bronze-team scheduling.kubenexus.io/tier=bronze
```

```yaml
# 4. Use in pods
apiVersion: v1
kind: Pod
metadata:
  namespace: gold-team
spec:
  schedulerName: kubenexus-scheduler
  containers:
    - name: training
      resources:
        requests:
          nvidia.com/gpu: 8
```

📖 Complete Installation Guide | GPU Cluster Guide
Every pod gets classified into a SchedulingProfile (WHO + WHAT):
```go
type SchedulingProfile struct {
    TenantTier   TenantTier   // gold / silver / bronze
    WorkloadType WorkloadType // training / service / batch
    // ... more fields
}
```

All plugins read this profile for intelligent decisions.
```
PreFilter  → ProfileClassifier (classify WHO + WHAT)
    ↓
Filter     → ResourceReservation, NUMATopology (feasibility)
    ↓
Score      → TenantHardware, WorkloadAware, VRAMScheduler, NetworkFabric (optimization)
    ↓
Permit     → Coscheduling (gang coordination)
    ↓
PostFilter → GangPreemption (atomic preemption)
```
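Assuming the plugins are registered under the names shown in the pipeline above (the exact registration names and `apiVersion` should be checked against the deployment manifests), a scheduler profile wiring them into the standard scheduling-framework extension points might look like:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: kubenexus-scheduler
    plugins:
      preFilter:
        enabled:
          - name: ProfileClassifier   # classify WHO + WHAT
      filter:
        enabled:
          - name: ResourceReservation
          - name: NUMATopology
      score:
        enabled:
          - name: TenantHardware
          - name: WorkloadAware
          - name: VRAMScheduler
          - name: NetworkFabric
      permit:
        enabled:
          - name: Coscheduling        # gang coordination
      postFilter:
        enabled:
          - name: GangPreemption      # atomic preemption
```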
📖 Full Architecture | Design Decisions
Architecture: Kueue (admission control) + KubeNexus (placement optimization)
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  annotations:
    scheduling.kubenexus.io/tier: "gold"
```

Flow:
- Kueue checks quota → admits pod to the cluster
- KubeNexus optimizes node placement (topology, fragmentation, NUMA)
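For example, a pod submitted through the annotated LocalQueue carries the standard Kueue queue-name label while still using the KubeNexus scheduler for placement; the queue name `gold-queue` and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: gold-team
  labels:
    kueue.x-k8s.io/queue-name: gold-queue  # illustrative LocalQueue name
spec:
  schedulerName: kubenexus-scheduler       # KubeNexus places the pod after Kueue admits it
  containers:
    - name: training
      image: my-training-image             # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 8
```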
- Kubeflow Training/MPI Operators: Gang scheduling + intelligent placement
- Spark Operator: Driver/executor anti-affinity
- Ray Operator: Head/worker placement strategies
- PyTorch/TensorFlow Operators: Distributed training optimization
📖 Kubeflow Integration | Spark Integration | Operator Support
| Feature | KubeNexus | Volcano | YuniKorn | Kueue | Native K8s |
|---|---|---|---|---|---|
| Multi-Tenant GPU Routing | ✅ Automatic | ❌ Manual nodeSelector | ❌ Manual | ❌ (FlavorFungibility only) | ❌ Manual |
| Workload-Aware Placement | ✅ Auto per-pod | ❌ Global policy | ❌ Global | ❌ | ❌ |
| NUMA Topology | ✅ CPU+Mem+GPU | Basic | ❌ | ❌ | ❌ |
| GPU Fragmentation Prevention | ✅ Tenant-aware | ❌ | ❌ | ❌ | ❌ |
| VRAM Scheduling | ✅ Utilization-based | ❌ | ❌ | ❌ | ❌ |
| Gang Scheduling | ✅ Cross-plugin | ✅ Basic | ✅ Basic | ❌ | ❌ |
| Admission Control | ✅ Via Kueue | ✅ Built-in | ✅ Built-in | ✅ Core feature | ResourceQuota |
| Best For | Multi-tenant heterogeneous GPU | Batch jobs | Large multi-tenant | Quota management | Simple workloads |
📖 Detailed Comparison | vs Upstream | Competitive Advantage
- User Guide: docs/USER_GUIDE.md
- GPU Cluster Setup: docs/GPU_CLUSTER_GUIDE.md
- NUMA Scheduling: docs/NUMA_SCHEDULING_GUIDE.md
- Features Deep Dive: docs/FEATURES.md
- Architecture: docs/ARCHITECTURE.md
- Testing Guide: docs/TESTING_GUIDE.md
- Kubeflow Integration: docs/KUBEFLOW_INTEGRATION.md
- Spark Integration: docs/SPARK_OPERATOR_INTEGRATION.md
v0.2 (Current):
- ✅ ProfileClassifier (tenant + workload classification)
- ✅ Gang scheduling with cross-plugin awareness
- ✅ NUMA topology scheduling
- ✅ Network fabric-aware placement
- ✅ Kueue integration (layered architecture)

v0.3 (Next - Topology & Placement):
- ✅ DRA (Dynamic Resource Allocation) integration for GPU topology discovery
- ✅ 3-tier fallback strategy (DRA → NFD → Manual Labels) for K8s version compatibility
- ⏳ Enhanced preemption with checkpoint/restore awareness
- ⏳ Multi-cluster scheduling federation
- ⏳ Advanced metrics & observability dashboards
v0.4+ (Advanced Placement):
- GPU time-slicing with topology awareness
- NFD integration for auto-discovery
- NodeResourceTopology CRD support
- Enhanced Kueue interop (read ClusterQueue quotas for better backfill)
Not on Roadmap (Use Kueue Instead):
- ❌ DRF & Weighted Fair Share → use Kueue ClusterQueue fair sharing
- ❌ Global quota enforcement → use Kueue ResourceQuota/ResourceFlavor
- ❌ Admission control → use Kueue admission policies
Architectural Principle: KubeNexus optimizes placement (WHERE on nodes), Kueue handles admission (WHO gets resources). Don't reinvent the wheel.
We welcome contributions! See CONTRIBUTING.md for guidelines.
Issues, PRs, and feedback: github.com/kubenexus/scheduler
- Documentation: docs/
- Discussions: GitHub Discussions
- Issues: GitHub Issues
- Security: SECURITY.md
Apache License 2.0 - See LICENSE for details.