
KubeNexus Scheduler


Workload-Aware, Topology-Optimized Scheduler for Heterogeneous Kubernetes Clusters

KubeNexus optimizes last-mile placement across CPU and GPU fleets. It adapts placement strategy (pack vs spread) based on workload intent and preserves accelerator topology (NVLink/NUMA) to prevent fragmentation and tenant interference.

Works standalone or layered with Kueue for admission and fairness.

⚠️ Beta Status: Ready for testing in dev/staging. Production use should be carefully evaluated.


Positioning

KubeNexus provides topology- and fragmentation-aware placement for multi-tenant GPU fleets under mixed workload intents.

The differentiator: Even after Kueue constrains nodes via FlavorFungibility, critical placement decisions remain:

  • Heterogeneity within allowed nodes: GPU contiguity (8 free vs 6+2 fragmented), NVSwitch island locality, NUMA/NIC paths
  • Workload-intent heterogeneity: Training (pack + preserve islands) vs Inference (spread + stability) vs Batch (opportunistic)
  • Fragmentation prevention: Node affinity can't express "prefer nodes where 8 GPUs are contiguous AND preserve future 8-GPU placements"

Layered architecture:

  • Standalone: Workload-aware placement + topology/interference control using native Kubernetes primitives (PriorityClasses, ResourceQuotas, namespaces)
  • With Kueue: Kueue handles admission/quotas/flavors; KubeNexus optimizes node-level and topology-aware placement within admitted intent

The Problem

Modern AI/ML infrastructure requires:

  • Multiple Teams (Gold/Silver/Bronze tiers)
  • Multiple Workload Types (Training/Inference/Service/Batch)
  • Multiple Hardware Tiers (H100/A100/L40 GPUs)

Economic Waste: Bronze teams land on expensive H100s. Gold teams find no H100 capacity. Training jobs spread across zones. Service workloads bin-pack on one node. $960k/year wasted on $2.4M GPU infrastructure through poor placement.

Manual Complexity: Multiple scheduler profiles, complex pod specs, per-team configuration.

KubeNexus Solution

Automatic 3-Axis Placement:

✅ WHO (Tenant Tier): Gold→H100, Silver→A100, Bronze→L40
✅ WHAT (Workload Type): Training→bin pack, Service→spread
✅ WHERE (Hardware): NUMA, NVSwitch, GPU topology optimization

One scheduler. Zero manual configuration.


Quick Example

Before (Manual Configuration)

# Every team needs custom pod specs
spec:
  nodeSelector:
    gpu-type: h100          # Manual per-team
  schedulerName: training-scheduler  # Multiple profiles

After (Automatic)

# Just use namespace + scheduler name
metadata:
  namespace: gold-team      # Auto-detects tier
spec:
  schedulerName: kubenexus-scheduler
# Automatically: Gold→H100, Training→bin-pack, NUMA-aligned

Key Features

💰 Economic Multi-Tenant GPU Scheduling

TenantHardware + VRAMScheduler route teams to appropriate GPU tiers and match VRAM requirements.

# Gold tenant with 70B model (80GB VRAM)
metadata:
  namespace: gold-team
  labels:
    vram.scheduling.kubenexus.io/required: "80Gi"
spec:
  schedulerName: kubenexus-scheduler
# Result: Routes to H100-80GB, filters out A100-40GB

Value: $960k/year savings on $2.4M infrastructure through optimal placement.
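As an illustration of the VRAM-matching step, here is a minimal Go sketch that filters candidate nodes by per-GPU VRAM. The `nodeGPU` type and its fields are hypothetical stand-ins for what the real plugin would read from node labels, not the plugin's actual API.

```go
package main

import "fmt"

// nodeGPU is a simplified, hypothetical view of a node's GPU flavor.
type nodeGPU struct {
	Name     string
	VRAMGiB  int // per-GPU VRAM
	FreeGPUs int
}

// filterByVRAM keeps only nodes whose per-GPU VRAM can satisfy the
// pod's declared requirement (the vram.scheduling.kubenexus.io/required label).
func filterByVRAM(nodes []nodeGPU, requiredGiB int) []nodeGPU {
	var fit []nodeGPU
	for _, n := range nodes {
		if n.VRAMGiB >= requiredGiB && n.FreeGPUs > 0 {
			fit = append(fit, n)
		}
	}
	return fit
}

func main() {
	nodes := []nodeGPU{
		{"h100-node", 80, 4},
		{"a100-node", 40, 8},
	}
	// Only the H100 node survives an 80Gi requirement.
	for _, n := range filterByVRAM(nodes, 80) {
		fmt.Println(n.Name)
	}
}
```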

📖 Details

🔄 Workload-Aware Placement

Native K8s: Pick ONE strategy (spread OR bin-pack) for ALL pods.
KubeNexus: Adapts per workload automatically.

# Training → Bin pack (GPU locality)
workload.kubenexus.io/type: training

# Service → Spread (high availability)
workload.kubenexus.io/type: service

Value: Optimal placement without multiple scheduler profiles.
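The pack-vs-spread idea reduces to per-workload node scoring. This Go sketch uses an illustrative utilization-based formula, not the plugin's actual scoring:

```go
package main

import "fmt"

// scoreNode sketches per-workload strategy selection: training packs
// (fuller nodes score higher, for GPU locality), service spreads
// (emptier nodes score higher, for availability).
func scoreNode(workloadType string, usedGPUs, totalGPUs int) int {
	if totalGPUs == 0 {
		return 0
	}
	utilization := usedGPUs * 100 / totalGPUs
	switch workloadType {
	case "training":
		return utilization // bin pack
	case "service":
		return 100 - utilization // spread
	default:
		return 50 // batch/unknown: neutral
	}
}

func main() {
	fmt.Println(scoreNode("training", 6, 8)) // 75: training prefers the fuller node
	fmt.Println(scoreNode("service", 6, 8))  // 25: service avoids it
}
```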

📖 Details

🎯 Gang Scheduling

All-or-nothing scheduling with cross-plugin awareness.

metadata:
  labels:
    pod-group.scheduling.sigs.k8s.io/name: distributed-training
    pod-group.scheduling.sigs.k8s.io/min-available: "64"
# Gang of 64 GPUs schedules atomically or waits
# Works with: ResourceReservation, BackfillScoring, WorkloadAware

Value: Prevents partial gang placement and deadlock.
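The all-or-nothing rule can be sketched as a Permit-phase decision. Function names and shapes here are hypothetical, not the plugin's real interface:

```go
package main

import "fmt"

// gangReady sketches the Permit check: a pod is released only once at
// least minAvailable members of its pod group hold node assignments;
// otherwise the whole gang waits, so no partial placement happens.
func gangReady(assigned, minAvailable int) string {
	if assigned >= minAvailable {
		return "allow" // release the whole gang
	}
	return "wait" // hold this pod until the gang is complete
}

func main() {
	fmt.Println(gangReady(63, 64)) // wait
	fmt.Println(gangReady(64, 64)) // allow
}
```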

📖 Kubeflow Integration | Spark Integration | Details

🧠 NUMA-Aware Scheduling

2-3x faster GPU training through CPU/Memory/GPU topology alignment.

annotations:
  numa.scheduling.kubenexus.io/policy: "single-numa"
  numa.scheduling.kubenexus.io/resources: "cpu,memory,nvidia.com/gpu"

Policies: single-numa, restricted, best-effort, none
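A sketch of what the single-numa policy checks, assuming a simplified per-NUMA free-capacity view (the real plugin tracks more resource types):

```go
package main

import "fmt"

// numaNode is a simplified view of one NUMA node's free capacity.
type numaNode struct {
	FreeCPU, FreeGPU int
}

// fitsSingleNUMA implements the single-numa idea: every requested
// resource must be satisfiable from one NUMA node's free capacity.
func fitsSingleNUMA(nodes []numaNode, wantCPU, wantGPU int) bool {
	for _, n := range nodes {
		if n.FreeCPU >= wantCPU && n.FreeGPU >= wantGPU {
			return true // CPU, memory, GPU all align on one NUMA node
		}
	}
	return false // would require a cross-NUMA split; node is filtered out
}

func main() {
	nodes := []numaNode{{FreeCPU: 16, FreeGPU: 2}, {FreeCPU: 32, FreeGPU: 0}}
	fmt.Println(fitsSingleNUMA(nodes, 8, 2))  // true: first NUMA node fits both
	fmt.Println(fitsSingleNUMA(nodes, 24, 2)) // false: CPU and GPU would split
}
```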

📖 NUMA Guide | Quick Reference

🌐 Network Fabric-Aware

Keeps distributed training within NVSwitch/NVLink domains (100 score) vs Ethernet (50 score).
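The 100-vs-50 scoring reduces to a small function. This sketch assumes a per-node fabric-domain identifier has already been resolved (how that label is discovered is outside this snippet):

```go
package main

import "fmt"

// fabricScore mirrors the scoring above: nodes inside the gang's
// NVSwitch/NVLink domain score 100, Ethernet-only paths score 50.
func fabricScore(nodeDomain, gangDomain string) int {
	if nodeDomain != "" && nodeDomain == gangDomain {
		return 100 // stays within the high-bandwidth island
	}
	return 50 // reachable, but over the slower fabric
}

func main() {
	fmt.Println(fabricScore("nvswitch-0", "nvswitch-0")) // 100
	fmt.Println(fabricScore("nvswitch-1", "nvswitch-0")) // 50
}
```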

📖 Details

⚖️ Multi-Tenant Placement Quality

Standalone capabilities (no admission controller needed):

  • Tenant-aware placement: Gold→premium GPUs, Bronze→economy GPUs
  • Fragmentation prevention: Blocks interference (Bronze jobs don't fragment Gold's 8-GPU pools)
  • Preemption hierarchy: Gold can preempt Silver/Bronze
  • Starvation prevention: Age-based priority boost after 60s
  • Backfill placement: Bronze uses idle Gold capacity (preempted when Gold returns)

Example: Fragmentation interference AFTER admission

# Scenario: Both tenants admitted by Kueue (quotas satisfied ✅)
# Tenant A: 100x 1-GPU inference pods
# Tenant B: 10x 8-GPU training jobs

# Without KubeNexus (after Kueue admits both):
# - Default scheduler places A's 1-GPU pods randomly
# - Result: Every node has 1-2 free GPUs, none has 8 contiguous
# - B's jobs fail to schedule despite passing quota check

# With KubeNexus (after Kueue admits both):
# - A's 1-GPU pods: Fill nodes completely OR leave clean 8-GPU islands
# - B's 8-GPU jobs: Find preserved contiguous GPU sets
# - Both tenants place successfully

Multi-tenancy at the placement-quality layer: Kueue ensures fair admission; KubeNexus prevents topology fragmentation that breaks feasibility.
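One way the island-preservation heuristic for small pods could look, as an illustrative Go sketch (the actual plugin's scoring is richer):

```go
package main

import "fmt"

// fragmentationScore scores a node for a 1-GPU pod: fill nodes
// completely when possible, and avoid carving into clean islands
// that could still host a future 8-GPU gang.
func fragmentationScore(freeGPUs, totalGPUs int) int {
	switch {
	case freeGPUs == totalGPUs:
		return 10 // clean island: placing here would fragment it
	case freeGPUs == 1:
		return 100 // topping off a node keeps other islands intact
	default:
		return 100 - freeGPUs*10 // fuller nodes score higher
	}
}

func main() {
	fmt.Println(fragmentationScore(1, 8)) // 100: fill this node
	fmt.Println(fragmentationScore(8, 8)) // 10: preserve the island
	fmt.Println(fragmentationScore(3, 8)) // 70
}
```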

With Kueue integration (adds admission control):

  • Quotas & fairness: ResourceQuotas, cohort borrowing, weighted fair share
  • Queue management: Prevents cluster flooding, prioritizes admission
  • Kueue FlavorFungibility: Kueue admits, KubeNexus optimizes node placement within flavor
# Kueue admits pod (quota check) → KubeNexus schedules (topology optimization)
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  annotations:
    scheduling.kubenexus.io/tier: "gold"

📖 Details | Architecture


Installation

# 1. Install CRDs
kubectl apply -f config/crd-workload.yaml
kubectl apply -f config/crd-resourcereservation.yaml

# 2. Deploy KubeNexus Scheduler
kubectl apply -f deploy/kubenexus-scheduler.yaml

# 3. Label namespaces with tenant tiers
kubectl label namespace gold-team scheduling.kubenexus.io/tier=gold
kubectl label namespace bronze-team scheduling.kubenexus.io/tier=bronze

# 4. Use in pods
apiVersion: v1
kind: Pod
metadata:
  namespace: gold-team
spec:
  schedulerName: kubenexus-scheduler
  containers:
  - name: training
    resources:
      requests:
        nvidia.com/gpu: 8

📖 Complete Installation Guide | GPU Cluster Guide


Architecture

ProfileClassifier: Tenant + Workload Identity

Every pod gets classified into a SchedulingProfile (WHO + WHAT):

type SchedulingProfile struct {
    TenantTier    TenantTier   // gold / silver / bronze
    WorkloadType  WorkloadType // training / service / batch
    // ... more fields
}

All plugins read this profile for intelligent decisions.
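A minimal sketch of that classification, assuming the tier comes from the namespace label and the workload type from the pod label shown elsewhere in this README; the bronze/batch defaults are assumptions, not documented behavior:

```go
package main

import "fmt"

// SchedulingProfile is a trimmed-down version of the struct above.
type SchedulingProfile struct {
	TenantTier   string
	WorkloadType string
}

// classify sketches the PreFilter step: WHO from the namespace label,
// WHAT from the pod label, with conservative defaults when unlabeled.
func classify(nsLabels, podLabels map[string]string) SchedulingProfile {
	p := SchedulingProfile{TenantTier: "bronze", WorkloadType: "batch"}
	if t, ok := nsLabels["scheduling.kubenexus.io/tier"]; ok {
		p.TenantTier = t
	}
	if w, ok := podLabels["workload.kubenexus.io/type"]; ok {
		p.WorkloadType = w
	}
	return p
}

func main() {
	p := classify(
		map[string]string{"scheduling.kubenexus.io/tier": "gold"},
		map[string]string{"workload.kubenexus.io/type": "training"},
	)
	fmt.Println(p.TenantTier, p.WorkloadType) // gold training
}
```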

Plugin Pipeline

PreFilter → ProfileClassifier (classify WHO + WHAT)
  ↓
Filter → ResourceReservation, NUMATopology (feasibility)
  ↓
Score → TenantHardware, WorkloadAware, VRAMScheduler, NetworkFabric (optimization)
  ↓
Permit → Coscheduling (gang coordination)
  ↓
PostFilter → GangPreemption (atomic preemption)

📖 Full Architecture | Design Decisions


Integrations

Kueue Integration

Architecture: Kueue (admission control) + KubeNexus (placement optimization)

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  annotations:
    scheduling.kubenexus.io/tier: "gold"

Flow:

  1. Kueue checks quota → Admits pod to cluster
  2. KubeNexus optimizes node placement (topology, fragmentation, NUMA)

📖 Kueue Integration Guide

Operator Support

  • Kubeflow Training/MPI Operators: Gang scheduling + intelligent placement
  • Spark Operator: Driver/executor anti-affinity
  • Ray Operator: Head/worker placement strategies
  • PyTorch/TensorFlow Operators: Distributed training optimization

📖 Kubeflow Integration | Spark Integration | Operator Support


Comparison

| Feature | KubeNexus | Volcano | YuniKorn | Kueue | Native K8s |
|---|---|---|---|---|---|
| Multi-Tenant GPU Routing | ✅ Automatic | ❌ Manual nodeSelector | ❌ Manual | ❌ (FlavorFungibility only) | ❌ Manual |
| Workload-Aware Placement | ✅ Auto per-pod | ❌ Global policy | ❌ Global | ❌ | ❌ |
| NUMA Topology | ✅ CPU+Mem+GPU | Basic | ❌ | ❌ | ❌ |
| GPU Fragmentation Prevention | ✅ Tenant-aware | ❌ | ❌ | ❌ | ❌ |
| VRAM Scheduling | ✅ Utilization-based | ❌ | ❌ | ❌ | ❌ |
| Gang Scheduling | ✅ Cross-plugin | ✅ Basic | ✅ Basic | ✅ | ❌ |
| Admission Control | ➕ Via Kueue | ✅ Built-in | ✅ Built-in | ✅ Core feature | ResourceQuota |
| Best For | Multi-tenant heterogeneous GPU | Batch jobs | Large multi-tenant | Quota management | Simple workloads |

📖 Detailed Comparison | vs Upstream | Competitive Advantage


Documentation


Roadmap

v0.2 (Current):

  • ✅ ProfileClassifier (tenant + workload classification)
  • ✅ Gang scheduling with cross-plugin awareness
  • ✅ NUMA topology scheduling
  • ✅ Network fabric-aware placement
  • ✅ Kueue integration (layered architecture)

v0.3 (Next - Topology & Placement):

  • ✅ DRA (Dynamic Resource Allocation) integration for GPU topology discovery
  • ✅ 3-tier fallback strategy (DRA → NFD → Manual Labels) for K8s version compatibility
  • ⏳ Enhanced preemption with checkpoint/restore awareness
  • ⏳ Multi-cluster scheduling federation
  • ⏳ Advanced metrics & observability dashboards

v0.4+ (Advanced Placement):

  • GPU time-slicing with topology awareness
  • NFD integration for auto-discovery
  • NodeResourceTopology CRD support
  • Enhanced Kueue interop (read ClusterQueue quotas for better backfill)

Not on Roadmap (Use Kueue Instead):

  • ❌ DRF & Weighted Fair Share → Use Kueue ClusterQueue fair sharing
  • ❌ Global quota enforcement → Use Kueue ResourceQuota/ResourceFlavor
  • ❌ Admission control → Use Kueue admission policies

Architectural Principle: KubeNexus optimizes placement (WHERE on nodes), Kueue handles admission (WHO gets resources). Don't reinvent the wheel.


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Issues, PRs, and feedback: github.com/kubenexus/scheduler


Community & Support

  • Documentation: docs/
  • Discussions: GitHub Discussions
  • Issues: GitHub Issues
  • Security: SECURITY.md

License

Apache License 2.0 - See LICENSE for details.