Workload-Aware, Topology-Optimized Scheduler for Heterogeneous Kubernetes Clusters
KubeNexus optimizes last-mile placement across CPU and GPU fleets. It adapts placement strategy (pack vs spread) based on workload intent and preserves accelerator topology (NVLink/NUMA) to prevent fragmentation and tenant interference.
Works standalone or layered with Kueue for admission and fairness.
⚠️ Beta Status: Ready for testing in dev/staging. Production use should be carefully evaluated.
KubeNexus provides topology- and fragmentation-aware placement for multi-tenant GPU fleets under mixed workload intents.
The differentiator: Even after Kueue constrains nodes via FlavorFungibility, critical placement decisions remain:
- Heterogeneity within allowed nodes: GPU contiguity (8 free vs 6+2 fragmented), NVSwitch island locality, NUMA/NIC paths
- Workload-intent heterogeneity: Training (pack + preserve islands) vs Inference (spread + stability) vs Batch (opportunistic)
- Fragmentation prevention: Node affinity can't express "prefer nodes where 8 GPUs are contiguous AND preserve future 8-GPU placements"
Layered architecture:
- Standalone: Workload-aware placement + topology/interference control using native Kubernetes primitives (PriorityClasses, ResourceQuotas, namespaces)
- With Kueue: Kueue handles admission/quotas/flavors; KubeNexus optimizes node-level and topology-aware placement within admitted intent
Modern AI/ML infrastructure requires:
- Multiple Teams (Gold/Silver/Bronze tiers)
- Multiple Workload Types (Training/Inference/Service/Batch)
- Multiple Hardware Tiers (H100/A100/L40 GPUs)
Economic Waste: Bronze teams land on expensive H100s. Gold teams find no H100 capacity. Training jobs spread across zones. Service workloads bin-pack on one node. $960k/year wasted on $2.4M GPU infrastructure through poor placement.
Manual Complexity: Multiple scheduler profiles, complex pod specs, per-team configuration.
Automatic 3-Axis Placement:
- ✅ WHO (Tenant Tier): Gold→H100, Silver→A100, Bronze→L40
- ✅ WHAT (Workload Type): Training→bin pack, Service→spread
- ✅ WHERE (Hardware): NUMA, NVSwitch, GPU topology optimization
One scheduler. Zero manual configuration.
```yaml
# Every team needs custom pod specs
spec:
  nodeSelector:
    gpu-type: h100                     # Manual per-team
  schedulerName: training-scheduler    # Multiple profiles
```

```yaml
# Just use namespace + scheduler name
metadata:
  namespace: gold-team                 # Auto-detects tier
spec:
  schedulerName: kubenexus-scheduler
  # Automatically: Gold→H100, Training→bin-pack, NUMA-aligned
```

TenantHardware + VRAMScheduler route teams to appropriate GPU tiers and match VRAM requirements.
```yaml
# Gold tenant with 70B model (80GB VRAM)
metadata:
  namespace: gold-team
  labels:
    vram.scheduling.kubenexus.io/required: "80Gi"
spec:
  schedulerName: kubenexus-scheduler
# Result: Routes to H100-80GB, filters out A100-40GB
```

Value: $960k/year savings on $2.4M infrastructure through optimal placement.
📖 Details
Native K8s: Pick ONE strategy (spread OR bin-pack) for ALL pods.
KubeNexus: Adapts per workload automatically.
```yaml
# Training → Bin pack (GPU locality)
workload.kubenexus.io/type: training

# Service → Spread (high availability)
workload.kubenexus.io/type: service
```

Value: Optimal placement without multiple scheduler profiles.
📖 Details
All-or-nothing scheduling with cross-plugin awareness.
```yaml
metadata:
  labels:
    pod-group.scheduling.sigs.k8s.io/name: distributed-training
    pod-group.scheduling.sigs.k8s.io/min-available: "64"
# Gang of 64 GPUs schedules atomically or waits
# Works with: ResourceReservation, BackfillScoring, WorkloadAware
```

Value: Prevents partial gang placement and deadlock.
📖 Kubeflow Integration | Spark Integration | Details
2-3x faster GPU training through CPU/Memory/GPU topology alignment.
```yaml
annotations:
  numa.scheduling.kubenexus.io/policy: "single-numa"
  numa.scheduling.kubenexus.io/resources: "cpu,memory,nvidia.com/gpu"
```

Policies: `single-numa`, `restricted`, `best-effort`, `none`
📖 NUMA Guide | Quick Reference
Keeps distributed training within NVSwitch/NVLink domains (score 100) rather than crossing Ethernet (score 50).
📖 Details
Standalone capabilities (no admission controller needed):
- Tenant-aware placement: Gold→premium GPUs, Bronze→economy GPUs
- Fragmentation prevention: Blocks interference (Bronze jobs don't fragment Gold's 8-GPU pools)
- Preemption hierarchy: Gold can preempt Silver/Bronze
- Starvation prevention: Age-based priority boost after 60s
- Backfill placement: Bronze uses idle Gold capacity (preempted when Gold returns)
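As an illustrative sketch, the standalone capabilities above are driven by the namespace tier label from the Quick Start plus the per-pod workload-type label; the image name here is a placeholder, and any additional backfill/priority annotations are not shown:

```yaml
# Bronze batch pod: tier comes from the namespace label
# (scheduling.kubenexus.io/tier=bronze), intent from the workload label.
apiVersion: v1
kind: Pod
metadata:
  namespace: bronze-team
  labels:
    workload.kubenexus.io/type: batch  # opportunistic: may backfill idle Gold capacity
spec:
  schedulerName: kubenexus-scheduler
  containers:
    - name: batch-job
      image: my-batch-image            # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 1
```

With this spec, the pod can be preempted from borrowed Gold capacity when Gold workloads return, per the backfill behavior described above.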
Example: Fragmentation interference AFTER admission

```yaml
# Scenario: Both tenants admitted by Kueue (quotas satisfied ✅)
# Tenant A: 100x 1-GPU inference pods
# Tenant B: 10x 8-GPU training jobs

# Without KubeNexus (after Kueue admits both):
# - Default scheduler places A's 1-GPU pods randomly
# - Result: Every node has 1-2 free GPUs, none has 8 contiguous
# - B's jobs fail to schedule despite passing quota check

# With KubeNexus (after Kueue admits both):
# - A's 1-GPU pods: Fill nodes completely OR leave clean 8-GPU islands
# - B's 8-GPU jobs: Find preserved contiguous GPU sets
# - Both tenants place successfully
```

Multi-tenancy at the placement-quality layer: Kueue ensures fair admission; KubeNexus prevents topology fragmentation that breaks feasibility.
With Kueue integration (adds admission control):
- Quotas & fairness: ResourceQuotas, cohort borrowing, weighted fair share
- Queue management: Prevents cluster flooding, prioritizes admission
- Kueue FlavorFungibility: Kueue admits, KubeNexus optimizes node placement within flavor
```yaml
# Kueue admits pod (quota check) → KubeNexus schedules (topology optimization)
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  annotations:
    scheduling.kubenexus.io/tier: "gold"
```

📖 Details | Architecture
```bash
# 1. Install CRDs
kubectl apply -f config/crd-workload.yaml
kubectl apply -f config/crd-resourcereservation.yaml

# 2. Deploy KubeNexus Scheduler
kubectl apply -f deploy/kubenexus-scheduler.yaml

# 3. Label namespaces with tenant tiers
kubectl label namespace gold-team scheduling.kubenexus.io/tier=gold
kubectl label namespace bronze-team scheduling.kubenexus.io/tier=bronze
```

```yaml
# 4. Use in pods
apiVersion: v1
kind: Pod
metadata:
  namespace: gold-team
spec:
  schedulerName: kubenexus-scheduler
  containers:
    - name: training
      resources:
        requests:
          nvidia.com/gpu: 8
```

📖 Complete Installation Guide | GPU Cluster Guide
Every pod gets classified into a SchedulingProfile (WHO + WHAT):
```go
type SchedulingProfile struct {
    TenantTier   TenantTier   // gold / silver / bronze
    WorkloadType WorkloadType // training / service / batch
    // ... more fields
}
```

All plugins read this profile for intelligent decisions.
```
PreFilter  → ProfileClassifier (classify WHO + WHAT)
    ↓
Filter     → ResourceReservation, NUMATopology (feasibility)
    ↓
Score      → TenantHardware, WorkloadAware, VRAMScheduler, NetworkFabric (optimization)
    ↓
Permit     → Coscheduling (gang coordination)
    ↓
PostFilter → GangPreemption (atomic preemption)
```
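Assuming the plugins are registered under the names shown in the pipeline above (the exact registration names and `apiVersion` should be checked against the deployment manifests), a scheduler profile wiring them into the standard scheduling-framework extension points might look like:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: kubenexus-scheduler
    plugins:
      preFilter:
        enabled:
          - name: ProfileClassifier   # classify WHO + WHAT
      filter:
        enabled:
          - name: ResourceReservation
          - name: NUMATopology
      score:
        enabled:
          - name: TenantHardware
          - name: WorkloadAware
          - name: VRAMScheduler
          - name: NetworkFabric
      permit:
        enabled:
          - name: Coscheduling        # gang coordination
      postFilter:
        enabled:
          - name: GangPreemption      # atomic preemption
```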
📖 Full Architecture | Design Decisions
Architecture: Kueue (admission control) + KubeNexus (placement optimization)
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  annotations:
    scheduling.kubenexus.io/tier: "gold"
```

Flow:
- Kueue checks quota → admits pod to the cluster
- KubeNexus optimizes node placement (topology, fragmentation, NUMA)
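For example, a pod submitted through the annotated LocalQueue carries the standard Kueue queue-name label while still using the KubeNexus scheduler for placement; the queue name `gold-queue` and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: gold-team
  labels:
    kueue.x-k8s.io/queue-name: gold-queue  # illustrative LocalQueue name
spec:
  schedulerName: kubenexus-scheduler       # KubeNexus places the pod after Kueue admits it
  containers:
    - name: training
      image: my-training-image             # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 8
```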
- Kubeflow Training/MPI Operators: Gang scheduling + intelligent placement
- Spark Operator: Driver/executor anti-affinity
- Ray Operator: Head/worker placement strategies
- PyTorch/TensorFlow Operators: Distributed training optimization
📖 Kubeflow Integration | Spark Integration | Operator Support
| Feature | KubeNexus | Volcano | YuniKorn | Kueue | Native K8s |
|---|---|---|---|---|---|
| Multi-Tenant GPU Routing | ✅ Automatic | ❌ Manual nodeSelector | ❌ Manual | ❌ (FlavorFungibility only) | ❌ Manual |
| Workload-Aware Placement | ✅ Auto per-pod | ❌ Global policy | ❌ Global | ❌ | ❌ |
| NUMA Topology | ✅ CPU+Mem+GPU | Basic | ❌ | ❌ | ❌ |
| GPU Fragmentation Prevention | ✅ Tenant-aware | ❌ | ❌ | ❌ | ❌ |
| VRAM Scheduling | ✅ Utilization-based | ❌ | ❌ | ❌ | ❌ |
| Gang Scheduling | ✅ Cross-plugin | ✅ Basic | ✅ Basic | ❌ | ❌ |
| Admission Control | ✅ Via Kueue | ✅ Built-in | ✅ Built-in | ✅ Core feature | ResourceQuota |
| Best For | Multi-tenant heterogeneous GPU | Batch jobs | Large multi-tenant | Quota management | Simple workloads |
📖 Detailed Comparison | vs Upstream | Competitive Advantage
- User Guide: docs/USER_GUIDE.md
- GPU Cluster Setup: docs/GPU_CLUSTER_GUIDE.md
- NUMA Scheduling: docs/NUMA_SCHEDULING_GUIDE.md
- Features Deep Dive: docs/FEATURES.md
- Architecture: docs/ARCHITECTURE.md
- Testing Guide: docs/TESTING_GUIDE.md
- Kubeflow Integration: docs/KUBEFLOW_INTEGRATION.md
- Spark Integration: docs/SPARK_OPERATOR_INTEGRATION.md
v0.2 (Current):
- ✅ ProfileClassifier (tenant + workload classification)
- ✅ Gang scheduling with cross-plugin awareness
- ✅ NUMA topology scheduling
- ✅ Network fabric-aware placement
- ✅ Kueue integration (layered architecture)

v0.3 (Next - Topology & Placement):
- ✅ DRA (Dynamic Resource Allocation) integration for GPU topology discovery
- ✅ 3-tier fallback strategy (DRA → NFD → Manual Labels) for K8s version compatibility
- ⏳ Enhanced preemption with checkpoint/restore awareness
- ⏳ Multi-cluster scheduling federation
- ⏳ Advanced metrics & observability dashboards
v0.4+ (Advanced Placement):
- GPU time-slicing with topology awareness
- NFD integration for auto-discovery
- NodeResourceTopology CRD support
- Enhanced Kueue interop (read ClusterQueue quotas for better backfill)
Not on Roadmap (Use Kueue Instead):
- ❌ DRF & Weighted Fair Share → use Kueue ClusterQueue fair sharing
- ❌ Global quota enforcement → use Kueue ResourceQuota/ResourceFlavor
- ❌ Admission control → use Kueue admission policies
Architectural Principle: KubeNexus optimizes placement (WHERE on nodes), Kueue handles admission (WHO gets resources). Don't reinvent the wheel.
We welcome contributions! See CONTRIBUTING.md for guidelines.
Issues, PRs, and feedback: github.com/kubenexus/scheduler
- Documentation: docs/
- Discussions: GitHub Discussions
- Issues: GitHub Issues
- Security: SECURITY.md
Apache License 2.0 - See LICENSE for details.