This document explains how KubeNexus Scheduler works with Kubernetes Operators and their Custom Resource Definitions (CRDs), such as Spark Operator, Kubeflow Training Operator, and others.
Critical Understanding:
- Operators create CRDs (SparkApplication, PyTorchJob, etc.)
- Operators watch their CRDs and create Pods based on specs
- Schedulers only schedule Pods, never CRDs
User creates CRD
↓
Operator watches CRD
↓
Operator creates Pods (with labels from CRD)
↓
Scheduler schedules Pods ← KubeNexus works here!
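The flow above can be sketched in a few lines of Python. This is an illustrative sketch only (pods modeled as plain dictionaries; a real scheduler plugin receives typed Pod objects from the API server), showing how gang membership can be derived purely from Pod labels, without ever reading the parent CRD:

```python
# Sketch: a label-driven scheduler derives gang membership from Pod
# metadata alone. The label key matches the one used in this document.
GROUP_LABEL = "pod-group.scheduling.kubenexus.io/name"

def gang_key(pod):
    """Return (namespace, group-name) for a pod, or None if the pod
    has not opted in to gang scheduling."""
    labels = pod.get("metadata", {}).get("labels", {})
    name = labels.get(GROUP_LABEL)
    if name is None:
        return None  # regular pod: schedule individually
    return (pod["metadata"].get("namespace", "default"), name)

driver = {"metadata": {"namespace": "spark-jobs",
                       "labels": {GROUP_LABEL: "spark-pi"}}}
plain = {"metadata": {"namespace": "spark-jobs", "labels": {}}}

print(gang_key(driver))  # ('spark-jobs', 'spark-pi')
print(gang_key(plain))   # None
```

The point: the operator-created CRD never enters the picture; only the labels it stamped onto the Pods do.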
# 1. User creates SparkApplication CRD
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  driver:
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
  executor:
    instances: 10
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
# 2. Spark Operator creates Pods
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver
  labels:
    pod-group.scheduling.kubenexus.io/name: "spark-pi"  # Inherited!
    spark-role: driver
spec:
  schedulerName: kubenexus-scheduler  # Set by operator or user
# 3. KubeNexus schedules the Pod (not the CRD!)

Volcano uses its own CRDs:
# Volcano's approach: Wrap everything in VolcanoJob CRD
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 11
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: driver
      template:
        spec:
          containers:
            - name: spark
    - replicas: 10
      name: executor
      template:
        spec:
          containers:
            - name: spark

Pros:
- ✅ Unified API for all workloads
- ✅ Rich gang scheduling features
- ✅ Status tracking and lifecycle management
Cons:
- ❌ Requires users to learn Volcano CRDs
- ❌ Doesn't work directly with Spark/Kubeflow operators
- ❌ Need integration layer (webhooks, controllers)
Volcano + Spark Operator Integration:
# Spark Operator creates pods with Volcano labels
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  batchScheduler: "volcano"  # Operator injects volcano.sh/* labels
  driver:
    # Operator automatically adds:
    # labels:
    #   volcano.sh/job-name: spark-pi
    #   volcano.sh/queue-name: default

YuniKorn uses an application abstraction:
# YuniKorn expects applicationId label on pods
labels:
  applicationId: "spark-app-001"
  queue: "root.default"

Operator Integration:
- Spark Operator: Built-in YuniKorn support (adds labels)
- Kubeflow: Training operator adds labels automatically
- Custom: Users add labels to CRD templates
Kueue uses Workload CRD:
# Kueue wraps other resources in a Workload
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: spark-job
spec:
  podSets:
    - count: 11
      name: spark

Integration:
- Webhook intercepts Pod creation
- Creates Workload CRD automatically
- Manages queue admission
"Work with existing operators, not against them"
- No custom CRDs required - Use standard Kubernetes labels
- Operator-agnostic - Works with any operator that creates pods
- Opt-in - Only affects pods with KubeNexus labels
- Simple - No additional API objects to manage
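The "opt-in" principle above can be expressed as a simple predicate. The sketch below is an assumption about how such a check might look (field names follow the Pod API shape; the helper name is hypothetical): gang semantics apply only to pods that both name KubeNexus as their scheduler and carry the gang label.

```python
# Sketch: KubeNexus applies gang semantics only to pods that opt in
# via schedulerName AND the gang label. All other pods are untouched.
GROUP_LABEL = "pod-group.scheduling.kubenexus.io/name"

def is_kubenexus_gang_pod(pod):
    spec = pod.get("spec", {})
    labels = pod.get("metadata", {}).get("labels", {})
    return (spec.get("schedulerName") == "kubenexus-scheduler"
            and GROUP_LABEL in labels)

opted_in = {"metadata": {"labels": {GROUP_LABEL: "spark-pi"}},
            "spec": {"schedulerName": "kubenexus-scheduler"}}
default_pod = {"metadata": {"labels": {}},
               "spec": {"schedulerName": "default-scheduler"}}

print(is_kubenexus_gang_pod(opted_in))    # True
print(is_kubenexus_gang_pod(default_pod)) # False
```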
# Step 1: User configures Operator CRD with KubeNexus labels
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  driver:
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler
  executor:
    instances: 10
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler
# Step 2: Operator creates Pods with these labels
# (No changes to operator needed!)
# Step 3: KubeNexus scheduler sees the pods and applies gang scheduling

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  driver:
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler
  executor:
    instances: 10
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler

Works because: Spark Operator passes labels from CRD to Pods
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      template:
        metadata:
          labels:
            pod-group.scheduling.kubenexus.io/name: "mnist"
            pod-group.scheduling.kubenexus.io/min-available: "9"
        spec:
          schedulerName: kubenexus-scheduler
    Worker:
      replicas: 8
      template:
        metadata:
          labels:
            pod-group.scheduling.kubenexus.io/name: "mnist"
            pod-group.scheduling.kubenexus.io/min-available: "9"
        spec:
          schedulerName: kubenexus-scheduler

Works because: Training Operator allows custom labels on pod templates
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: distributed-workflow
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: train
            templateRef:
              name: pytorch-training
              template: worker
            arguments:
              parameters:
                - name: replicas
                  value: "8"
  # Gang scheduling via pod metadata
  podSpecPatch: |
    metadata:
      labels:
        pod-group.scheduling.kubenexus.io/name: "workflow-gang"
        pod-group.scheduling.kubenexus.io/min-available: "8"
    spec:
      schedulerName: kubenexus-scheduler

Works because: Argo allows pod spec patches
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  headGroupSpec:
    template:
      metadata:
        labels:
          pod-group.scheduling.kubenexus.io/name: "ray-cluster"
          pod-group.scheduling.kubenexus.io/min-available: "11"
      spec:
        schedulerName: kubenexus-scheduler
  workerGroupSpecs:
    - replicas: 10
      template:
        metadata:
          labels:
            pod-group.scheduling.kubenexus.io/name: "ray-cluster"
            pod-group.scheduling.kubenexus.io/min-available: "11"
        spec:
          schedulerName: kubenexus-scheduler

Works because: Ray Operator supports custom pod templates
| Approach | Volcano | YuniKorn | Kueue | KubeNexus |
|---|---|---|---|---|
| CRD Required | Yes (VolcanoJob) | No | Yes (Workload) | No |
| Operator Integration | Need webhooks | Labels only | Webhook | Labels only |
| Works Today | Need changes | ✅ Yes | Need changes | ✅ Yes |
| Complexity | High | Medium | Medium | Low |
| Flexibility | High | Medium | High | Medium |
| Learning Curve | Steep | Gentle | Steep | Gentle |
SparkApplication → Spark Operator → Driver Pod + Executor Pods
PyTorchJob → Training Operator → Master Pod + Worker Pods
Workflow → Argo → Task Pods
KubeNexus insight: We don't need to wrap these—just schedule the pods they create!
# All we need:
labels:
  pod-group.scheduling.kubenexus.io/name: "my-job"
  pod-group.scheduling.kubenexus.io/min-available: "8"
# This gives us:
✅ Gang scheduling
✅ Pod grouping
✅ Query capability (kubectl get pods -l pod-group.scheduling.kubenexus.io/name=my-job)
✅ Operator compatibility

What CRDs would give us:
- Validation (can use admission webhooks instead)
- Status tracking (operators already do this)
- Lifecycle management (operators already do this)
What CRDs would cost us:
- Installation complexity
- API version management
- Breaking changes risk
- Operator integration burden
- v1.0: Labels only (current)
- v1.5: Optional webhook for auto-injection
- v2.0: Optional CRD for advanced features (backward compatible)
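The v1.5 auto-injection webhook on the roadmap could emit a JSON Patch (RFC 6902) that adds the gang labels to each pod at admission time. The sketch below is an assumption about its shape, not the actual implementation; the min-available count is hard-coded here for illustration, whereas a real webhook would derive it from the owning SparkApplication spec.

```python
# Sketch: the JSON Patch a mutating admission webhook might return to
# inject the gang labels. Note that '/' inside a label key must be
# escaped as '~1' in a JSON Patch path (RFC 6902).
import json

def gang_patch(group_name, min_available):
    name_key = "pod-group.scheduling.kubenexus.io~1name"
    min_key = "pod-group.scheduling.kubenexus.io~1min-available"
    return [
        {"op": "add",
         "path": f"/metadata/labels/{name_key}",
         "value": group_name},
        {"op": "add",
         "path": f"/metadata/labels/{min_key}",
         "value": str(min_available)},  # label values must be strings
    ]

patch = gang_patch("spark-pi", 1 + 10)  # 1 driver + 10 executors
print(json.dumps(patch, indent=2))
```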
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.5.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar"
  # Spark configuration
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  # Driver configuration with KubeNexus labels
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      # Gang scheduling labels
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"  # 1 driver + 10 executors
      version: "3.5.0"
      app: spark-pi
    annotations:
      # NUMA scheduling (optional)
      numa.scheduling.kubenexus.io/policy: "best-effort"
    serviceAccount: spark
    # Use KubeNexus scheduler
    schedulerName: kubenexus-scheduler
  # Executor configuration with KubeNexus labels
  executor:
    cores: 1
    instances: 10
    memory: "2g"
    labels:
      # MUST match driver gang name!
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
      version: "3.5.0"
      app: spark-pi
    annotations:
      numa.scheduling.kubenexus.io/policy: "best-effort"
    # Use KubeNexus scheduler
    schedulerName: kubenexus-scheduler

1. User creates SparkApplication CRD
↓
2. Spark Operator watches and validates
↓
3. Operator creates:
- spark-pi-driver Pod (with gang labels)
- spark-pi-exec-1 Pod (with gang labels)
- spark-pi-exec-2 Pod (with gang labels)
- ... (10 total executor pods)
↓
4. KubeNexus sees 11 pods with pod-group.scheduling.kubenexus.io/name=spark-pi
↓
5. KubeNexus waits until ALL 11 pods can be scheduled
↓
6. KubeNexus schedules all 11 pods atomically
↓
7. Spark job runs successfully!
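Steps 4 through 6 of the flow above can be sketched as a gang-readiness check. This is a minimal sketch under simplifying assumptions: pods are plain dictionaries, and node-feasibility checking is omitted entirely (a real scheduler would also verify that every member of the gang has a placement before committing any of them).

```python
# Sketch: group pending pods by gang label and release a group only
# once at least `min-available` members exist. Placement feasibility
# is stubbed out for brevity.
GROUP = "pod-group.scheduling.kubenexus.io/name"
MIN_AVAIL = "pod-group.scheduling.kubenexus.io/min-available"

def schedulable_gangs(pending_pods):
    groups = {}
    for pod in pending_pods:
        labels = pod["metadata"]["labels"]
        if GROUP in labels:
            groups.setdefault(labels[GROUP], []).append(pod)
    ready = []
    for name, pods in groups.items():
        min_available = int(pods[0]["metadata"]["labels"][MIN_AVAIL])
        if len(pods) >= min_available:  # all members have been created
            ready.append(name)          # schedule these atomically
    return ready

def make_pod(group, min_avail):
    return {"metadata": {"labels": {GROUP: group,
                                    MIN_AVAIL: str(min_avail)}}}

# 1 driver + 10 executors, min-available = 11
pods = [make_pod("spark-pi", 11) for _ in range(11)]
print(schedulable_gangs(pods))      # ['spark-pi']
print(schedulable_gangs(pods[:5]))  # [] -> gang not complete, all wait
```

Until the operator has created all 11 pods, the group stays pending, which is exactly the behavior the flow describes.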
# Check SparkApplication
kubectl get sparkapplication -n spark-jobs
# Check pods (should all be scheduled together)
kubectl get pods -n spark-jobs -l pod-group.scheduling.kubenexus.io/name=spark-pi
# Check gang status
kubectl get pods -n spark-jobs -l pod-group.scheduling.kubenexus.io/name=spark-pi \
-o custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName
# All should show "Running" on nodes at the same time

Volcano:
spec:
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: default
    priorityClassName: high

KubeNexus:
spec:
  driver:
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler
  executor:
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler

YuniKorn:
spec:
  driver:
    labels:
      applicationId: "spark-pi"
      queue: "root.spark"
    schedulerName: yunikorn

KubeNexus:
spec:
  driver:
    labels:
      pod-group.scheduling.kubenexus.io/name: "spark-pi"
      pod-group.scheduling.kubenexus.io/min-available: "11"
    schedulerName: kubenexus-scheduler

Auto-inject labels into operator-created pods:
# User creates simple SparkApplication
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  annotations:
    kubenexus.io/gang-scheduling: "true"  # Opt-in
spec:
  driver:
    schedulerName: kubenexus-scheduler
  executor:
    instances: 10
    schedulerName: kubenexus-scheduler
# Webhook automatically adds:
# labels:
#   pod-group.scheduling.kubenexus.io/name: "spark-pi"
#   pod-group.scheduling.kubenexus.io/min-available: "11"

For advanced features:
apiVersion: scheduling.kubenexus.io/v1alpha1
kind: PodGroup
metadata:
  name: spark-pi
spec:
  minMember: 11
  scheduleTimeoutSeconds: 300
  priorityClassName: high
  queue: spark-queue

Backward compatible: Labels still work!
- ✅ Operators create pods from their CRDs
- ✅ Pods inherit labels from CRD specs
- ✅ KubeNexus schedules pods (not CRDs)
- ✅ No operator changes needed
- ✅ Spark Operator
- ✅ Kubeflow Training Operator (PyTorchJob, TFJob, MPIJob, etc.)
- ✅ Argo Workflows
- ✅ Ray Operator
- ✅ Any operator that allows custom pod labels
"Schedulers schedule pods, not CRDs"
This makes KubeNexus:
- Simple to use
- Compatible with existing operators
- Easy to adopt
- No vendor lock-in
Last Updated: February 2026