The GPU scheduler is a Kubernetes extension that provides smart GPU allocation for workloads. It has three main components that work together:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Scheduler  │     │   Webhook   │     │    Agent    │
│  (Plugin)   │     │  (Mutator)  │     │ (DaemonSet) │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│                Kubernetes API Server                 │
│   - GpuClaim CRDs                                    │
│   - GpuNodeStatus CRDs                               │
│   - Coordination Leases (for GPU locking)            │
└──────────────────────────────────────────────────────┘
```
- User creates a GpuClaim defining their GPU needs (e.g., "I need 2 GPUs")
- User creates a Pod with an annotation pointing to that GpuClaim
```yaml
apiVersion: gpu.scheduling/v1
kind: GpuClaim
metadata:
  name: my-gpu-request
spec:
  devices:
    count: 2
    policy: contiguous
---
apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  annotations:
    gpu.scheduling/claim: my-gpu-request  # Links to the claim above
spec:
  schedulerName: gpu-scheduler  # Use our custom scheduler
  containers:
  - name: training
    image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
```

The scheduler plugin runs through several phases:
- **PreFilter**: reads the `gpu.scheduling/claim` annotation, validates that the claim exists, and stores the request details (how many GPUs are needed)
- **Filter**: checks which nodes match the requirements (currently allows all nodes in the MVP)
- **Score**: ranks nodes based on GPU availability, preferring nodes with contiguous GPUs in the same NVLink island (currently returns a static score; topology scoring is TODO)
- **Reserve**: atomically acquires GPU leases on the chosen node:
  - For each GPU ID (0-15), tries to create a Kubernetes Lease object named `gpu-{nodeName}-{gpuID}`
  - If the lease already exists, that GPU is busy → try the next ID
  - If not enough GPUs are available, rolls back all acquired leases
  - This is how we prevent double-booking GPUs!
- **PreBind**: adds the annotation `gpu.scheduling/allocated: node-a:0,1` to the pod, which tells the webhook which GPUs were assigned
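The lease-acquisition step above can be modeled in a few lines. This is a minimal Python sketch, not the real plugin: the `LeaseStore` class and `reserve` function are hypothetical names, and the API server's create-or-conflict semantics are simulated with a set.

```python
class AlreadyExists(Exception):
    """Stands in for the API server's 409 Conflict when a Lease already exists."""

class LeaseStore:
    """Simulated API server: creating an existing Lease fails atomically."""
    def __init__(self):
        self.leases = set()

    def create(self, name):
        if name in self.leases:
            raise AlreadyExists(name)
        self.leases.add(name)

    def delete(self, name):
        self.leases.discard(name)

def reserve(store, node, count, total_gpus=16):
    """Try to lock `count` GPUs on `node`; roll back everything on failure."""
    acquired = []
    for gpu_id in range(total_gpus):
        name = f"gpu-{node}-{gpu_id}"      # lease name format: gpu-{nodeName}-{gpuID}
        try:
            store.create(name)
            acquired.append(gpu_id)
        except AlreadyExists:
            continue                        # GPU busy -> try the next ID
        if len(acquired) == count:
            return acquired
    for gpu_id in acquired:                 # not enough GPUs: roll back all leases
        store.delete(f"gpu-{node}-{gpu_id}")
    return None
```

For example, if GPU 0 is already leased, reserving two GPUs on `node-a` would skip it and return `[1, 2]`; if the node cannot satisfy the request, every lease taken so far is released.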
When the pod is about to be created:
- The webhook sees the `gpu.scheduling/allocated` annotation
- It parses the value: `node-a:0,1` means GPUs 0 and 1 on node-a
- It injects `CUDA_VISIBLE_DEVICES=0,1` into all containers
- The NVIDIA runtime uses this variable to restrict the container to only those GPUs
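The webhook's parse-and-inject step can be sketched in Python (the real webhook is a mutating admission controller; the function names here are illustrative):

```python
def parse_allocation(value):
    """Parse 'node-a:0,1' into (node, [gpu_ids])."""
    node, ids = value.split(":", 1)
    return node, [int(i) for i in ids.split(",")]

def inject_env(pod, allocation):
    """Add CUDA_VISIBLE_DEVICES to every container in the pod spec."""
    _, gpu_ids = parse_allocation(allocation)
    value = ",".join(str(i) for i in gpu_ids)
    for container in pod["spec"]["containers"]:
        env = container.setdefault("env", [])
        env.append({"name": "CUDA_VISIBLE_DEVICES", "value": value})
    return pod
```

With the allocation `node-a:0,1`, every container in the pod ends up with `CUDA_VISIBLE_DEVICES=0,1`.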
The agent runs as a DaemonSet on each node:
- Discovers available GPUs (currently a placeholder; NVML integration is TODO)
- Creates/updates a `GpuNodeStatus` resource every 30 seconds
- Reports GPU health, NVLink topology, and which pods are using which GPUs
We use Kubernetes Coordination Leases for atomic GPU allocation:
- Atomic: Creating a lease either succeeds (GPU is ours) or fails (GPU already taken)
- Simple: No need for custom locking mechanisms
- Kubernetes-native: Uses built-in resources
- Automatic cleanup: Leases can have expiration times
Annotations connect the scheduler and webhook:
- `gpu.scheduling/claim`: User → Scheduler (which claim to use)
- `gpu.scheduling/allocated`: Scheduler → Webhook (which GPUs were assigned)
This decouples the two components while keeping them synchronized.
The split into three components follows from where each one needs to run:
- Scheduler Plugin: needs deep integration with the Kubernetes scheduling framework
- Webhook: a separate service for admission control (can scale independently)
- Agent: runs on each node to discover local GPU hardware
```
User creates Pod with claim annotation
                 ↓
Scheduler reads claim, finds available GPUs
                 ↓
Scheduler creates Leases (locks GPUs)
                 ↓
Scheduler adds "allocated" annotation to Pod
                 ↓
Webhook sees "allocated" annotation
                 ↓
Webhook injects CUDA_VISIBLE_DEVICES env var
                 ↓
Pod runs with correct GPUs visible
```
If scheduling fails partway:
- The scheduler's `Unreserve` phase runs
- All acquired leases are deleted
- GPUs become available for other pods

If a pod is deleted (or completes):
- Leases remain (they're not automatically tied to the pod lifecycle)
- Garbage collection (TODO) or lease expiration is needed

If a node goes down:
- The agent stops reporting
- Leases remain until explicitly cleaned up
- This is a known limitation of the MVP
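One possible shape for the missing garbage collection (hypothetical; not in the MVP) is a periodic sweep that compares existing GPU leases against live pods and deletes leases whose holder is gone:

```python
def stale_leases(leases, live_pods):
    """Return lease names whose holder pod no longer exists.

    `leases` maps lease name -> holder pod name (an assumed holder-identity
    convention); `live_pods` is the set of pods that still exist.
    """
    return sorted(name for name, holder in leases.items() if holder not in live_pods)
```

A controller could run this every sweep interval and delete the returned leases; alternatively, lease expiration times (mentioned above) would make the API server age them out without a sweeper.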
The system tracks GPU topology through GpuNodeStatus:
```yaml
status:
  devices:
  - id: 0
    island: "nvlink-group-0"  # GPUs in the same island have fast interconnect
    bandwidthGBps: 600
  - id: 1
    island: "nvlink-group-0"
    bandwidthGBps: 600
  - id: 2
    island: "nvlink-group-1"  # Different island = slower communication
    bandwidthGBps: 64
```

The `contiguous` policy prefers GPUs 0,1 over 0,2 (same island, better interconnect).
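Since topology scoring is still TODO in the scheduler, here is only one way it could work: count how many NVLink islands a candidate GPU set spans, and prefer the set spanning the fewest. The function names are illustrative.

```python
def islands_spanned(devices, gpu_ids):
    """Count distinct NVLink islands used by a candidate GPU set.

    `devices` follows the GpuNodeStatus device list shape above;
    fewer islands means faster interconnect between the chosen GPUs.
    """
    island_of = {d["id"]: d["island"] for d in devices}
    return len({island_of[i] for i in gpu_ids})

def prefer(devices, candidates):
    """Pick the candidate GPU set that spans the fewest islands."""
    return min(candidates, key=lambda ids: islands_spanned(devices, ids))
```

With the topology above, GPUs 0,1 span one island while 0,2 span two, so the contiguous preference picks 0,1.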
The GpuClaim has a `gangRef` field for multi-pod workloads:

```yaml
spec:
  devices:
    count: 4
  gangRef: "my-distributed-training-job"
```

All pods in the gang must be schedulable together, or none run. This prevents deadlocks in distributed training.

Status: Not implemented in the MVP.
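Because gang scheduling is not implemented, the following is only a sketch of the all-or-nothing check it implies: verify that every pod in the gang has a feasible placement before taking any lease. The greedy first-fit strategy and all names here are assumptions, not the project's design.

```python
def gang_schedulable(pods_needed, free_gpus_per_node):
    """All-or-nothing check: can every pod's GPU request be placed at once?

    `pods_needed` lists the GPU count each gang member requests;
    `free_gpus_per_node` maps node name -> free GPU count.
    Greedy first-fit, largest request first (a simplification).
    """
    free = dict(free_gpus_per_node)  # don't mutate the caller's view
    for need in sorted(pods_needed, reverse=True):
        node = next((n for n, f in free.items() if f >= need), None)
        if node is None:
            return False             # one pod cannot fit -> the whole gang waits
        free[node] -= need
    return True
```

Only if this check passes would the scheduler proceed to reserve leases for the gang; otherwise no member is placed, which is what prevents distributed-training deadlocks.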