A GpuClaim defines a declarative GPU allocation request.
- API Group: `gpu.scheduling/v1`
- Kind: `GpuClaim`
- Scope: Namespaced
- Short Name: `gclaim`
Describes GPU requirements.
| Field | Type | Description | Example |
|---|---|---|---|
| `count` | int | Number of GPUs needed | `2` |
| `policy` | string | Allocation strategy: `contiguous`, `spread`, or `preferIds` | `"contiguous"` |
| `preferIds` | []int | Specific GPU IDs to prefer (used with the `preferIds` policy) | `[0, 1]` |
| `exclusivity` | string | Sharing mode: `Exclusive`, `Shared`, or `MIG` | `"Exclusive"` |
Policy Details:

- `contiguous`: Allocate GPUs with adjacent IDs (0,1,2 rather than 0,2,4). Best for workloads with heavy GPU-to-GPU communication.
- `spread`: Spread GPUs across different islands/buses. Best for independent parallel tasks.
- `preferIds`: Try to allocate the specific GPU IDs listed; falls back to other GPUs if they are not available.
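To illustrate what the `contiguous` strategy implies, here is a minimal sketch of a contiguous picker over a sorted list of free GPU IDs. The function name and shape are hypothetical, not the scheduler's actual internals:

```go
package main

import "fmt"

// findContiguous returns the first run of `count` adjacent IDs from the
// sorted list of free GPU IDs, or nil if no such run exists (the caller
// would then fall back to another strategy or fail the claim).
func findContiguous(free []int, count int) []int {
	for start := 0; start+count <= len(free); start++ {
		run := true
		for i := 1; i < count; i++ {
			if free[start+i] != free[start]+i {
				run = false
				break
			}
		}
		if run {
			return free[start : start+count]
		}
	}
	return nil
}

func main() {
	// GPUs 0, 1, and 5 are busy; 2, 3, 4, 6 are free.
	fmt.Println(findContiguous([]int{2, 3, 4, 6}, 3)) // [2 3 4]
	fmt.Println(findContiguous([]int{2, 3, 4, 6}, 4)) // [] — no adjacent run of 4
}
```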
Exclusivity Details:

- `Exclusive`: GPU dedicated to one pod (recommended)
- `Shared`: Multiple pods can share the GPU (no isolation guarantees)
- `MIG`: Multi-Instance GPU mode (not yet implemented)
Node selector to target specific nodes.
| Field | Type | Description | Example |
|---|---|---|---|
| `matchLabels` | map[string]string | Label selector for nodes | `{"gpu-type": "a100"}` |
NVLink bandwidth preferences.
| Field | Type | Description | Example |
|---|---|---|---|
| `mode` | string | Requirement level: `Required`, `Preferred`, or `Ignore` | `"Preferred"` |
| `minBandwidthGBps` | int | Minimum interconnect bandwidth in GB/s | `400` |
Mode Details:

- `Required`: Pod won't schedule if the topology requirements aren't met
- `Preferred`: Try to meet the requirements, but schedule anyway if not possible
- `Ignore`: Don't consider topology
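The three modes map naturally onto a filter/score decision. The sketch below is illustrative (hypothetical names, simplified return values), not the plugin's actual code:

```go
package main

import "fmt"

// topologyDecision reports whether a node remains schedulable and how much
// its score should be boosted, given the claim's topology mode and whether
// the node actually meets the bandwidth requirement.
func topologyDecision(mode string, requirementMet bool) (schedulable bool, scoreBoost int) {
	switch mode {
	case "Required":
		// Filter the node out entirely when the requirement is unmet.
		return requirementMet, 0
	case "Preferred":
		// Always schedulable; meeting the requirement earns a higher score.
		if requirementMet {
			return true, 100
		}
		return true, 0
	default: // "Ignore" (or unset): topology plays no role.
		return true, 0
	}
}

func main() {
	ok, _ := topologyDecision("Required", false)
	fmt.Println(ok) // false: the pod won't schedule on this node
	ok, boost := topologyDecision("Preferred", false)
	fmt.Println(ok, boost) // true 0: scheduled anyway, without a score boost
}
```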
Reference to a gang/pod-group for multi-pod scheduling.
| Field | Type | Description | Example |
|---|---|---|---|
| `gangRef` | string | Name of the pod group | `"training-job-123"` |
Status: not implemented in the MVP.
Reflects the scheduler's progress for this claim.
| Field | Type | Description | Example |
|---|---|---|---|
| `phase` | string | Current state: `Pending`, `Reserved`, `Bound`, or `Failed` | `"Bound"` |
| `nodeName` | string | Node where the GPUs were allocated | `"node-a"` |
| `gpuIds` | []int | Allocated GPU IDs | `[0, 1]` |
| `allocated` | string | Combined node and GPU info | `"node-a:0,1"` |
| `message` | string | Human-readable status message | `"Successfully allocated"` |
```yaml
apiVersion: gpu.scheduling/v1
kind: GpuClaim
metadata:
  name: single-gpu
  namespace: default
spec:
  devices:
    count: 1
    exclusivity: Exclusive
```

```yaml
apiVersion: gpu.scheduling/v1
kind: GpuClaim
metadata:
  name: training-gpus
  namespace: ml-workloads
spec:
  devices:
    count: 4
    policy: contiguous
    exclusivity: Exclusive
  topology:
    mode: Preferred
    minBandwidthGBps: 400
  selector:
    matchLabels:
      gpu-type: a100
      nvlink: "true"
```

```yaml
apiVersion: gpu.scheduling/v1
kind: GpuClaim
metadata:
  name: specific-gpus
spec:
  devices:
    count: 2
    policy: preferIds
    preferIds: [2, 3]
    exclusivity: Exclusive
```

Reports per-node GPU inventory and health. Created and updated by the agent DaemonSet.
- API Group: `gpu.scheduling/v1`
- Kind: `GpuNodeStatus`
- Scope: Cluster
- Short Name: `gns`
| Field | Type | Description | Example |
|---|---|---|---|
| `nodeName` | string | Kubernetes node name | `"node-a"` |
| Field | Type | Description |
|---|---|---|
| `devices` | []Device | List of GPU devices on the node |
| `total` | int | Total number of GPUs |
| Field | Type | Description | Example |
|---|---|---|---|
| `id` | int | GPU device ID | `0` |
| `inUseBy` | []string | Pod UIDs using this GPU | `["abc-123", "def-456"]` |
| `health` | string | Health status: `Healthy`, `Unhealthy`, or `Unknown` | `"Healthy"` |
| `bandwidthGBps` | int | NVLink bandwidth to peers in GB/s | `400` |
| `island` | string | NVLink island identifier | `"nvlink-group-0"` |
Island: GPUs in the same island have high-speed interconnect (NVLink). GPUs in different islands communicate through PCIe (slower).
```yaml
apiVersion: gpu.scheduling/v1
kind: GpuNodeStatus
metadata:
  name: node-a
spec:
  nodeName: node-a
status:
  total: 8
  devices:
    - id: 0
      health: Healthy
      bandwidthGBps: 400
      island: nvlink-group-0
      inUseBy: ["pod-abc-123"]
    - id: 1
      health: Healthy
      bandwidthGBps: 400
      island: nvlink-group-0
      inUseBy: []
    - id: 2
      health: Healthy
      bandwidthGBps: 400
      island: nvlink-group-0
      inUseBy: []
    - id: 3
      health: Healthy
      bandwidthGBps: 400
      island: nvlink-group-0
      inUseBy: []
    - id: 4
      health: Healthy
      bandwidthGBps: 200
      island: nvlink-group-1
      inUseBy: []
    - id: 5
      health: Healthy
      bandwidthGBps: 200
      island: nvlink-group-1
      inUseBy: []
    - id: 6
      health: Healthy
      bandwidthGBps: 200
      island: nvlink-group-1
      inUseBy: []
    - id: 7
      health: Unhealthy
      bandwidthGBps: 0
      island: nvlink-group-1
      inUseBy: []
```

In this example:
- GPUs 0-3 are in one NVLink island (400 GB/s interconnect)
- GPUs 4-7 are in another island (200 GB/s interconnect)
- GPU 7 is unhealthy and shouldn't be allocated
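Reasoning over such an inventory, e.g. checking whether a claim fits inside a single island of healthy, free GPUs, can be sketched as follows. The `device` struct and helper are illustrative, not the controller's actual types:

```go
package main

import "fmt"

// device mirrors the per-GPU fields from the GpuNodeStatus example.
type device struct {
	id      int
	health  string
	island  string
	inUseBy []string
}

// fitsInOneIsland reports whether any single island has `count` healthy,
// unallocated GPUs, returning the chosen IDs if so.
func fitsInOneIsland(devs []device, count int) ([]int, bool) {
	byIsland := map[string][]int{}
	for _, d := range devs {
		// Skip GPUs that are unhealthy or already in use by a pod.
		if d.health == "Healthy" && len(d.inUseBy) == 0 {
			byIsland[d.island] = append(byIsland[d.island], d.id)
		}
	}
	for _, ids := range byIsland {
		if len(ids) >= count {
			return ids[:count], true
		}
	}
	return nil, false
}

func main() {
	devs := []device{
		{0, "Healthy", "nvlink-group-0", []string{"pod-abc-123"}}, // busy
		{1, "Healthy", "nvlink-group-0", nil},
		{2, "Healthy", "nvlink-group-0", nil},
		{3, "Healthy", "nvlink-group-0", nil},
		{7, "Unhealthy", "nvlink-group-1", nil}, // excluded: unhealthy
	}
	ids, ok := fitsInOneIsland(devs, 3)
	fmt.Println(ids, ok) // [1 2 3] true
}
```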
- Set by: User
- Read by: Scheduler
- Purpose: Links a pod to a GpuClaim
Example:

```yaml
metadata:
  annotations:
    gpu.scheduling/claim: my-gpu-request
```

- Set by: Scheduler (PreBind phase)
- Read by: Webhook
- Purpose: Tells the webhook which GPUs were allocated
Format: `{nodeName}:{comma-separated-gpu-ids}`
Examples:

- `node-a:0` (single GPU)
- `node-b:0,1,2,3` (multiple GPUs)
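Parsing this annotation value back into a node name and GPU IDs is straightforward. The helper below is a sketch, not part of the project's API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAllocated splits an allocation string like "node-b:0,1,2,3"
// into the node name and the list of GPU IDs.
func parseAllocated(v string) (node string, ids []int, err error) {
	node, idPart, ok := strings.Cut(v, ":")
	if !ok {
		return "", nil, fmt.Errorf("malformed allocation %q", v)
	}
	for _, s := range strings.Split(idPart, ",") {
		id, err := strconv.Atoi(s)
		if err != nil {
			return "", nil, fmt.Errorf("bad GPU ID in %q: %w", v, err)
		}
		ids = append(ids, id)
	}
	return node, ids, nil
}

func main() {
	node, ids, _ := parseAllocated("node-b:0,1,2,3")
	fmt.Println(node, ids) // node-b [0 1 2 3]
}
```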
The scheduler uses Kubernetes Coordination Leases for atomic GPU locking.
Format: `gpu-{nodeName}-{gpuId}`
Examples:

- `gpu-node-a-0`
- `gpu-node-b-3`
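Following this format, a minimal name builder might look like the sketch below (an illustrative helper, not the scheduler's actual code):

```go
package main

import "fmt"

// leaseName builds the per-GPU Lease name used for atomic GPU locking.
func leaseName(nodeName string, gpuID int) string {
	return fmt.Sprintf("gpu-%s-%d", nodeName, gpuID)
}

func main() {
	fmt.Println(leaseName("node-a", 0)) // gpu-node-a-0
	fmt.Println(leaseName("node-b", 3)) // gpu-node-b-3
}
```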
| Field | Type | Description |
|---|---|---|
| `holderIdentity` | string | Pod UID that owns the GPU |
- Creation: Scheduler creates the lease in the Reserve phase
- Ownership: Pod UID is stored in `holderIdentity`
- Deletion: Scheduler deletes the lease in the Unreserve phase (on failure), or it is removed manually
Note: Leases currently don't auto-delete when pods are removed. This is a known limitation.
```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: gpu-node-a-0
  namespace: default
spec:
  holderIdentity: "abc-123-def-456"  # Pod UID
```

The scheduler is configured via KubeSchedulerConfiguration.
The GpuClaimPlugin runs in these phases:
| Phase | Purpose |
|---|---|
| PreFilter | Read claim annotation, validate request |
| Filter | Check node selector (currently no-op) |
| Score | Rank nodes by GPU availability and topology |
| Reserve | Atomically acquire GPU leases |
| Unreserve | Release leases on failure |
| PreBind | Annotate pod with allocation |
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-scheduler
    plugins:
      preFilter:
        enabled:
          - name: GpuClaimPlugin
      filter:
        enabled:
          - name: GpuClaimPlugin
      score:
        enabled:
          - name: GpuClaimPlugin
      reserve:
        enabled:
          - name: GpuClaimPlugin
      preBind:
        enabled:
          - name: GpuClaimPlugin
```

The webhook mutates pods that have the `gpu.scheduling/allocated` annotation.
- Endpoint: `/mutate`
- Port: 8443 (HTTPS)
- Failure Policy: `Fail` (the pod won't be created if the webhook fails)
The webhook adds a `CUDA_VISIBLE_DEVICES` environment variable to every container in the pod.
Example:

```yaml
containers:
  - name: training
    env:
      - name: CUDA_VISIBLE_DEVICES
        value: "0,1,2"
```

This tells the CUDA runtime which GPUs the container can see.
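Conversely, the injected value can be derived from the allocated GPU IDs with a small helper. This is a sketch, not the webhook's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// visibleDevices renders allocated GPU IDs as a CUDA_VISIBLE_DEVICES value.
func visibleDevices(ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(visibleDevices([]int{0, 1, 2})) // 0,1,2
}
```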
```bash
# List claims
kubectl get gpuclaim
kubectl get gclaim        # short form

# Describe a claim
kubectl describe gpuclaim my-claim

# Get claim status
kubectl get gpuclaim my-claim -o jsonpath='{.status}'

# List node GPU status
kubectl get gpunodestatus
kubectl get gns           # short form

# Get detailed node GPU info
kubectl get gns node-a -o yaml

# List GPU leases
kubectl get leases | grep gpu-

# Delete a specific lease
kubectl delete lease gpu-node-a-0

# Watch claims
kubectl get gclaim -w
```

```bash
# Install
helm install gpu-scheduler charts/gpu-scheduler

# Install with custom values
helm install gpu-scheduler charts/gpu-scheduler \
  --set scheduler.image.tag=v0.2.0

# Upgrade
helm upgrade gpu-scheduler charts/gpu-scheduler

# Uninstall
helm uninstall gpu-scheduler

# View values
helm get values gpu-scheduler
```