- Go 1.24.x
- Docker
- Kubernetes cluster (kind recommended for local development)
- kubectl
- Helm 3
```
gpu-scheduler/
├── api/v1/                  # CRD type definitions
│   ├── gpuclaim_types.go
│   └── gpunodestatus_types.go
├── cmd/                     # Entry points
│   ├── scheduler/main.go    # Scheduler binary
│   ├── webhook/main.go      # Webhook binary
│   └── agent/main.go        # Agent binary
├── internal/
│   ├── plugin/gpuclaim/     # Scheduler plugin implementation
│   ├── lease/               # GPU lease management
│   ├── topo/                # Topology scoring logic
│   └── util/                # Shared utilities
├── charts/gpu-scheduler/    # Helm chart
└── hack/                    # Development scripts
```
```shell
# Scheduler
make docker
# Webhook
make docker-webhook
# Agent
make docker-agent
```

The Dockerfile uses a `CMD_PATH` build argument to select which binary to build:

```shell
# Custom image tags
docker build --build-arg CMD_PATH=cmd/scheduler \
  -t my-registry/gpu-scheduler:dev .
docker build --build-arg CMD_PATH=cmd/webhook \
  -t my-registry/gpu-webhook:dev .
docker build --build-arg CMD_PATH=cmd/agent \
  -t my-registry/gpu-agent:dev .
```
```shell
# Scheduler
go build -o bin/scheduler ./cmd/scheduler
# Webhook
go build -o bin/webhook ./cmd/webhook
# Agent
go build -o bin/agent ./cmd/agent
```

```shell
# Create cluster with GPU support (requires nvidia-docker)
kind create cluster --config hack/kind-cluster.yaml
# Or basic cluster for testing
kind create cluster --name gpu-test
```

```shell
# Build and load images into kind
make docker
kind load docker-image ghcr.io/restack/gpu-scheduler:dev --name gpu-test
make docker-webhook
kind load docker-image ghcr.io/restack/gpu-scheduler-webhook:dev --name gpu-test
make docker-agent
kind load docker-image ghcr.io/restack/gpu-scheduler-agent:dev --name gpu-test
# Deploy
helm install gpu-scheduler charts/gpu-scheduler
```

```shell
# 1. Make code changes
vim internal/plugin/gpuclaim/plugin.go
# 2. Rebuild
make docker
# 3. Reload into kind
kind load docker-image ghcr.io/restack/gpu-scheduler:dev --name gpu-test
# 4. Restart pod
kubectl rollout restart deployment gpu-scheduler
# 5. Check logs
kubectl logs -f deployment/gpu-scheduler
```

```shell
go test ./...
```

```shell
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```

```shell
go test ./internal/lease -v
go test ./internal/plugin/gpuclaim -v
```

Create test workloads:
```shell
# Apply test claim
kubectl apply -f - <<EOF
apiVersion: gpu.scheduling/v1
kind: GpuClaim
metadata:
  name: test-claim
spec:
  devices:
    count: 1
    exclusivity: Exclusive
EOF

# Create test pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    gpu.scheduling/claim: test-claim
spec:
  schedulerName: gpu-scheduler
  restartPolicy: Never
  containers:
  - name: test
    image: busybox
    command: ["sh", "-c", "echo CUDA_VISIBLE_DEVICES=\$CUDA_VISIBLE_DEVICES; sleep 30"]
EOF

# Check results
kubectl logs test-pod
kubectl get pod test-pod -o jsonpath='{.metadata.annotations}'
```
```shell
kubectl logs -f deployment/gpu-scheduler
# With verbosity
kubectl logs -f deployment/gpu-scheduler | grep "v=4"
```

```shell
kubectl logs -f deployment/gpu-scheduler-webhook
```

```shell
# All agents
kubectl logs -f daemonset/gpu-scheduler-agent
# Specific node: find the agent pod pinned to that node, then tail it
kubectl get pods -n default -l name=gpu-scheduler-agent --field-selector spec.nodeName=node-a
kubectl logs -f -n default <agent-pod-name>
```

Edit the deployment to increase log verbosity:

```shell
kubectl edit deployment gpu-scheduler
```

Add to the container args:

```yaml
args:
- --v=4  # Kubernetes logging verbosity
```

```shell
# List all claims with status
kubectl get gpuclaim -o yaml
# Specific claim
kubectl get gpuclaim my-claim -o jsonpath='{.status}'
# Watch claim status changes
kubectl get gpuclaim -w
```

Test the webhook locally:
```shell
# Port forward
kubectl port-forward svc/gpu-scheduler-webhook 8443:443
# Send test admission request (requires valid cert)
curl -k https://localhost:8443/mutate -d @test-admission-request.json
```
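The request body is a standard `admission.k8s.io/v1` AdmissionReview. The repo's `test-admission-request.json` is not reproduced here, but a minimal payload of that shape could look like this (uid and object contents are placeholders):

```json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "request": {
    "uid": "00000000-0000-0000-0000-000000000000",
    "kind": {"group": "", "version": "v1", "kind": "Pod"},
    "resource": {"group": "", "version": "v1", "resource": "pods"},
    "operation": "CREATE",
    "object": {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {
        "name": "test-pod",
        "annotations": {"gpu.scheduling/claim": "test-claim"}
      },
      "spec": {"containers": [{"name": "test", "image": "busybox"}]}
    }
  }
}
```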
```shell
# All GPU leases
kubectl get leases | grep gpu-
# Lease details
kubectl get lease gpu-node-a-0 -o yaml
# Watch lease creation/deletion
kubectl get leases -w | grep gpu-
```
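The `gpu-<node>-<index>` lease names seen above can be generated and parsed with small helpers. The sketch below is self-contained; the function names are illustrative, not the actual API of `internal/lease`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// leaseName builds the per-device lease name used in the examples
// above, e.g. leaseName("node-a", 0) -> "gpu-node-a-0".
func leaseName(node string, index int) string {
	return fmt.Sprintf("gpu-%s-%d", node, index)
}

// parseLeaseName splits a lease name back into node and device index.
// Node names may themselves contain '-', so split on the last one.
func parseLeaseName(name string) (node string, index int, err error) {
	rest, ok := strings.CutPrefix(name, "gpu-")
	if !ok {
		return "", 0, fmt.Errorf("not a gpu lease: %q", name)
	}
	i := strings.LastIndex(rest, "-")
	if i < 0 {
		return "", 0, fmt.Errorf("malformed lease name: %q", name)
	}
	index, err = strconv.Atoi(rest[i+1:])
	if err != nil {
		return "", 0, fmt.Errorf("malformed device index in %q: %v", name, err)
	}
	return rest[:i], index, nil
}

func main() {
	fmt.Println(leaseName("node-a", 0)) // gpu-node-a-0
	node, idx, _ := parseLeaseName("gpu-node-a-0")
	fmt.Println(node, idx) // node-a 0
}
```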
- Update the GpuClaim type in `api/v1/gpuclaim_types.go`:

  ```go
  type DeviceRequest struct {
      // Policy selects the placement policy; add "newpolicy" to this comment.
      Policy string `json:"policy,omitempty"`
  }
  ```

- Implement the scoring logic in `internal/topo/topology.go`:

  ```go
  func ScoreNewPolicy(devs []DeviceInfo, count int) (score int, pick []int) {
      // Your scoring logic
  }
  ```

- Update the scheduler plugin in `internal/plugin/gpuclaim/plugin.go`:

  ```go
  func (p *Plugin) Score(ctx context.Context, ...) (int64, *framework.Status) {
      // Call your new scoring function based on the claim's policy
  }
  ```

- Test and redeploy:

  ```shell
  go test ./internal/topo
  make docker
  kind load docker-image ghcr.io/restack/gpu-scheduler:dev
  kubectl rollout restart deployment gpu-scheduler
  ```
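To make the shape of a scoring policy concrete, here is a self-contained sketch that prefers placing all requested devices on one NUMA node. The `DeviceInfo` type and function name are simplified stand-ins for whatever `internal/topo` actually defines:

```go
package main

import (
	"fmt"
	"sort"
)

// DeviceInfo is a simplified stand-in for the real device descriptor.
type DeviceInfo struct {
	Index    int
	NUMANode int
}

// scoreSameNUMA picks `count` devices, preferring one NUMA node.
// The score is how many picked devices share the preferred node.
func scoreSameNUMA(devs []DeviceInfo, count int) (score int, pick []int) {
	byNode := map[int][]int{}
	for _, d := range devs {
		byNode[d.NUMANode] = append(byNode[d.NUMANode], d.Index)
	}
	nodes := make([]int, 0, len(byNode))
	for n := range byNode {
		nodes = append(nodes, n)
	}
	// Deterministic order: largest group first, ties broken by node id.
	sort.Slice(nodes, func(i, j int) bool {
		a, b := nodes[i], nodes[j]
		if len(byNode[a]) != len(byNode[b]) {
			return len(byNode[a]) > len(byNode[b])
		}
		return a < b
	})
	for _, n := range nodes {
		pick = append(pick, byNode[n]...)
	}
	if len(pick) < count {
		return 0, nil // not enough devices
	}
	pick = pick[:count]
	score = count
	if len(byNode[nodes[0]]) < count {
		score = len(byNode[nodes[0]]) // request spills across NUMA nodes
	}
	return score, pick
}

func main() {
	devs := []DeviceInfo{
		{Index: 0, NUMANode: 0}, {Index: 1, NUMANode: 0},
		{Index: 2, NUMANode: 1}, {Index: 3, NUMANode: 1}, {Index: 4, NUMANode: 1},
	}
	score, pick := scoreSameNUMA(devs, 2)
	fmt.Println(score, pick) // 2 [2 3]
}
```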
The agent currently uses placeholder GPU data. To integrate NVML:
- Add the NVML dependency to `go.mod`:

  ```shell
  go get github.com/NVIDIA/go-nvml/pkg/nvml
  ```

- Update `cmd/agent/main.go`:

  ```go
  import "github.com/NVIDIA/go-nvml/pkg/nvml"

  func discoverDevices() []apiv1.Device {
      nvml.Init()
      defer nvml.Shutdown()
      count, _ := nvml.DeviceGetCount()
      devices := make([]apiv1.Device, count)
      for i := 0; i < count; i++ {
          device, _ := nvml.DeviceGetHandleByIndex(i)
          _ = device // populate device info (name, memory, topology) from the handle
      }
      return devices
  }
  ```

- Update the Dockerfile to include the NVML library:

  ```dockerfile
  FROM nvidia/cuda:12.4.1-base-ubuntu22.04 AS build
  # Install NVML headers
  ```
- Update the types in `api/v1/`:

  ```go
  type GpuClaimSpec struct {
      NewField string `json:"newField,omitempty"`
  }
  ```

- Regenerate the CRD manifests (requires controller-gen):

  ```shell
  controller-gen crd paths=./api/v1 output:crd:dir=./charts/gpu-scheduler/templates
  ```

- Apply the updated CRDs:

  ```shell
  kubectl apply -f charts/gpu-scheduler/templates/crds.yaml
  ```
- Use `gofmt` for formatting (applied automatically by most editors)
- Follow Effective Go
- Keep functions small and focused
- Add godoc comments to exported functions

```shell
# Install golangci-lint
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
# Run linter
golangci-lint run
```

Check that pods specify the correct `schedulerName`:
```shell
kubectl get pods -o jsonpath='{.items[*].spec.schedulerName}'
```

Verify the webhook is registered:

```shell
kubectl get mutatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration gpu-scheduler-webhook -o yaml
```

Check the webhook service and endpoints:

```shell
kubectl get svc gpu-scheduler-webhook
kubectl get endpoints gpu-scheduler-webhook
```

Leases don't auto-delete when pods are removed. Options:
- Add finalizers to pods
- Implement garbage collection controller
- Use lease duration/renewals
- Manual cleanup in development
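The garbage-collection option boils down to computing which leases reference pods that no longer exist; a self-contained sketch (the trimmed `Lease` type and holder-UID convention are assumptions, not the repo's actual design):

```go
package main

import "fmt"

// Lease is a trimmed stand-in for a coordination.k8s.io/v1 Lease:
// Name is the lease name, Holder the UID of the pod that owns it.
type Lease struct {
	Name   string
	Holder string
}

// staleLeases returns the names of leases whose holder pod no longer
// exists; a GC controller would delete exactly these.
func staleLeases(leases []Lease, livePodUIDs map[string]bool) []string {
	var stale []string
	for _, l := range leases {
		if !livePodUIDs[l.Holder] {
			stale = append(stale, l.Name)
		}
	}
	return stale
}

func main() {
	leases := []Lease{
		{Name: "gpu-node-a-0", Holder: "uid-1"},
		{Name: "gpu-node-a-1", Holder: "uid-2"},
	}
	live := map[string]bool{"uid-1": true} // uid-2's pod was deleted
	fmt.Println(staleLeases(leases, live)) // [gpu-node-a-1]
}
```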
Verify plugin registration in logs:
```shell
kubectl logs deployment/gpu-scheduler | grep GpuClaimPlugin
```

You should see: `"Registered plugin" plugin="GpuClaimPlugin"`
- Update the version in `charts/gpu-scheduler/Chart.yaml`
- Build and tag images:

  ```shell
  make docker SCHED_IMG=ghcr.io/restack/gpu-scheduler:v0.1.0
  make docker-webhook WEBHOOK_IMG=ghcr.io/restack/gpu-scheduler-webhook:v0.1.0
  make docker-agent AGENT_IMG=ghcr.io/restack/gpu-scheduler-agent:v0.1.0
  ```

- Push the images
- Package the Helm chart:

  ```shell
  helm package charts/gpu-scheduler
  ```

- Create a GitHub release with the chart tarball
- Check logs first: scheduler, webhook, and agent
- Search issues on GitHub
- Enable debug logging (`--v=4`)
- Use `kubectl describe` on pods and claims