Development Guide

Prerequisites

  • Go 1.24.x
  • Docker
  • Kubernetes cluster (kind recommended for local development)
  • kubectl
  • Helm 3

Project Structure

gpu-scheduler/
├── api/v1/                    # CRD type definitions
│   ├── gpuclaim_types.go
│   └── gpunodestatus_types.go
├── cmd/                       # Entry points
│   ├── scheduler/main.go      # Scheduler binary
│   ├── webhook/main.go        # Webhook binary
│   └── agent/main.go          # Agent binary
├── internal/
│   ├── plugin/gpuclaim/       # Scheduler plugin implementation
│   ├── lease/                 # GPU lease management
│   ├── topo/                  # Topology scoring logic
│   └── util/                  # Shared utilities
├── charts/gpu-scheduler/      # Helm chart
└── hack/                      # Development scripts

Building

Build all Docker images

# Scheduler
make docker

# Webhook
make docker-webhook

# Agent
make docker-agent

Build specific component

The Dockerfile uses a CMD_PATH build argument to select which binary to build:

# Custom image tags
docker build --build-arg CMD_PATH=cmd/scheduler \
  -t my-registry/gpu-scheduler:dev .

docker build --build-arg CMD_PATH=cmd/webhook \
  -t my-registry/gpu-webhook:dev .

docker build --build-arg CMD_PATH=cmd/agent \
  -t my-registry/gpu-agent:dev .

Build locally (without Docker)

# Scheduler
go build -o bin/scheduler ./cmd/scheduler

# Webhook
go build -o bin/webhook ./cmd/webhook

# Agent
go build -o bin/agent ./cmd/agent

Local Development

Setup kind cluster

# Create cluster with GPU support (requires nvidia-docker)
kind create cluster --config hack/kind-cluster.yaml

# Or basic cluster for testing
kind create cluster --name gpu-test

Deploy to kind

# Build and load images into kind
make docker
kind load docker-image ghcr.io/restack/gpu-scheduler:dev --name gpu-test

make docker-webhook
kind load docker-image ghcr.io/restack/gpu-scheduler-webhook:dev --name gpu-test

make docker-agent
kind load docker-image ghcr.io/restack/gpu-scheduler-agent:dev --name gpu-test

# Deploy
helm install gpu-scheduler charts/gpu-scheduler

Quick iteration loop

# 1. Make code changes
vim internal/plugin/gpuclaim/plugin.go

# 2. Rebuild
make docker

# 3. Reload into kind
kind load docker-image ghcr.io/restack/gpu-scheduler:dev --name gpu-test

# 4. Restart pod
kubectl rollout restart deployment gpu-scheduler

# 5. Check logs
kubectl logs -f deployment/gpu-scheduler

Testing

Run unit tests

go test ./...

Run tests with coverage

go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

Test specific package

go test ./internal/lease -v
go test ./internal/plugin/gpuclaim -v
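
Table-driven tests are the usual shape for the scoring code. A minimal hedged sketch (ScorePolicy and the DeviceInfo fields stand in for whatever internal/topo actually exposes):

    package topo

    import "testing"

    func TestScorePolicy(t *testing.T) {
        cases := []struct {
            name  string
            devs  []DeviceInfo
            count int
            want  int
        }{
            {name: "empty node scores zero", devs: nil, count: 1, want: 0},
        }
        for _, tc := range cases {
            t.Run(tc.name, func(t *testing.T) {
                got, _ := ScorePolicy(tc.devs, tc.count)
                if got != tc.want {
                    t.Errorf("score = %d, want %d", got, tc.want)
                }
            })
        }
    }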

Integration testing

Create test workloads:

# Apply test claim
kubectl apply -f - <<EOF
apiVersion: gpu.scheduling/v1
kind: GpuClaim
metadata:
  name: test-claim
spec:
  devices:
    count: 1
    exclusivity: Exclusive
EOF

# Create test pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    gpu.scheduling/claim: test-claim
spec:
  schedulerName: gpu-scheduler
  restartPolicy: Never
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "echo CUDA_VISIBLE_DEVICES=\$CUDA_VISIBLE_DEVICES; sleep 30"]
EOF

# Check results
kubectl logs test-pod
kubectl get pod test-pod -o jsonpath='{.metadata.annotations}'
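
The same flow can be driven from a Go integration test using client-go. A hedged sketch of a helper that waits for the test pod to be bound (client construction omitted):

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForScheduled polls until the scheduler has bound the pod to a node.
    func waitForScheduled(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
        return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
            func(ctx context.Context) (bool, error) {
                pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
                if err != nil {
                    return false, err
                }
                return pod.Spec.NodeName != "", nil
            })
    }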

Debugging

View scheduler logs

kubectl logs -f deployment/gpu-scheduler

# Log verbosity is set at startup via --v (see "Enable debug mode" below);
# use grep to filter the stream for lines of interest
kubectl logs -f deployment/gpu-scheduler | grep -i gpuclaim

View webhook logs

kubectl logs -f deployment/gpu-scheduler-webhook

View agent logs

# All agents
kubectl logs -f daemonset/gpu-scheduler-agent

# Specific node
kubectl logs -f $(kubectl get pods --selector=name=gpu-scheduler-agent \
  --field-selector=spec.nodeName=node-a -o name)

Enable debug mode

Edit the deployment to increase log verbosity:

kubectl edit deployment gpu-scheduler

Add to container args:

args:
  - --v=4  # Kubernetes logging verbosity
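
Verbose lines in the Go code are gated on klog V levels, which is what --v=4 unlocks. Illustrative example (the message and keys are made up, not actual scheduler output):

    import "k8s.io/klog/v2"

    // Emitted only when the process runs with --v=4 or higher.
    klog.V(4).InfoS("scored node for claim", "node", nodeName, "claim", claimName, "score", score)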

Inspect CRD status

# List all claims with status
kubectl get gpuclaim -o yaml

# Specific claim
kubectl get gpuclaim my-claim -o jsonpath='{.status}'

# Watch claim status changes
kubectl get gpuclaim -w

Debug webhook

Test webhook locally:

# Port forward
kubectl port-forward svc/gpu-scheduler-webhook 8443:443

# Send test admission request (requires valid cert)
curl -k https://localhost:8443/mutate -d @test-admission-request.json
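
If crafting the JSON by hand is tedious, the same request can be built and sent from Go. A hedged sketch (the AdmissionReview envelope is the standard admission/v1 shape; skipping TLS verification mirrors curl -k and is for development only):

    package main

    import (
        "bytes"
        "crypto/tls"
        "encoding/json"
        "fmt"
        "net/http"

        admissionv1 "k8s.io/api/admission/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime"
    )

    func main() {
        // A pod carrying the claim annotation the webhook should mutate.
        pod := corev1.Pod{ObjectMeta: metav1.ObjectMeta{
            Name:        "test-pod",
            Annotations: map[string]string{"gpu.scheduling/claim": "test-claim"},
        }}
        raw, _ := json.Marshal(pod)

        review := admissionv1.AdmissionReview{
            TypeMeta: metav1.TypeMeta{APIVersion: "admission.k8s.io/v1", Kind: "AdmissionReview"},
            Request: &admissionv1.AdmissionRequest{
                UID:    "test-uid",
                Kind:   metav1.GroupVersionKind{Version: "v1", Kind: "Pod"},
                Object: runtime.RawExtension{Raw: raw},
            },
        }
        body, _ := json.Marshal(review)

        client := &http.Client{Transport: &http.Transport{
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // dev only
        }}
        resp, err := client.Post("https://localhost:8443/mutate", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }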

Check lease state

# All GPU leases
kubectl get leases | grep gpu-

# Lease details
kubectl get lease gpu-node-a-0 -o yaml

# Watch lease creation/deletion
kubectl get leases -w | grep gpu-

Adding Features

Adding a new allocation policy

  1. Update the GpuClaim type in api/v1/gpuclaim_types.go:

    type DeviceRequest struct {
        Policy string `json:"policy,omitempty"` // Add "newpolicy" to comment
    }
  2. Implement the scoring logic in internal/topo/topology.go (a fuller sketch follows this list):

    func ScoreNewPolicy(devs []DeviceInfo, count int) (score int, pick []int) {
        // Your scoring logic: return a node score and the chosen device indices
        return 0, nil
    }
  3. Update scheduler plugin in internal/plugin/gpuclaim/plugin.go:

    func (p *Plugin) Score(ctx context.Context, ...) (int64, *framework.Status) {
        // Call your new scoring function based on policy
    }
  4. Test:

    go test ./internal/topo
    make docker
    kind load docker-image ghcr.io/restack/gpu-scheduler:dev
    kubectl rollout restart deployment gpu-scheduler
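
As referenced in step 2, a slightly fuller hedged sketch of a policy implementation. The LinkGroup field is an assumption standing in for whatever topology data DeviceInfo actually carries:

    // Hypothetical "packed" policy: prefer devices that share a link group
    // (e.g. an NVLink clique) so the claim lands on tightly coupled GPUs.
    func ScorePacked(devs []DeviceInfo, count int) (score int, pick []int) {
        byGroup := map[int][]int{}
        for i, d := range devs {
            byGroup[d.LinkGroup] = append(byGroup[d.LinkGroup], i)
        }
        for _, idxs := range byGroup {
            if len(idxs) >= count {
                return 100, idxs[:count] // a single group satisfies the request
            }
        }
        if len(devs) >= count {
            for i := 0; i < count; i++ {
                pick = append(pick, i)
            }
            return 10, pick // spread across groups: allowed, but scored low
        }
        return 0, nil // not enough devices on this node
    }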

Implementing NVML integration

The agent currently uses placeholder GPU data. To integrate NVML:

  1. Add NVML dependency to go.mod:

    go get github.com/NVIDIA/go-nvml/pkg/nvml
  2. Update cmd/agent/main.go:

    import "github.com/NVIDIA/go-nvml/pkg/nvml"
    
    func discoverDevices() []apiv1.Device {
        nvml.Init()
        defer nvml.Shutdown()
    
        count, _ := nvml.DeviceGetCount()
        devices := make([]apiv1.Device, count)
    
        for i := 0; i < count; i++ {
            device, _ := nvml.DeviceGetHandleByIndex(i)
            // Populate device info
        }
    
        return devices
    }
  3. Update Dockerfile to include NVML library:

    FROM nvidia/cuda:12.4.1-base-ubuntu22.04 AS build
    # Install NVML headers

Adding CRD fields

  1. Update types in api/v1/:

    type GpuClaimSpec struct {
        NewField string `json:"newField,omitempty"`
    }
  2. Regenerate CRD manifests (requires controller-gen):

    controller-gen crd paths=./api/v1 output:crd:dir=./charts/gpu-scheduler/templates
  3. Apply updated CRDs:

    kubectl apply -f charts/gpu-scheduler/templates/crds.yaml
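
If the new field needs schema validation, controller-gen picks up kubebuilder markers on the type. A hedged example (the enum values are illustrative):

    type GpuClaimSpec struct {
        // NewField selects an optional per-claim behavior.
        // +kubebuilder:validation:Optional
        // +kubebuilder:validation:Enum=OptionA;OptionB
        NewField string `json:"newField,omitempty"`
    }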

Code Style

Go conventions

  • Use gofmt for formatting (most editors apply it automatically on save)
  • Follow Effective Go
  • Keep functions small and focused
  • Add godoc comments to exported functions

Linting

# Install golangci-lint
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest

# Run linter
golangci-lint run

Common Pitfalls

Pod stuck in Pending with no errors

Check that the scheduler is running and that pods specify the correct schedulerName:

kubectl get pods -o jsonpath='{.items[*].spec.schedulerName}'

Webhook not mutating pods

Verify webhook is registered:

kubectl get mutatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration gpu-scheduler-webhook -o yaml

Check webhook service and endpoints:

kubectl get svc gpu-scheduler-webhook
kubectl get endpoints gpu-scheduler-webhook

Leases not cleaned up

Leases don't auto-delete when pods are removed. Options:

  1. Add finalizers to pods
  2. Implement a garbage collection controller (see the sketch after this list)
  3. Use lease duration/renewals
  4. Manual cleanup in development
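
A hedged sketch of option 2, assuming the lease's holderIdentity records the owning pod's name (adapt to however internal/lease actually encodes ownership):

    import (
        "context"
        "strings"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // gcLeases deletes gpu- leases whose holder pod no longer exists.
    func gcLeases(ctx context.Context, cs kubernetes.Interface, ns string) error {
        leases, err := cs.CoordinationV1().Leases(ns).List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        for _, l := range leases.Items {
            if !strings.HasPrefix(l.Name, "gpu-") || l.Spec.HolderIdentity == nil {
                continue
            }
            _, err := cs.CoreV1().Pods(ns).Get(ctx, *l.Spec.HolderIdentity, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                _ = cs.CoordinationV1().Leases(ns).Delete(ctx, l.Name, metav1.DeleteOptions{})
            }
        }
        return nil
    }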

Scheduler plugin not loaded

Verify plugin registration in logs:

kubectl logs deployment/gpu-scheduler | grep GpuClaimPlugin

You should see a line like: "Registered plugin" plugin="GpuClaimPlugin"
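
For reference, out-of-tree plugins are registered through the scheduler framework's app.WithPlugin option. A hedged sketch of what cmd/scheduler/main.go likely looks like (the module path and the gpuclaim.Name/gpuclaim.New identifiers are assumptions):

    package main

    import (
        "os"

        "k8s.io/kubernetes/cmd/kube-scheduler/app"

        "github.com/restack/gpu-scheduler/internal/plugin/gpuclaim"
    )

    func main() {
        command := app.NewSchedulerCommand(
            app.WithPlugin(gpuclaim.Name, gpuclaim.New),
        )
        if err := command.Execute(); err != nil {
            os.Exit(1)
        }
    }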

Release Process

  1. Update version in charts/gpu-scheduler/Chart.yaml
  2. Build and tag images:
    make docker SCHED_IMG=ghcr.io/restack/gpu-scheduler:v0.1.0
    make docker-webhook WEBHOOK_IMG=ghcr.io/restack/gpu-scheduler-webhook:v0.1.0
    make docker-agent AGENT_IMG=ghcr.io/restack/gpu-scheduler-agent:v0.1.0
  3. Push images
  4. Package Helm chart:
    helm package charts/gpu-scheduler
  5. Create GitHub release with chart tarball

Getting Help

  • Check logs first: scheduler, webhook, and agent
  • Search issues on GitHub
  • Enable debug logging (--v=4)
  • Use kubectl describe on pods and claims