Quick start guide for local development using Kind (Kubernetes in Docker) with emulated GPU resources.
Note: This guide covers Kind-specific deployment for local testing. For a complete overview of deployment methods, Helm chart configuration, and the full configuration reference, see the main deployment guide.
- Prerequisites
- Quick Start
- Configuration Options
- Scripts
- Cluster Configuration
- Testing Locally
- Troubleshooting
- Development Workflow
- Clean Up
- Next Steps

## Prerequisites
- Docker
- Kind
- kubectl
- Helm
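Before starting, it can help to confirm all four tools are on PATH. A minimal check, sketched as a throwaway helper (the `need` function is illustrative, not part of the project):

```shell
# Report which prerequisite tools are present on PATH.
need() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

need docker kind kubectl helm
```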
## Quick Start

Deploy WVA with full llm-d infrastructure:
```shell
# From project root
make deploy-wva-emulated-on-kind
```

This creates:
- Kind cluster with 3 nodes, emulated GPUs (mixed vendors)
- WVA controller
- llm-d infrastructure (simulation mode)
- Prometheus monitoring
- vLLM emulator
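These components come up asynchronously after the make target returns. One way to wait for a deployment, sketched as a small polling helper (the function and the example names are illustrative; `KUBECTL` is overridable so the helper can be exercised without a live cluster):

```shell
# Poll a deployment until it reports at least one available replica,
# or give up after a timeout. KUBECTL defaults to the real CLI but can
# be pointed at a stub for offline testing.
KUBECTL="${KUBECTL:-kubectl}"

wait_ready() {
  ns=$1; deploy=$2; timeout=${3:-120}; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    avail=$("$KUBECTL" get deploy "$deploy" -n "$ns" \
      -o jsonpath='{.status.availableReplicas}' 2>/dev/null || echo 0)
    case "$avail" in (''|*[!0-9]*) avail=0 ;; esac
    if [ "$avail" -ge 1 ]; then echo "ready: $deploy"; return 0; fi
    sleep 5; elapsed=$((elapsed + 5))
  done
  echo "timed out: $deploy"; return 1
}

# Example (names illustrative):
# wait_ready workload-variant-autoscaler-system workload-variant-autoscaler-controller-manager 180
```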
## Configuration Options

For a complete list of environment variables and configuration options, see the Configuration Reference in the main deployment guide.
Key environment variables for the Kind emulator:

```shell
export HF_TOKEN="hf_xxxxx"                   # Required: HuggingFace token
export MODEL_ID="unsloth/Meta-Llama-3.1-8B"  # Model to deploy
export ACCELERATOR_TYPE="H100"               # Emulated GPU type
export GATEWAY_PROVIDER="kgateway"           # Gateway for Kind (kgateway recommended)
export ITL_AVERAGE_LATENCY_MS=20             # Average inter-token latency for the llm-d-inference-sim
export TTFT_AVERAGE_LATENCY_MS=200           # Average time-to-first-token for the llm-d-inference-sim

# Performance tuning (optional)
export VLLM_MAX_NUM_SEQS=64                  # vLLM max concurrent sequences (batch size)
export HPA_STABILIZATION_SECONDS=240         # HPA stabilization window

# Image load (optional; auto-detected if unset)
export KIND_IMAGE_PLATFORM=linux/amd64       # Single platform for kind load (avoids "digest not found")
```

Deployment flags:
```shell
export DEPLOY_PROMETHEUS=true          # Deploy Prometheus stack
export DEPLOY_WVA=true                 # Deploy WVA controller
export DEPLOY_LLM_D=true               # Deploy llm-d infrastructure (emulated)
export DEPLOY_PROMETHEUS_ADAPTER=true  # Deploy Prometheus Adapter
export DEPLOY_HPA=true                 # Deploy HPA
```

1. Create Kind cluster:
   ```shell
   make create-kind-cluster

   # With custom configuration
   make create-kind-cluster KIND_ARGS="-t mix -n 4 -g 2"
   # -t: vendor type (nvidia, amd, intel, mix)
   # -n: number of nodes
   # -g: GPUs per node
   ```

2. Deploy WVA only:
   ```shell
   export DEPLOY_WVA=true
   export DEPLOY_LLM_D=false
   export DEPLOY_PROMETHEUS=true  # Prometheus is needed for WVA to scrape metrics
   export VLLM_SVC_ENABLED=true
   export DEPLOY_PROMETHEUS_ADAPTER=false
   export DEPLOY_HPA=false
   make deploy-wva-emulated-on-kind
   ```

3. Deploy with llm-d (by default):
   ```shell
   make deploy-wva-emulated-on-kind
   ```

4. Testing configuration with fast saturation:
   ```shell
   export VLLM_MAX_NUM_SEQS=8           # Low batch size for easy saturation
   export HPA_STABILIZATION_SECONDS=30  # Fast scaling for testing
   make deploy-wva-emulated-on-kind
   ```

## Scripts

### setup.sh

Creates a Kind cluster with emulated GPU support.
```shell
./setup.sh -t mix -n 3 -g 2
```

Options:

- `-t`: Vendor type (nvidia|amd|intel|mix), default: mix
- `-n`: Number of nodes, default: 3
- `-g`: GPUs per node, default: 2
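If you wrap setup.sh in scripts of your own, flags like these are conventionally parsed with `getopts`; a sketch with the same defaults (illustrative only, not the actual contents of setup.sh):

```shell
# Parse -t/-n/-g flags with the same defaults as setup.sh.
parse_args() {
  OPTIND=1                       # reset so repeated calls parse from scratch
  VENDOR=mix; NODES=3; GPUS=2
  while getopts "t:n:g:" opt "$@"; do
    case "$opt" in
      t) VENDOR=$OPTARG ;;
      n) NODES=$OPTARG ;;
      g) GPUS=$OPTARG ;;
      *) echo "usage: $0 [-t vendor] [-n nodes] [-g gpus]" >&2; return 1 ;;
    esac
  done
  echo "vendor=$VENDOR nodes=$NODES gpus=$GPUS"
}

parse_args -t nvidia -n 4 -g 2   # prints: vendor=nvidia nodes=4 gpus=2
```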
### teardown.sh

Destroys the Kind cluster.

```shell
./teardown.sh
```

## Cluster Configuration

Default cluster created by setup.sh:
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev/null
    containerPath: /dev/nvidia0
- role: worker
- role: worker
```

GPUs are emulated using extended resources:

- `nvidia.com/gpu`
- `amd.com/gpu`
- `intel.com/gpu`
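Workloads consume these emulated GPUs like any other extended resource, via resource limits. A minimal sketch (the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo            # illustrative name
spec:
  containers:
  - name: worker
    image: busybox          # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1   # schedules onto a node advertising this extended resource
```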
## Testing Locally

Port-forward WVA metrics:

```shell
kubectl port-forward -n workload-variant-autoscaler-system \
  svc/workload-variant-autoscaler-controller-manager-metrics 8080:8080
```

Port-forward Prometheus:
```shell
kubectl port-forward -n workload-variant-autoscaler-monitoring \
  svc/prometheus-operated 9090:9090
```

Port-forward Inference Gateway:
```shell
kubectl port-forward -n llm-d-sim svc/infra-sim-inference-gateway 8000:80
```

```shell
# Apply sample VariantAutoscaling
kubectl apply -f ../../config/samples/
```

### Option A: Run E2E tests (recommended)
The e2e suite deploys infra, creates resources, generates load, and validates scaling. No manual load tool needed.
```shell
# From repo root, after deploying (e.g. make deploy-wva-emulated-on-kind)
make deploy-e2e-infra  # if not already done
make test-e2e-smoke    # quick validation
# or
make test-e2e-full     # full suite including saturation scaling
```

See Testing Guide and E2E Test Suite README.
### Option B: Manual load with burst script
Use the script in the e2e fixtures (requires only curl; no Python). After port-forwarding the inference gateway or vLLM service to localhost:8000:
```shell
# From repo root
export TARGET_URL="http://localhost:8000/v1/chat/completions"
export MODEL_ID="unsloth/Meta-Llama-3.1-8B"
export TOTAL_REQUESTS=100
export BATCH_SIZE=10
./hack/burst_load_generator.sh
```

Tune load with TOTAL_REQUESTS, BATCH_SIZE, and optional BATCH_SLEEP, MAX_TOKENS, CURL_TIMEOUT (see script header).
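The burst pattern itself is simple: send requests in parallel batches with a pause between batches. A sketch of that loop (illustrative only, not the actual hack/burst_load_generator.sh; DRY_RUN=1 prints the curl commands instead of sending them):

```shell
# Fire TOTAL_REQUESTS requests in batches of BATCH_SIZE, pausing
# BATCH_SLEEP seconds between batches.
burst() {
  total=${TOTAL_REQUESTS:-100}
  batch=${BATCH_SIZE:-10}
  pause=${BATCH_SLEEP:-1}
  sent=0
  while [ "$sent" -lt "$total" ]; do
    i=0
    while [ "$i" -lt "$batch" ] && [ "$sent" -lt "$total" ]; do
      if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "curl -s -X POST $TARGET_URL"
      else
        curl -s -X POST "$TARGET_URL" \
          -H 'Content-Type: application/json' \
          -d "{\"model\": \"$MODEL_ID\", \"messages\": [{\"role\": \"user\", \"content\": \"hi\"}]}" \
          >/dev/null &
      fi
      i=$((i + 1)); sent=$((sent + 1))
    done
    wait   # let the in-flight batch finish before the next one
    [ "$sent" -lt "$total" ] && sleep "$pause"
  done
  echo "sent $sent requests"
}

# Dry-run demo: prints five curl commands, then a summary line.
TOTAL_REQUESTS=5 BATCH_SIZE=2 DRY_RUN=1 BATCH_SLEEP=0 \
  TARGET_URL="http://localhost:8000/v1/chat/completions" burst
```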
```shell
# Watch deployments scale
watch kubectl get deploy -n llm-d-sim

# Watch VariantAutoscaling status
watch kubectl get variantautoscalings.llmd.ai -A

# View controller logs
kubectl logs -n workload-variant-autoscaler-system \
  -l control-plane=controller-manager -f
```

## Troubleshooting

### Cluster creation fails

```shell
# Clean up and retry
kind delete cluster --name kind-wva-gpu-cluster
make create-kind-cluster
```

### Controller not starting

```shell
# Check controller logs
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager

# Verify CRDs installed
kubectl get crd variantautoscalings.llmd.ai

# Check RBAC
kubectl get clusterrole,clusterrolebinding -l app=workload-variant-autoscaler
```

### Emulated GPUs not visible

```shell
# Verify GPU labels on nodes
kubectl get nodes -o json | jq '.items[].status.capacity'
# Should see nvidia.com/gpu, amd.com/gpu, or intel.com/gpu
```

### Port-forward fails

```shell
# Kill existing port-forwards
pkill -f "kubectl port-forward"

# Verify pod is running before port-forwarding
kubectl get pods -n <namespace>
```

### "digest not found" when loading images

This can happen when loading a multi-platform image into Kind: the image manifest references blobs for multiple platforms (e.g. linux/arm64, linux/amd64), but the stream that kind load feeds into containerd does not include all of them, so ctr reports a missing digest. See kubernetes-sigs/kind#3795 and kubernetes-sigs/kind#3845. The install script works around it by pulling a single-platform image before loading. If you still see the error or need a specific architecture, set the platform explicitly:
```shell
# Force linux/amd64 (e.g. for Intel or emulated nodes)
KIND_IMAGE_PLATFORM=linux/amd64 make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true

# Force linux/arm64 (e.g. for Apple Silicon with native arm64 nodes)
KIND_IMAGE_PLATFORM=linux/arm64 make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true
```

Alternatively, build the image locally and deploy with IfNotPresent so the script skips the registry pull and loads your local single-platform image:
```shell
make docker-build IMG=ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest
WVA_IMAGE_PULL_POLICY=IfNotPresent make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true
```

## Development Workflow
1. Make code changes.

2. Build new image:

   ```shell
   make docker-build IMG=localhost:5000/wva:dev
   ```

3. Load image to Kind:

   ```shell
   kind load docker-image localhost:5000/wva:dev --name kind-inferno-gpu-cluster
   ```

4. Update deployment:

   ```shell
   kubectl set image deployment/workload-variant-autoscaler-controller-manager \
     -n workload-variant-autoscaler-system \
     manager=localhost:5000/wva:dev
   ```

5. Verify changes:

   ```shell
   kubectl logs -n workload-variant-autoscaler-system \
     deployment/workload-variant-autoscaler-controller-manager -f
   ```
## Clean Up

Remove deployments:

```shell
make undeploy-wva-emulated-on-kind
```

Remove deployments and delete the Kind cluster:

```shell
make undeploy-wva-emulated-on-kind-delete-cluster
```

Destroy cluster:

```shell
make destroy-kind-cluster
```