Complete guide for deploying the Workload-Variant-Autoscaler (WVA) on Kubernetes, OpenShift, and Kind clusters.
Central Documentation Hub: This is the main deployment guide containing comprehensive information about deployment methods, Helm chart configuration, and complete configuration reference. Platform-specific guides (Kubernetes, OpenShift, Kind) provide additional platform-specific details and examples.
- Overview
- Prerequisites
- Deployment Methods
- Platform-Specific Guides
- Configuration Reference
- Post-Deployment
- Troubleshooting
This guide covers two deployment procedures:
- Automated Script: Complete end-to-end and customizable deployment including WVA, llm-d infrastructure, Prometheus, and HPA
- Helm Chart: Deploy the WVA controller into an existing cluster
All deployment methods require:
- kubectl (v1.24+) - Kubernetes CLI
- helm (v3.8+) - Package manager for Kubernetes
- git - Git CLI
Optional but recommended:
- yq (v4+) - YAML processor for configuration
Platform-specific requirements:
- OpenShift:
ocCLI (v4.12+) - Kind:
kindCLI for local testing
Minimum cluster specifications:
- Kubernetes 1.24+ or OpenShift 4.12+
- Metrics server or Prometheus available
- For non-emulated LLM workloads: GPU availability
Cluster access:
- Cluster admin privileges (for full deployment script)
- Or namespace admin + ability to create ClusterRole/ClusterRoleBinding (for Helm-only)
- HuggingFace Token (for llm-d deployment): after getting access to a model, set a token on HuggingFace
- Required for: Full deployment script with llm-d
- Not required for: Helm chart only deployment
The deployment script provides a complete, automated setup including:
- WVA controller with RBAC configuration
- Prometheus stack (or connects to existing)
- llm-d infrastructure (Gateway, Scheduler, vLLM)
- Prometheus Adapter for external metrics
- ServiceMonitors for metric collection
- VariantAutoscaling custom resources
- HPA configuration
- Automatic GPU detection
- Environment-specific optimizations
# Set required environment variable
export HF_TOKEN="hf_xxxxxxxxxxxxx"
# Deploy to Kubernetes
make deploy-wva-on-k8s
# Deploy to OpenShift
make deploy-wva-on-openshift
# Deploy to Kind (with emulated GPUs)
make deploy-wva-emulated-on-kind# Navigate to deploy directory
cd deploy
# Set environment variables
export HF_TOKEN="hf_xxxxxxxxxxxxx"
export ENVIRONMENT="kubernetes" # or "openshift", "kind-emulator"
# Run deployment script
bash install.shThe script accepts both command-line flags and environment variables:
Command-line flags:
bash install.sh [OPTIONS]
Options:
-i, --wva-image IMAGE WVA container image (default: ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest)
-m, --model MODEL Model ID (default: unsloth/Meta-Llama-3.1-8B)
-a, --accelerator TYPE GPU type: A100, H100, L40S (auto-detected by default)
-u, --undeploy Undeploy all components
-e, --environment ENV Environment: kubernetes, openshift, kind-emulator
-h, --help Show helpEnvironment variables (see Configuration Reference for complete list):
# Core Configuration
export HF_TOKEN="hf_xxx" # HuggingFace token (required for llm-d)
export MODEL_ID="unsloth/Meta-Llama-3.1-8B" # Model to deploy
export ACCELERATOR_TYPE="A100" # GPU type (auto-detected)
export WVA_IMAGE_TAG="latest" # WVA image version
# Deployment Flags
export DEPLOY_PROMETHEUS=true # Deploy Prometheus stack
export DEPLOY_WVA=true # Deploy WVA controller
export DEPLOY_LLM_D=true # Deploy llm-d infrastructure
export DEPLOY_PROMETHEUS_ADAPTER=true # Deploy Prometheus Adapter
export DEPLOY_VA=true # Deploy VariantAutoscaling CR
export DEPLOY_HPA=true # Deploy HPA
# HPA Configuration
export HPA_STABILIZATION_SECONDS=240 # HPA stabilization window (default: 240s)
# Gateway Configuration
export GATEWAY_PROVIDER="istio" # Gateway provider: istio, kgateway
export INSTALL_GATEWAY_CTRLPLANE="true" # Install gateway control plane
export BENCHMARK_MODE="true" # Use benchmark configuration (for Istio only)
# SLO Targets
export SLO_TPOT=10 # Time per output token (ms) SLO
export SLO_TTFT=1000 # Time to first token (ms) SLOWhen running the script, it will prompt you about Gateway control plane installation:
Gateway Control Plane Configuration
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Gateway control plane (istio) is required to serve requests.
You can either:
1. Install the Gateway control plane (recommended for new clusters or emulated clusters)
2. Use an existing Gateway control plane in your cluster (recommended for production clusters)
Do you want to install the Gateway control plane? (y/n):To skip this prompt and automate the decision:
# Install gateway control plane automatically
export INSTALL_GATEWAY_CTRLPLANE="true"
bash install.sh
# Use existing gateway (don't install)
export INSTALL_GATEWAY_CTRLPLANE="false"
bash install.shexport HF_TOKEN="hf_xxxxx"
make deploy-wva-on-k8sexport HF_TOKEN="hf_xxxxx"
export MODEL_ID="meta-llama/Llama-2-7b-hf"
export SLO_TPOT=5
export SLO_TTFT=500
make deploy-wva-on-k8sexport DEPLOY_WVA=true
export DEPLOY_LLM_D=false
export DEPLOY_PROMETHEUS=true # Prometheus is needed for metrics - disable if it is already installed in your cluster
export DEPLOY_PROMETHEUS_ADAPTER=false
export DEPLOY_HPA=false
make deploy-wva-on-k8sexport HF_TOKEN="hf_xxxxx"
export HPA_STABILIZATION_SECONDS=30 # Fast scaling for dev/test (default: 240)
make deploy-wva-on-k8sexport HF_TOKEN="hf_xxxxx"
export HPA_STABILIZATION_SECONDS=0 # Immediate scaling for e2e tests
export VLLM_MAX_NUM_SEQS=8 # Low batch size for easy saturation
export E2E_TESTS_ENABLED=true
make deploy-wva-on-k8sexport HF_TOKEN="hf_xxxxx"
export VLLM_MAX_NUM_SEQS=64 # Match desired max batch size
export MODEL_ID="unsloth/Meta-Llama-3.1-8B"
make deploy-wva-on-k8sDeploy only the llm-d infrastructure and WVA controller without creating VariantAutoscaling or HPA resources. This is useful for e2e testing where tests dynamically create their own VA/HPA resources.
# Using command-line flag
export HF_TOKEN="hf_xxxxx"
./deploy/install.sh --infra-only
# Using environment variable
export HF_TOKEN="hf_xxxxx"
export INFRA_ONLY=true
make deploy-wva-emulated-on-kind
# Verify: Only WVA controller + llm-d infrastructure should exist
kubectl get variantautoscaling --all-namespaces # Should be empty
kubectl get hpa --all-namespaces | grep -v kube-system # Should be empty (except system HPAs)What gets deployed in infra-only mode:
- ✅ Prometheus stack (metrics collection)
- ✅ WVA controller
- ✅ llm-d infrastructure (Gateway, CRDs, RBAC, EPP)
- ✅ Prometheus Adapter (external metrics API)
- ❌ VariantAutoscaling CRs (tests create these)
- ❌ HPA resources (tests create these)
- ❌ Model services (tests create these)
### Method 2: Helm Chart
The WVA can be deployed as a standalone using Helm, assuming you have:
- Existing Prometheus server
- Existing vLLM deployment
- ServiceMonitors configured
- Prometheus Adapter (optional, for HPA)
This method is particularly useful when there is one (or more) existing llm-d infrastructure deployed
#### Helm Chart Quick Start
```bash
# Add WVA Helm repository (if published)
# helm repo add wva https://llm-d.github.io/workload-variant-autoscaler
# helm repo update
# Or install from local chart
cd charts/workload-variant-autoscaler
# Install with default values
helm install workload-variant-autoscaler . \
--namespace workload-variant-autoscaler-system \
--create-namespace
# Or install with custom values
helm install workload-variant-autoscaler . \
--namespace workload-variant-autoscaler-system \
--create-namespace \
--values my-values.yaml
The Helm chart has several configurable parameters. Here's a comprehensive example based on the default values:
Create my-values.yaml:
# WVA Controller Configuration
wva:
enabled: true
# Image configuration
image:
repository: ghcr.io/llm-d/llm-d-workload-variant-autoscaler
tag: latest
imagePullPolicy: Always
# Metrics configuration
metrics:
enabled: true
port: 8443 # Secure metrics port
secure: true # Enable secure metrics endpoint
# Reconciliation interval
reconcileInterval: 60s
# Prometheus configuration
prometheus:
monitoringNamespace: workload-variant-autoscaler-monitoring
baseURL: "https://prometheus-k8s.monitoring.svc.cluster.local:9090"
# TLS configuration
tls:
# CA certificate path inside container
caCertPath: "/etc/ssl/certs/prometheus-ca.crt"
# Set to true to skip TLS verification (not recommended for production)
insecureSkipVerify: false
# Provide CA certificate directly
# caCert: |
# -----BEGIN CERTIFICATE-----
# YOUR_CA_CERTIFICATE_HERE
# -----END CERTIFICATE-----
# Logging configuration
logging:
level: info # debug, info, warn, error
# llm-d Infrastructure Configuration
llmd:
namespace: llm-d-inference-scheduler
modelName: ms-inference-scheduling-llm-d-modelservice
modelID: "unsloth/Meta-Llama-3.1-8B"
# VariantAutoscaling Configuration
va:
enabled: true # Create VariantAutoscaling CR
accelerator: H100 # GPU type: A100, H100, L40S, etc.
sloTpot: 10 # Time per output token SLO (ms)
sloTtft: 1000 # Time to first token SLO (ms)
# HPA Configuration
hpa:
enabled: true # Create HPA resource
maxReplicas: 10 # Maximum number of replicas
targetAverageValue: "1" # Target value for external metric
# Scaling behavior configuration
behavior:
scaleUp:
stabilizationWindowSeconds: 240 # Wait 240s before scaling up (production default)
selectPolicy: Max # Use maximum scale from policies
policies:
- type: Pods
value: 10 # Scale up by max 10 pods
periodSeconds: 150 # Per 150 second period
scaleDown:
stabilizationWindowSeconds: 240 # Wait 240s before scaling down (production default)
selectPolicy: Max
policies:
- type: Pods
value: 10 # Scale down by max 10 pods
periodSeconds: 150 # Per 150 second period
# vLLM Service Configuration
vllmService:
enabled: true # Create Service for vLLM
nodePort: 30000 # NodePort for external access
interval: 15s # ServiceMonitor scrape interval
scheme: http # http or httpsInstall with custom values:
helm install workload-variant-autoscaler ./charts/workload-variant-autoscaler \
--namespace workload-variant-autoscaler-system \
--create-namespace \
--values my-values.yamlFor a minimal installation (just the controller, no VariantAutoscaling or HPA):
# Create minimal values file
cat > minimal-values.yaml <<EOF
wva:
enabled: true
image:
tag: latest
imagePullPolicy: Always
prometheus:
baseURL: "https://my-prometheus.monitoring.svc.cluster.local:9090"
monitoringNamespace: monitoring
tls:
insecureSkipVerify: true # Only for dev/testing
logging:
level: info
# Disable auto-creation of resources
va:
enabled: false
hpa:
enabled: false
vllmService:
enabled: false
EOF
# Install
helm install workload-variant-autoscaler ./charts/workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--create-namespace \
-f minimal-values.yamlIf you have an existing Prometheus (e.g., kube-prometheus-stack):
# prometheus-values.yaml
wva:
prometheus:
baseURL: "https://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"
monitoringNamespace: monitoring
tls:
insecureSkipVerify: false # Use proper TLS verification
caCertPath: "/etc/ssl/certs/prometheus-ca.crt"
# Provide CA cert via file
# caCert: |
# -----BEGIN CERTIFICATE-----
# ...
# -----END CERTIFICATE-----
metrics:
enabled: true
port: 8443
secure: true
# Configure llm-d details
llmd:
namespace: my-llm-namespace
modelName: my-vllm-deployment
modelID: "meta-llama/Llama-2-7b-hf"
# Create VariantAutoscaling
va:
enabled: true
accelerator: A100
sloTpot: 10
sloTtft: 1000
# Create HPA
hpa:
enabled: true
maxReplicas: 10helm install workload-variant-autoscaler ./charts/workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--create-namespace \
-f prometheus-values.yamlIf you need to provide a CA certificate for Prometheus TLS:
# Create a CA certificate file
kubectl get secret prometheus-web-tls \
-n monitoring \
-o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/prometheus-ca.crt
# Install with CA certificate
helm install workload-variant-autoscaler ./charts/workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--create-namespace \
--set-file wva.prometheus.caCert=/tmp/prometheus-ca.crt \
--set wva.prometheus.baseURL="https://prometheus-k8s.monitoring.svc:9090" \
--set wva.prometheus.tls.insecureSkipVerify=falseIf you don't create VariantAutoscaling via Helm, create it manually:
cat <<EOF | kubectl apply -f -
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
name: my-vllm-deployment-decode
namespace: llm-d-inference-scheduler
labels:
inference.optimization/acceleratorName: A100
spec:
# Model identifier
modelID: "unsloth/Meta-Llama-3.1-8B"
EOFIf using HPA with external metrics:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-deployment-hpa
namespace: llm-d-inference-scheduler
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-vllm-deployment-decode
maxReplicas: 10
# minReplicas: 0 # Scale to zero is an alpha feature
behavior:
scaleUp:
stabilizationWindowSeconds: 0 # Tune based on your needs
policies:
- type: Pods
value: 10
periodSeconds: 15
scaleDown:
stabilizationWindowSeconds: 0 # Tune based on your needs
policies:
- type: Pods
value: 10
periodSeconds: 15
metrics:
- type: External
external:
metric:
name: wva_desired_replicas
selector:
matchLabels:
variant_name: my-vllm-deployment-decode
target:
type: AverageValue
averageValue: "1"
EOF# Uninstall the release
helm uninstall workload-variant-autoscaler -n workload-variant-autoscaler-systemFor platform-specific instructions and considerations:
- Kubernetes Guide: Detailed Kubernetes-specific instructions including kube-prometheus-stack setup, GPU operator installation, and ServiceMonitor configuration
- OpenShift Guide: OpenShift-specific instructions including User Workload Monitoring (Thanos), Routes, Security Context Constraints (SCC), and GPU operator on OpenShift
- Kind Guide (Local Testing): Local development and testing with Kind clusters and emulated GPUs
Each guide includes platform-specific examples, troubleshooting, and quick start commands. All guides use the same Configuration Reference documented below.
| Variable | Description | Required For |
|---|---|---|
HF_TOKEN |
HuggingFace token | llm-d deployment |
| Variable | Description | Default |
|---|---|---|
ENVIRONMENT |
Deployment environment | kubernetes |
WELL_LIT_PATH_NAME |
llm-d path name | inference-scheduling |
MODEL_ID |
Model to deploy | unsloth/Meta-Llama-3.1-8B |
ACCELERATOR_TYPE |
GPU type | Auto-detected |
SLO_TPOT |
Time per output token SLO (ms) | 10 |
SLO_TTFT |
Time to first token SLO (ms) | 1000 |
| Variable | Description | Default |
|---|---|---|
WVA_IMAGE_REPO |
WVA image repository | ghcr.io/llm-d/llm-d-workload-variant-autoscaler |
WVA_IMAGE_TAG |
WVA image tag | latest |
WVA_IMAGE_PULL_POLICY |
Image pull policy | Always |
| Variable | Description | Default |
|---|---|---|
WVA_NS |
WVA controller namespace | workload-variant-autoscaler-system |
MONITORING_NAMESPACE |
Prometheus namespace | workload-variant-autoscaler-monitoring |
LLMD_NS |
llm-d namespace | llm-d-inference-scheduler |
| Variable | Description | Default |
|---|---|---|
GATEWAY_PROVIDER |
Gateway implementation | istio |
INSTALL_GATEWAY_CTRLPLANE |
Install gateway control plane | false (prompts user) |
BENCHMARK_MODE |
Use benchmark Istio config | true |
| Variable | Description | Default |
|---|---|---|
DEPLOY_PROMETHEUS |
Deploy Prometheus stack | true |
DEPLOY_WVA |
Deploy WVA controller | true |
DEPLOY_LLM_D |
Deploy llm-d infrastructure | true |
DEPLOY_PROMETHEUS_ADAPTER |
Deploy Prometheus Adapter | true |
DEPLOY_VA |
Deploy VariantAutoscaling CR | true |
DEPLOY_HPA |
Deploy HPA | true |
INFRA_ONLY |
Deploy only infrastructure (skip VA/HPA) | false |
SKIP_CHECKS |
Skip prerequisite checks | false |
| Variable | Description | Default |
|---|---|---|
HPA_STABILIZATION_SECONDS |
HPA stabilization window in seconds | 240 |
Best Practices:
- Production: 120-300 seconds (prevents flapping, ensures stability)
- Development: 30-60 seconds (faster iteration)
- E2E Tests: 0-30 seconds (rapid validation)
Examples:
# Production deployment (default)
HPA_STABILIZATION_SECONDS=240 ./deploy/install.sh
# Development deployment
HPA_STABILIZATION_SECONDS=60 ./deploy/install.sh
# E2E testing
HPA_STABILIZATION_SECONDS=30 ./deploy/install.sh| Variable | Description | Default |
|---|---|---|
SKIP_TLS_VERIFY |
Skip TLS verification | Auto-detected |
WVA_LOG_LEVEL |
WVA logging level | info |
VLLM_SVC_ENABLED |
Enable vLLM Service | true |
VLLM_SVC_NODEPORT |
vLLM NodePort | 30000 |
LLM_D_RELEASE |
llm-d version | v0.3.0 |
VLLM_MAX_NUM_SEQS |
vLLM max concurrent sequences per replica | (unset - uses vLLM default) |
vLLM Performance Tuning:
The VLLM_MAX_NUM_SEQS variable controls the maximum number of concurrent sequences (batch size) that each vLLM replica can handle. Lower values make the server easier to saturate, which is useful for testing autoscaling behavior.
Use cases:
- E2E Testing: Set to low values (e.g.,
8or16) to quickly trigger saturation and test autoscaling - Parameter Estimation: Match this to your desired maximum batch size (see Parameter Estimation Guide)
- Production: Leave unset to use vLLM's default based on available GPU memory
Example:
# E2E testing with easy saturation
export VLLM_MAX_NUM_SEQS=8
make deploy-wva-on-k8s
# Parameter estimation with batch size 64
export VLLM_MAX_NUM_SEQS=64
make deploy-wva-on-k8sSee the values.yaml file for complete Helm chart configuration options.
Check all components:
# WVA controller
kubectl get pods -n workload-variant-autoscaler-system
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler
# VariantAutoscaling resources
kubectl get variantautoscaling -A
kubectl describe variantautoscaling <name> -n <namespace>
# HPA (if deployed)
kubectl get hpa -A
# External metrics (if Prometheus Adapter deployed)
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jqCheck metrics flow:
# 1. Verify vLLM is exposing metrics
kubectl port-forward -n <llm-namespace> <vllm-pod> 8000:8000
curl http://localhost:8000/metrics | grep vllm:
# 2. Verify Prometheus is scraping
kubectl port-forward -n <monitoring-namespace> svc/prometheus-k8s 9090:9090
# Visit http://localhost:9090 and query: vllm:num_requests_running
# 3. Verify WVA is collecting metrics
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler | grep "Collected metrics"
# 4. Verify external metrics API (if using HPA)
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace>/wva_desired_replicas" | jqView logs with filtering:
# All logs
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler -f
# Filter for optimization decisions
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler | \
grep "OptimizationComplete"
# Filter for metrics issues
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler | \
grep "MetricsMissing\|MetricsStale"
# JSON parsing (if using JSON logging)
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler | \
jq 'select(.level=="ERROR")'Check VariantAutoscaling status:
# List with custom columns
kubectl get variantautoscaling -A -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
MODEL:.spec.model,\
CURRENT:.status.currentReplicas,\
DESIRED:.status.desiredReplicas,\
METRICS:.status.conditions[?(@.type=="MetricsAvailable")].status
# Detailed status
kubectl describe variantautoscaling <name> -n <namespace>For rapid testing of autoscaling behavior, configure vLLM with a low max-num-seqs value to make the server easy to saturate:
# Deploy with testing-friendly configuration
export VLLM_MAX_NUM_SEQS=8 # Low batch size triggers saturation quickly
export HPA_STABILIZATION_SECONDS=30 # Fast scaling for testing
make deploy-wva-on-k8s # or -on-openshift, -emulated-on-kindThis configuration helps you:
- Quickly verify autoscaling behavior without heavy load
- Test WVA's saturation detection
- Validate HPA integration
- Debug scaling issues faster
Generate load:
# Using guidellm (if available)
guidellm bench \
--url http://<vllm-service>:<port>/v1 \
--model <model-id> \
--rate 50 \
--duration 300
# Using simple loop
for i in {1..100}; do
curl -X POST http://<vllm-service>:<port>/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"<model-id>","prompt":"Hello","max_tokens":100}' &
doneWatch autoscaling:
# Watch VariantAutoscaling
watch kubectl get variantautoscaling -n <namespace>
# Watch HPA
watch kubectl get hpa -n <namespace>
# Watch pods
watch kubectl get pods -n <namespace>
# Watch external metrics
watch 'kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace>/wva_desired_replicas" | jq'Symptoms:
kubectl get pods -n workload-variant-autoscaler-system
NAME READY STATUS RESTARTS
workload-variant-autoscaler-controller-xxx 0/1 Pending 0Diagnosis:
kubectl describe pod <pod-name> -n workload-variant-autoscaler-system
kubectl logs <pod-name> -n workload-variant-autoscaler-systemCommon causes:
- Insufficient resources: Check resource requests in values
- Image pull errors: validate image repository and tag
- TLS configuration: Verify TLS configuration is correct
- Prometheus configuration: Verify that the WVA can reach Prometheus
Solutions:
# Check image in the pod description
kubectl describe pods -n workload-variant-autoscaler-system pod-name
# Look for TLS- or Prometheus-related error logs
kubectl logs <pod-name> -n workload-variant-autoscaler-systemSymptoms:
- VariantAutoscaling shows
MetricsAvailable: False - WVA logs show "Metrics unavailable" warnings
Diagnosis:
# Check WVA logs for metrics errors
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler | \
grep -i "metrics"
# Check if vLLM is exposing metrics
kubectl port-forward -n <namespace> <vllm-pod> 8000:8000
curl -s http://localhost:8000/metrics
# Check ServiceMonitor
kubectl get servicemonitor -A
kubectl describe servicemonitor <name> -n <namespace>
# Check Prometheus targets
kubectl port-forward -n <monitoring-namespace> svc/prometheus-k8s 9090:9090
# Visit http://localhost:9090/targetsCommon causes:
- ServiceMonitor not created or in wrong namespace
- Service selector doesn't match vLLM pods
- Prometheus not scraping the namespace
- vLLM not exposing metrics on expected port
Solutions:
# Verify ServiceMonitor selector matches Service
kubectl get svc -n <namespace> --show-labels
kubectl get servicemonitor -n <monitoring-namespace> -o yaml | grep -A 5 selector
# Check Prometheus configuration
kubectl get prometheus -A -o yaml | grep serviceMonitorNamespaceSelector
# Manually test metrics endpoint
kubectl exec -it <vllm-pod> -n <namespace> -- curl localhost:8000/metricsSymptoms:
kubectl get hpa -n <namespace>
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
my-hpa Deployment/vllm <unknown>/1(avg) 1 10 1Diagnosis:
# Check HPA status
kubectl describe hpa <name> -n <namespace>
# Check external metrics API on the specified namespace
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<your-namespace>/wva_desired_replicas" | jq
# Check Prometheus Adapter logs
kubectl logs -n <monitoring-namespace> deployment/prometheus-adapter
# Check if WVA is emitting the metric
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler | \
grep "wva_desired_replicas"Common causes:
- Prometheus Adapter not deployed
- External metrics API not registered
- Metric selector doesn't match emitted labels
- WVA not emitting metrics (due to metrics unavailability)
Solutions:
# Verify external metrics API
kubectl api-resources | grep external.metrics
# Check Prometheus Adapter configuration
kubectl get configmap prometheus-adapter -n <monitoring-namespace> -o yaml
# Verify metric exists in Prometheus
kubectl port-forward -n <monitoring-namespace> svc/prometheus-k8s 9090:9090
# Query: wva_desired_replicas{variant_name="<name>"}If you encounter issues not covered here:
- Check logs: WVA, Prometheus, Prometheus Adapter, vLLM
- Verify configuration: VariantAutoscaling spec, ServiceMonitor, HPA
- Test components individually: Metrics exposure, Prometheus scraping, external metrics API
- Review documentation: Platform-specific READMEs
- Open an issue: Include logs, configuration, and environment details
# === WVA Controller ===
kubectl get pods -n workload-variant-autoscaler-system
kubectl logs -n workload-variant-autoscaler-system -l app.kubernetes.io/name=workload-variant-autoscaler -f
kubectl describe deployment workload-variant-autoscaler-controller-manager -n workload-variant-autoscaler-system
# === VariantAutoscaling Resources ===
kubectl get variantautoscaling -A
kubectl describe variantautoscaling <name> -n <namespace>
kubectl get variantautoscaling <name> -n <namespace> -o yaml
# === Metrics and Monitoring ===
kubectl get servicemonitor -A
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
kubectl port-forward -n <monitoring-namespace> svc/prometheus-k8s 9090:9090
# === HPA ===
kubectl get hpa -A
kubectl describe hpa <name> -n <namespace>
kubectl get hpa <name> -n <namespace> -o yaml
# === Prometheus Adapter ===
kubectl get pods -n <monitoring-namespace> | grep prometheus-adapter
kubectl logs -n <monitoring-namespace> deployment/prometheus-adapter
# === vLLM / Application ===
kubectl get pods -n <app-namespace>
kubectl logs -n <app-namespace> <vllm-pod>
kubectl port-forward -n <app-namespace> <vllm-pod> 8000:8000
# === Configuration ===
kubectl get configmap -n workload-variant-autoscaler-system
kubectl get configmap service-classes -n workload-variant-autoscaler-system -o yaml
kubectl get configmap model-accelerator-data -n workload-variant-autoscaler-system -o yaml- Main Project: README.md
- Kubernetes Guide: kubernetes/README.md
- OpenShift Guide: openshift/README.md
- Helm Chart: charts/workload-variant-autoscaler
- API Reference: api/v1alpha1
- Architecture: docs/design/modeling-optimization.md