This repository provisions an AWS EKS cluster with GPU nodes and deploys a FastAPI inference service running a Swin-Tiny image classification model. KEDA scales pods horizontally on a custom Prometheus metric, the Cluster Autoscaler scales GPU nodes, and the NVIDIA GPU Operator provides optional GPU time-slicing so multiple pods can share a GPU.
- GPU-backed inference: PyTorch `swin_t` model served with FastAPI/Uvicorn
- Autoscaling: KEDA scales pods using a custom latency metric from Prometheus
- Node autoscaling: Cluster Autoscaler scales GPU nodes
- Observability: `/metrics` endpoint scraped by Prometheus via a `ServiceMonitor`
- GPU sharing: Optional time-slicing to run multiple pods per GPU
- Terraform creates the VPC, EKS cluster (managed node group with `g4dn.xlarge`), and security rules.
- Helm installs the NVIDIA GPU Operator, Prometheus Stack, KEDA, and Cluster Autoscaler.
- Kubernetes manifests deploy the app: `Deployment`, `Service` (LoadBalancer), `ServiceMonitor`, and KEDA `ScaledObject`.
- AWS account and credentials configured (Administrator or equivalent)
- macOS/Linux shell with the following installed: `aws` CLI, `terraform`, `kubectl`, `helm`
- Docker and a container registry account (Docker Hub or ECR)
- GPU quota for `g4dn.xlarge` in the chosen region (defaults to `us-east-1`)
- `app/`: FastAPI app, `Dockerfile`, `pyproject.toml`, `run.sh`, test client
- `kubernetes/`: Deployment, Service, ServiceMonitor, KEDA ScaledObject, time-slicing config
- `terraform/`: VPC, EKS, IRSA for Cluster Autoscaler, security groups, outputs
The app code and Dockerfile are in `app/`. The image uses `uv` to install Python dependencies and runs `uvicorn`.
cd app
# Build
docker build -t <registry>/<repo>:<tag> .
# Login and push (Docker Hub example)
docker login
docker push <registry>/<repo>:<tag>

Update the image reference in `kubernetes/deployment.yaml` under `spec.template.spec.containers[0].image`.
cd terraform
terraform init
terraform apply
# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name swin-tiny-eks-cluster

Outputs include the IAM role ARN for the Cluster Autoscaler IRSA.
# Capture the IRSA role ARN created by Terraform
ROLE_ARN=$(terraform output -raw cluster_autoscaler_role_arn)
# Add autoscaler repo and install
helm repo add autoscaler https://kubernetes.github.io/autoscaler && helm repo update
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system --create-namespace \
--set autoDiscovery.clusterName=swin-tiny-eks-cluster \
--set awsRegion=us-east-1 \
--set rbac.serviceAccount.create=true \
--set rbac.serviceAccount.name=cluster-autoscaler \
--set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="${ROLE_ARN}"

Note: Scale-down decisions may take ~30 minutes.
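If the chart installed correctly, the `cluster-autoscaler` ServiceAccount carries the IRSA annotation. A quick sanity check is `kubectl -n kube-system get sa cluster-autoscaler -o yaml`; the relevant part should look roughly like the sketch below (the account ID and role name are placeholders, not values from this repo):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    # Injected via the Helm --set above; lets the pod assume the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<cluster-autoscaler-irsa-role>
```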
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.3.2

Time-slicing lets multiple pods share a single GPU.
# Apply time-slicing config map
kubectl apply -f kubernetes/time-slicing-config-all.yaml -n gpu-operator
# Patch cluster policy to use it
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
-n gpu-operator --type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'It may take a few minutes for the configuration to take effect.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--set prometheus.service.type=LoadBalancer

`kubernetes/serviceMonitor.yaml` is configured to let Prometheus scrape the app's `/metrics` endpoint.
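A ServiceMonitor wired for this stack typically looks like the sketch below; the names, labels, and port name are assumptions, so check `kubernetes/serviceMonitor.yaml` for the actual values. The important detail is that the `release: prometheus` label matches the Helm release name used above, otherwise the Prometheus Operator ignores the monitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: swin-tiny-app
  labels:
    release: prometheus        # must match the kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app: swin-tiny-app       # must match the app Service's labels
  endpoints:
    - port: http               # must match a named port on the Service
      path: /metrics
      interval: 15s
```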
helm repo add kedacore https://kedacore.github.io/charts && helm repo update
helm install keda kedacore/keda

cd ..
kubectl apply -f kubernetes/

This applies the Deployment, Service (type LoadBalancer on port 8000), ServiceMonitor, and ScaledObject.
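As a rough sketch of what those manifests contain (names, labels, and the GPU limit are assumptions; the files under `kubernetes/` are authoritative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: swin-tiny-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: swin-tiny-app
  template:
    metadata:
      labels:
        app: swin-tiny-app
    spec:
      containers:
        - name: swin-tiny-app
          image: <registry>/<repo>:<tag>   # the image pushed earlier
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1            # one (possibly time-sliced) GPU per pod
---
apiVersion: v1
kind: Service
metadata:
  name: swin-tiny-app
  labels:
    app: swin-tiny-app
spec:
  type: LoadBalancer
  selector:
    app: swin-tiny-app
  ports:
    - name: http
      port: 8000
      targetPort: 8000
```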
- `POST /predict`: multipart file upload under form key `file`; returns the predicted class
- `GET /metrics`: Prometheus metrics
- `GET /health`: health check
Example request with curl:
curl -X POST \
-F "file=@/path/to/image.jpg" \
http://<APP_LOADBALANCER_DNS>:8000/predict

The ScaledObject (`kubernetes/keda-scaledobject.yaml`) evaluates average request duration over a 2-minute window and scales pods between 1 and 100 replicas.
PromQL used:
avg(
rate(http_request_duration_seconds_sum[2m])
/
clamp_min(rate(http_request_duration_seconds_count[2m]), 1)
)
Threshold (default): 0.020 seconds. Adjust `threshold`, `minReplicaCount`, and `maxReplicaCount` as needed.
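Put together, a ScaledObject along these lines implements that behavior; the Prometheus server address, target Deployment name, and polling interval below are assumptions, and `kubernetes/keda-scaledobject.yaml` holds the actual values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: swin-tiny-app
spec:
  scaleTargetRef:
    name: swin-tiny-app                # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 100
  pollingInterval: 30                  # seconds; assumption
  triggers:
    - type: prometheus
      metadata:
        # Default in-cluster service name for a kube-prometheus-stack release
        # called "prometheus" in the default namespace; adjust if different.
        serverAddress: http://prometheus-kube-prometheus-prometheus.default.svc:9090
        threshold: "0.020"             # target average request duration in seconds
        query: |
          avg(
            rate(http_request_duration_seconds_sum[2m])
            /
            clamp_min(rate(http_request_duration_seconds_count[2m]), 1)
          )
```

KEDA manages an HPA under the hood, so the usual HPA stabilization behavior mentioned in the troubleshooting notes also applies here.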
Pack replicas onto the same node so nodes can drain and scale down faster when load drops, while keeping latency steady. Use a strongly weighted preferred podAffinity (preferred, not required) to co-locate replicas on the same hostname:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: swin-tiny-app

A simple async load test script is provided at `app/test.py`.
# Edit API_URL in app/test.py to your app LoadBalancer DNS
uv run python app/test.py 50 5  # 50 concurrent requests, 5 rounds

- Region/cluster/VPC names can be adjusted in `terraform/variables.tf`
- Node group instance type in `terraform/eks.tf` (`g4dn.xlarge` by default)
- App image in `kubernetes/deployment.yaml`
- Service type and ports in `kubernetes/service.yaml`
- Prometheus access: Terraform opens port 9090 to the world by default in `terraform/security_group_rules.tf` (tighten for production)
# Remove Helm releases (optional order)
helm uninstall keda
helm uninstall prometheus
helm -n gpu-operator uninstall <gpu-operator-release-name>
helm -n kube-system uninstall cluster-autoscaler
# Delete k8s app resources
kubectl delete -f kubernetes/
# Destroy infrastructure
cd terraform
terraform destroy

- Pods stuck Pending: ensure the GPU Operator is installed and nodes are GPU-capable; check `nvidia.com/gpu` resource requests.
- Prometheus not scraping the app: confirm the `ServiceMonitor` `release` label matches your Prometheus release name (`prometheus`) and the `Service` labels/ports match the monitor selector.
- Scaling not triggered: verify the Prometheus query returns results in the Prometheus UI and that KEDA can reach the Prometheus service address configured in the `ScaledObject`.
- Slow scale-down: Kubernetes HPA/KEDA stabilization windows and Cluster Autoscaler timings can delay downscaling.