# Tutorial: Autoscale Your vLLM Deployment with KEDA

## Introduction

This tutorial shows you how to automatically scale a vLLM deployment using [KEDA](https://keda.sh/) and Prometheus-based metrics. You'll configure KEDA to monitor the request queue length and dynamically adjust the number of replicas based on load.

## Table of Contents

* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [Steps](#steps)

  * [1. Install the vLLM Production Stack](#1-install-the-vllm-production-stack)
  * [2. Deploy the Observability Stack](#2-deploy-the-observability-stack)
  * [3. Install KEDA](#3-install-keda)
  * [4. Verify Metric Export](#4-verify-metric-export)
  * [5. Configure the ScaledObject](#5-configure-the-scaledobject)
  * [6. Test Autoscaling](#6-test-autoscaling)
  * [7. Cleanup](#7-cleanup)
* [Additional Resources](#additional-resources)

---

## Prerequisites

* A working vLLM deployment on Kubernetes (see [01-minimal-helm-installation](01-minimal-helm-installation.md))
* Access to a Kubernetes cluster with at least 2 GPUs
* `kubectl` and `helm` installed
* Basic understanding of Kubernetes and Prometheus metrics

---

## Steps

### 1. Install the vLLM Production Stack

Install the production stack with a single vLLM pod by following the instructions in [02-basic-vllm-config.md](02-basic-vllm-config.md).

---

### 2. Deploy the Observability Stack

This stack includes Prometheus, Grafana, and the necessary exporters.

```bash
cd observability
bash install.sh
```
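
Once the script finishes, you can sanity-check that the observability pods came up. This assumes the stack installs into the `monitoring` namespace, which the port-forward in step 4 also expects:

```bash
# Prometheus and Grafana pods should reach the Running state
kubectl get pods -n monitoring
```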

---

### 3. Install KEDA

```bash
kubectl create namespace keda
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda
```
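
Before moving on, confirm that the KEDA operator pods are running:

```bash
# The KEDA operator and metrics API server should both be Running
kubectl get pods -n keda
```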

---

### 4. Verify Metric Export

Check that Prometheus is scraping the queue-length metric `vllm:num_requests_waiting`:

```bash
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
```

In a separate terminal:

```bash
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
```

Example output:

```json
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "__name__": "vllm:num_requests_waiting",
          "pod": "vllm-llama3-deployment-vllm-xxxxx"
        },
        "value": [ 1749077215.034, "0" ]
      }
    ]
  }
}
```

This means that at the given timestamp, there were 0 pending requests in the queue.
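
If you want to watch this value change while you generate load later in the tutorial, you can poll the same endpoint in a loop (a simple sketch; it assumes the port-forward above is still running):

```bash
# Poll the queue-length metric every 5 seconds (Ctrl-C to stop)
while true; do
  curl -s -G 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=vllm:num_requests_waiting'
  echo
  sleep 5
done
```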

---

### 5. Configure the ScaledObject

The following `ScaledObject` configuration is provided in `tutorials/assets/values-19-keda.yaml`. Review its contents:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment-vllm
  minReplicaCount: 1
  maxReplicaCount: 2
  pollingInterval: 15
  cooldownPeriod: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring.svc:9090
      metricName: vllm:num_requests_waiting
      query: vllm:num_requests_waiting
      threshold: '5'
```

Apply the ScaledObject:

```bash
cd ../tutorials
kubectl apply -f assets/values-19-keda.yaml
```

This tells KEDA to:

* Monitor `vllm:num_requests_waiting`
* Scale between 1 and 2 replicas
* Scale up when the queue exceeds 5 requests
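
Under the hood, KEDA creates and manages a HorizontalPodAutoscaler for the target deployment. To confirm the ScaledObject was accepted and its HPA exists (the HPA name follows KEDA's `keda-hpa-<scaledobject-name>` convention, matching the output in step 6):

```bash
# READY/ACTIVE columns indicate whether the Prometheus trigger is resolving
kubectl get scaledobject vllm-scaledobject -n default

# The generated HPA that KEDA manages on your behalf
kubectl get hpa keda-hpa-vllm-scaledobject -n default
```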
| 138 | + |
| 139 | +--- |
| 140 | + |
| 141 | +### 6. Test Autoscaling |
| 142 | + |
| 143 | +Watch the deployment: |
| 144 | + |
| 145 | +```bash |
| 146 | +kubectl get hpa -n default -w |
| 147 | +``` |
| 148 | + |
| 149 | +You should initially see: |
| 150 | + |
| 151 | +```plaintext |
| 152 | +NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS |
| 153 | +keda-hpa-vllm-scaledobject Deployment/vllm-llama3-deployment-vllm 0/5 (avg) 1 2 1 |
| 154 | +``` |
| 155 | + |
| 156 | +`TARGETS` shows the current metric value vs. the target threshold. |
| 157 | +`0/5 (avg)` means the current value of `vllm:num_requests_waiting` is 0, and the threshold is 5. |
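
If `TARGETS` shows `<unknown>` instead, the trigger is likely not resolving; describing the HPA surfaces the external metric and recent events (a troubleshooting aid, not a required step):

```bash
# Shows the external metric KEDA exposes and any scaling events or errors
kubectl describe hpa keda-hpa-vllm-scaledobject -n default
```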

Generate load:

```bash
kubectl port-forward svc/vllm-router-service 30080:80
```

In a separate terminal:

```bash
python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
```

Within a few minutes, the `REPLICAS` value should increase to 2.
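
You can also watch the deployment itself pick up the second replica (the deployment name comes from the `scaleTargetRef` in the ScaledObject above):

```bash
# READY should go from 1/1 to 2/2 as KEDA scales up
kubectl get deployment vllm-llama3-deployment-vllm -n default -w
```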

---

### 7. Cleanup

To remove the KEDA configuration and observability components:

```bash
kubectl delete -f assets/values-19-keda.yaml
helm uninstall keda -n keda
kubectl delete namespace keda

cd ../observability
bash uninstall.sh
```

---

## Additional Resources

* [KEDA Documentation](https://keda.sh/docs/)