
Commit 9cdee21

Romero027 authored and YuhanLiu11 committed
[Doc, Feat] basic KEDA support and tutorials (#487)
* [Document, Feat] basic KEDA support and tutorials
  Signed-off-by: Xiangfeng Zhu <[email protected]>
* fix precommit
  Signed-off-by: Xiangfeng Zhu <[email protected]>
* minor updates
  Signed-off-by: Xiangfeng Zhu <[email protected]>

---------

Signed-off-by: Xiangfeng Zhu <[email protected]>
Co-authored-by: Yuhan Liu <[email protected]>
Signed-off-by: David Gao <[email protected]>
1 parent 12b9c12 · commit 9cdee21

File tree: 2 files changed, +221 −0 lines changed

tutorials/19-keda-autoscaling.md

Lines changed: 192 additions & 0 deletions
# Tutorial: Autoscale Your vLLM Deployment with KEDA

## Introduction

This tutorial shows you how to automatically scale a vLLM deployment using [KEDA](https://keda.sh/) and Prometheus-based metrics. You'll configure KEDA to monitor queue length and dynamically adjust the number of replicas based on load.

## Table of Contents

* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [Steps](#steps)

  * [1. Install the vLLM Production Stack](#1-install-the-vllm-production-stack)
  * [2. Deploy the Observability Stack](#2-deploy-the-observability-stack)
  * [3. Install KEDA](#3-install-keda)
  * [4. Verify Metric Export](#4-verify-metric-export)
  * [5. Configure the ScaledObject](#5-configure-the-scaledobject)
  * [6. Test Autoscaling](#6-test-autoscaling)
  * [7. Cleanup](#7-cleanup)
* [Additional Resources](#additional-resources)

---

## Prerequisites

* A working vLLM deployment on Kubernetes (see [01-minimal-helm-installation](01-minimal-helm-installation.md))
* Access to a Kubernetes cluster with at least 2 GPUs
* `kubectl` and `helm` installed
* Basic understanding of Kubernetes and Prometheus metrics

---

## Steps

### 1. Install the vLLM Production Stack
Install the production stack using a single pod by following the instructions in [02-basic-vllm-config.md](02-basic-vllm-config.md).
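
If you still need to set it up, the sketch below shows the general shape of a minimal install. The Helm repo URL, chart name, and values file here are assumptions based on this repository's conventions; treat the linked tutorial as authoritative.

```bash
# A minimal sketch, not the authoritative steps -- see 02-basic-vllm-config.md.
# Repo URL, chart name, and values file are assumptions.
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
```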

---

### 2. Deploy the Observability Stack

This stack includes Prometheus, Grafana, and necessary exporters.

```bash
cd observability
bash install.sh
```
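
Before continuing, it is worth confirming that the stack came up. This assumes the install script deploys into the `monitoring` namespace, which the port-forward in step 4 also relies on:

```bash
# All Prometheus and Grafana pods should reach the Running state.
kubectl get pods -n monitoring
```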

---

### 3. Install KEDA

```bash
kubectl create namespace keda
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda
```
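
To confirm the installation, check that the KEDA operator pods are running and that the `ScaledObject` CRD is registered (pod names can vary slightly by KEDA version):

```bash
# Expect keda-operator and related pods in the Running state.
kubectl get pods -n keda

# The ScaledObject custom resource definition should exist.
kubectl get crd scaledobjects.keda.sh
```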

---

### 4. Verify Metric Export

Check that Prometheus is scraping the queue length metric `vllm:num_requests_waiting`.

```bash
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
```

In a separate terminal:

```bash
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
```

Example output:

```json
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "__name__": "vllm:num_requests_waiting",
          "pod": "vllm-llama3-deployment-vllm-xxxxx"
        },
        "value": [ 1749077215.034, "0" ]
      }
    ]
  }
}
```
This means that at the given timestamp, there were 0 pending requests in the queue.
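
Each vLLM pod exports its own series for this metric. If you run more than one replica, an aggregate query (shown here purely for illustration; the tutorial's ScaledObject uses the per-pod metric as-is) gives the total queue depth across the deployment:

```bash
# Illustrative PromQL: total pending requests summed across all vLLM pods.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(vllm:num_requests_waiting)'
```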

---

### 5. Configure the ScaledObject

The following `ScaledObject` configuration is provided in `tutorials/assets/values-19-keda.yaml`. Review its contents:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment-vllm
  minReplicaCount: 1
  maxReplicaCount: 2
  pollingInterval: 15
  cooldownPeriod: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: vllm:num_requests_waiting
        query: vllm:num_requests_waiting
        threshold: '5'
```

Apply the ScaledObject:

```bash
cd ../tutorials
kubectl apply -f assets/values-19-keda.yaml
```

This tells KEDA to:

* Monitor `vllm:num_requests_waiting`
* Scale between 1 and 2 replicas
* Scale up when the queue exceeds 5 requests
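
You can confirm that KEDA accepted the configuration by inspecting the ScaledObject and the HPA that KEDA creates from it (KEDA names the HPA `keda-hpa-<scaledobject-name>`, as the output in the next step shows):

```bash
# READY and ACTIVE columns show whether the Prometheus trigger is healthy.
kubectl get scaledobject vllm-scaledobject -n default

# Inspect the HPA that KEDA manages on your behalf.
kubectl describe hpa keda-hpa-vllm-scaledobject -n default
```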

---

### 6. Test Autoscaling

Watch the HPA that KEDA created for the deployment:

```bash
kubectl get hpa -n default -w
```

You should initially see:

```plaintext
NAME                         REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
keda-hpa-vllm-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         2         1
```

`TARGETS` shows the current metric value vs. the target threshold: `0/5 (avg)` means the current value of `vllm:num_requests_waiting` is 0, and the threshold is 5.

Generate load:

```bash
kubectl port-forward svc/vllm-router-service 30080:80
```

In a separate terminal:

```bash
python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
```
Within a few minutes, the `REPLICAS` value should increase to 2.
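
While the load generator runs, you can watch the scale-up from two angles (the Prometheus query assumes the port-forward from step 4 is still active):

```bash
# Watch a second vLLM pod appear as KEDA scales the deployment.
kubectl get pods -n default -w

# In another terminal: watch the queue metric climb under load.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=vllm:num_requests_waiting'
```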

---

### 7. Cleanup

To remove the KEDA configuration and observability components:

```bash
kubectl delete -f assets/values-19-keda.yaml
helm uninstall keda -n keda
kubectl delete namespace keda

cd ../observability
bash uninstall.sh
```
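
To double-check that nothing was left behind:

```bash
# Expect "No resources found" and a NotFound error, respectively.
kubectl get scaledobjects -n default
kubectl get namespace keda
```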

---

## Additional Resources

* [KEDA Documentation](https://keda.sh/docs/)

tutorials/assets/values-19-keda.yaml

Lines changed: 29 additions & 0 deletions
# KEDA ScaledObject for vLLM deployment
# This configuration enables automatic scaling of vLLM pods based on queue length metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment-vllm
  minReplicaCount: 1
  maxReplicaCount: 2
  # How often KEDA should check the metrics (in seconds)
  pollingInterval: 15
  # How long to wait before scaling down after scaling up (in seconds)
  cooldownPeriod: 30
  # Scaling triggers configuration
  triggers:
    - type: prometheus
      metadata:
        # Prometheus server address within the cluster
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        # Name of the metric to monitor
        metricName: vllm:num_requests_waiting
        # Prometheus query to fetch the metric
        query: vllm:num_requests_waiting
        # Threshold value that triggers scaling
        # When queue length exceeds this value, KEDA will scale up
        threshold: '5'
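
As a sanity check before applying this manifest, a server-side dry run validates it against the ScaledObject CRD that the KEDA installation registered (run from the repository root; adjust the path if needed):

```bash
# Validates against the live cluster's CRDs without creating anything.
kubectl apply --dry-run=server -f tutorials/assets/values-19-keda.yaml
```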
