Commit be0c250
add scaledobject support
Signed-off-by: eladmotola <eladmotola95@gmail.com>
1 parent 63c056e commit be0c250

File tree: 5 files changed, +324 -90 lines changed
Lines changed: 171 additions & 89 deletions
@@ -1,64 +1,63 @@
Autoscaling with KEDA
=====================

This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. With the vLLM Production Stack Helm chart (v0.1.9+), KEDA autoscaling is integrated directly into the chart, allowing you to enable it through simple ``values.yaml`` configuration.

Table of Contents
-----------------

- Prerequisites_
- Steps_

  - `1. Install KEDA`_
  - `2. Deploy the Observability Stack`_
  - `3. Configure and Deploy vLLM with KEDA`_
  - `4. Verify KEDA ScaledObject Creation`_
  - `5. Test Autoscaling`_
  - `6. Advanced Configuration`_
  - `7. Cleanup`_

- `Additional Resources`_

Prerequisites
-------------

- Access to a Kubernetes cluster with at least 2 GPUs
- ``kubectl`` and ``helm`` installed (v3.0+)
- Basic understanding of Kubernetes and Prometheus metrics

Steps
-----

1. Install KEDA
~~~~~~~~~~~~~~~

KEDA must be installed in your cluster before enabling autoscaling in the vLLM chart.

.. code-block:: bash

   kubectl create namespace keda
   helm repo add kedacore https://kedacore.github.io/charts
   helm repo update
   helm install keda kedacore/keda --namespace keda

Verify KEDA is running:

.. code-block:: bash

   kubectl get pods -n keda
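
If you prefer a one-shot readiness check before proceeding, you can wait on the operator deployment (a convenience sketch; the deployment name ``keda-operator`` assumes the kedacore chart's default):

.. code-block:: bash

   # Block until the KEDA operator deployment reports Available
   # (the name "keda-operator" is the kedacore chart's default)
   kubectl wait --for=condition=Available deployment/keda-operator \
     --namespace keda --timeout=120s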

2. Deploy the Observability Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The observability stack (Prometheus, Grafana) is required for KEDA to query metrics.

.. code-block:: bash

   cd observability
   bash install.sh

Verify Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``:

.. code-block:: bash

@@ -70,115 +69,198 @@ In a separate terminal:
   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
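
Example output (illustrative; your timestamp and pod name will differ):

.. code-block:: json

   {
     "status": "success",
     "data": {
       "result": [
         {
           "metric": {
             "__name__": "vllm:num_requests_waiting",
             "pod": "vllm-llama3-deployment-vllm-xxxxx"
           },
           "value": [ 1749077215.034, "0" ]
         }
       ]
     }
   }

A value of ``"0"`` means there were no pending requests in the queue at the sampled timestamp.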

3. Configure and Deploy vLLM with KEDA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a ``values.yaml`` file with KEDA autoscaling enabled for your model:

.. code-block:: yaml

   servingEngineSpec:
     enableEngine: true
     modelSpec:
       - name: "llama3"
         repository: "lmcache/vllm-openai"
         tag: "latest"
         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
         replicaCount: 1
         requestCPU: 10
         requestMemory: "64Gi"
         requestGPU: 1

         # Enable KEDA autoscaling
         keda:
           enabled: true
           minReplicaCount: 1
           maxReplicaCount: 3
           pollingInterval: 15
           cooldownPeriod: 360
           triggers:
             - type: prometheus
               metadata:
                 serverAddress: http://prometheus-operated.monitoring.svc:9090
                 metricName: vllm:num_requests_waiting
                 query: vllm:num_requests_waiting
                 threshold: '5'

Deploy or upgrade the chart:

.. code-block:: bash

   helm upgrade --install vllm vllm/vllm-stack -f values.yaml

This configuration tells KEDA to:

- Monitor the ``vllm:num_requests_waiting`` metric from Prometheus
- Maintain between 1 and 3 replicas
- Scale up when the queue exceeds 5 pending requests
- Check metrics every 15 seconds
- Wait 360 seconds before scaling down after scaling up
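
Before (or after) deploying, you can render the chart locally to confirm that a ``ScaledObject`` manifest is generated from these values. A quick sketch, assuming the same release name and chart as above:

.. code-block:: bash

   # Render the manifests offline and show the generated ScaledObject
   helm template vllm vllm/vllm-stack -f values.yaml \
     | grep -B 2 -A 20 'kind: ScaledObject'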

4. Verify KEDA ScaledObject Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check that the Helm chart created the ScaledObject resource:

.. code-block:: bash

   kubectl get scaledobjects

You should see:

.. code-block:: text

   NAME                       SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
   vllm-llama3-scaledobject   apps/v1.Deployment   vllm-llama3-deployment-vllm   1     3     prometheus                    True    False    Unknown    Unknown   30s

View the created HPA:

.. code-block:: bash

   kubectl get hpa

Expected output:

.. code-block:: text

   NAME                                REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
   keda-hpa-vllm-llama3-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         3         1
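
If ``READY`` shows ``False``, describing the ScaledObject usually reveals the cause (for example, an unreachable Prometheus ``serverAddress``). Using the resource name from the output above:

.. code-block:: bash

   # Inspect trigger status and recent events for the ScaledObject
   kubectl describe scaledobject vllm-llama3-scaledobject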

5. Test Autoscaling
~~~~~~~~~~~~~~~~~~~

Watch the HPA in real time:

.. code-block:: bash

   kubectl get hpa -n default -w

Generate load to trigger autoscaling. Port-forward to the router service:

.. code-block:: bash

   kubectl port-forward svc/vllm-router-service 30080:80

In a separate terminal, run a load generator:

.. code-block:: bash

   python3 tutorials/assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000

Within a few minutes, you should see the ``REPLICAS`` value increase as KEDA scales up to handle the load.
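
You can also watch the trigger metric itself while the load generator runs, to see it cross the threshold of 5. A small sketch, assuming the Prometheus port-forward from step 2 is still active on ``localhost:9090``:

.. code-block:: bash

   # Poll the aggregate queue depth once per second (Ctrl-C to stop)
   while true; do
     curl -sG 'http://localhost:9090/api/v1/query' \
       --data-urlencode 'query=sum(vllm:num_requests_waiting)' \
       | grep -o '"value":\[[^]]*\]'
     sleep 1
   done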

6. Advanced Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

Scale-to-Zero
^^^^^^^^^^^^^

Enable scale-to-zero by setting ``minReplicaCount: 0`` and adding a traffic-based keepalive trigger:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 0  # Allow scaling to zero
     maxReplicaCount: 5
     triggers:
       # Queue-based scaling
       - type: prometheus
         metadata:
           serverAddress: http://prometheus-operated.monitoring.svc:9090
           metricName: vllm:num_requests_waiting
           query: vllm:num_requests_waiting
           threshold: '5'
       # Traffic-based keepalive (prevents scale-to-zero when traffic exists)
       - type: prometheus
         metadata:
           serverAddress: http://prometheus-operated.monitoring.svc:9090
           metricName: vllm:incoming_keepalive
           query: sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)
           threshold: "1"
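
The keepalive query counts the series whose request rate over the last minute is nonzero, so it evaluates to 0 only when the deployment is completely idle. You can check it by hand against Prometheus (again assuming the port-forward from step 2):

.. code-block:: bash

   # Returns 0 when idle; any nonzero value keeps at least one replica alive
   curl -sG 'http://localhost:9090/api/v1/query' \
     --data-urlencode 'query=sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)'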

Custom HPA Behavior
^^^^^^^^^^^^^^^^^^^

Control scaling behavior with custom HPA policies:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 1
     maxReplicaCount: 5
     advanced:
       horizontalPodAutoscalerConfig:
         behavior:
           scaleDown:
             stabilizationWindowSeconds: 300
             policies:
               - type: Percent
                 value: 50
                 periodSeconds: 60

This example allows the HPA to remove at most 50% of the current replicas per 60-second period, and applies a 300-second stabilization window so brief dips in load do not trigger scale-down.

Fallback Configuration
^^^^^^^^^^^^^^^^^^^^^^

Configure fallback behavior when metrics are unavailable:

.. code-block:: yaml

   keda:
     enabled: true
     fallback:
       failureThreshold: 3
       replicas: 2

With these settings, if the Prometheus trigger fails 3 consecutive polls, KEDA falls back to 2 replicas until the metric becomes available again.

For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_.

7. Cleanup
~~~~~~~~~~

To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:

.. code-block:: bash

   helm upgrade vllm vllm/vllm-stack -f values.yaml
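
After the upgrade, the chart should no longer render the autoscaling resources; you can confirm that the ScaledObject and its HPA are gone:

.. code-block:: bash

   # Neither list should include the llama3 ScaledObject or its HPA anymore
   kubectl get scaledobjects
   kubectl get hpa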

To completely remove KEDA from the cluster:

.. code-block:: bash

   helm uninstall keda -n keda
   kubectl delete namespace keda

To remove the observability stack:

.. code-block:: bash

   cd observability
   bash uninstall.sh

Additional Resources
--------------------

- `KEDA Documentation <https://keda.sh/docs/>`_
- `KEDA ScaledObject Specification <https://keda.sh/docs/2.18/reference/scaledobject-spec/>`_
- `Helm Chart KEDA Configuration <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_

helm/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.8
+version: 0.1.9

 maintainers:
   - name: apostac
