KEDA Autoscaling
================

This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. With the vLLM Production Stack Helm chart (v0.1.9+), KEDA autoscaling is integrated directly into the chart, allowing you to enable it through simple ``values.yaml`` configuration. You'll configure KEDA to monitor queue length and dynamically adjust the number of replicas based on load.

Table of Contents
-----------------

- Prerequisites_
- Steps_

  - `1. Install KEDA`_
  - `2. Deploy the Observability Stack`_
  - `3. Configure and Deploy vLLM with KEDA`_
  - `4. Verify KEDA ScaledObject Creation`_
  - `5. Test Autoscaling`_
  - `6. Advanced Configuration`_
  - `7. Cleanup`_

- `Additional Resources`_

Prerequisites
-------------

- Access to a Kubernetes cluster with at least 2 GPUs
- ``kubectl`` and ``helm`` installed (v3.0+)
- Basic understanding of Kubernetes and Prometheus metrics
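
As a quick sanity check of these prerequisites, you can verify the tooling and GPU capacity from the command line. This is a minimal sketch; the GPU column assumes the NVIDIA device plugin, which exposes the ``nvidia.com/gpu`` resource on each node:

.. code-block:: bash

   # Confirm client tooling is installed
   kubectl version --client
   helm version

   # List allocatable GPUs per node (requires the NVIDIA device plugin)
   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
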
Steps
-----

1. Install KEDA
~~~~~~~~~~~~~~~

KEDA must be installed in your cluster before enabling autoscaling in the vLLM chart.
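
For example, you can install it with Helm. These are KEDA's standard Helm installation commands; installing into a dedicated ``keda`` namespace is a common convention:

.. code-block:: bash

   # Add the official KEDA chart repository and install the operator
   helm repo add kedacore https://kedacore.github.io/charts
   helm repo update
   helm install keda kedacore/keda --namespace keda --create-namespace

   # Verify the KEDA operator pods are running
   kubectl get pods -n keda
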
6. Advanced Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

Control scaling behavior with custom HPA policies:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 1
     maxReplicaCount: 5
     advanced:
       horizontalPodAutoscalerConfig:
         behavior:
           scaleDown:
             stabilizationWindowSeconds: 300
             policies:
               - type: Percent
                 value: 50
                 periodSeconds: 60

With this configuration, the HPA waits 300 seconds before acting on a lower metric value and then removes at most 50% of the running replicas per 60-second period, which prevents replica flapping under bursty load.
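
After updating ``values.yaml``, apply the change with a Helm upgrade. A minimal sketch, assuming a hypothetical release named ``vllm`` installed from the production-stack chart:

.. code-block:: bash

   # Roll out the new KEDA settings
   helm upgrade vllm vllm/vllm-stack -f values.yaml

   # Confirm the generated ScaledObject reflects the new replica bounds
   kubectl get scaledobject
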
Fallback Configuration
^^^^^^^^^^^^^^^^^^^^^^

Configure fallback behavior when metrics are unavailable:

.. code-block:: yaml

   keda:
     enabled: true
     fallback:
       failureThreshold: 3
       replicas: 2

With this configuration, if KEDA fails to retrieve metrics three consecutive times, it scales the deployment to a fixed 2 replicas until the metric source recovers.
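
Whether fallback is currently active is reported in the ScaledObject's status; for example (the ScaledObject name below is hypothetical):

.. code-block:: bash

   # Inspect the ScaledObject status and events; look for the Fallback
   # condition and health entries recording failed metric retrievals
   kubectl describe scaledobject vllm-scaledobject
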
For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_.

7. Cleanup
~~~~~~~~~~

To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:
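
A minimal sketch, again assuming a hypothetical release named ``vllm``:

.. code-block:: bash

   # values.yaml now contains:
   #   keda:
   #     enabled: false
   helm upgrade vllm vllm/vllm-stack -f values.yaml

   # Optionally remove the KEDA operator itself
   helm uninstall keda --namespace keda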