Autoscaling with KEDA
=====================

This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. With the vLLM Production Stack Helm chart (v0.1.9+), KEDA autoscaling is integrated directly into the chart, allowing you to enable it through simple ``values.yaml`` configuration.

Table of Contents
-----------------

- Prerequisites_
- Steps_

  - `1. Deploy the Observability Stack`_
  - `2. Configure and Deploy vLLM`_
  - `3. Install KEDA`_
  - `4. Enable KEDA Autoscaling for vLLM`_
  - `5. Verify KEDA ScaledObject Creation`_
  - `6. Test Autoscaling`_
  - `7. Advanced Configuration`_
  - `8. Cleanup`_

- `Additional Resources`_

Prerequisites
-------------

- A working vLLM deployment on Kubernetes (see :doc:`../getting_started/quickstart`)
- Access to a Kubernetes cluster with at least 2 GPUs
- ``kubectl`` and ``helm`` installed (v3.0+)
- Basic understanding of Kubernetes and Prometheus metrics
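
Before starting, you can quickly confirm the tooling and GPU capacity. A minimal sketch, assuming the NVIDIA device plugin exposes GPUs as ``nvidia.com/gpu``:

.. code-block:: bash

   # Confirm client tooling is available
   kubectl version --client
   helm version

   # Confirm the cluster exposes at least 2 GPUs
   kubectl describe nodes | grep -i "nvidia.com/gpu"
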

Steps
-----

1. Deploy the Observability Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The observability stack (Prometheus, Grafana) is required for KEDA to query metrics.

.. code-block:: bash

   cd observability
   bash install.sh

Verify Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``:

.. code-block:: bash

   kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090

In a separate terminal:

.. code-block:: bash

   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
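
If the query returns an empty result, first confirm that the monitoring components themselves are healthy. A minimal check, assuming the observability stack was installed into the ``monitoring`` namespace (as the Prometheus service address used later in this tutorial suggests):

.. code-block:: bash

   # Check that the Prometheus and Grafana pods are running
   kubectl get pods -n monitoring

   # List the ServiceMonitor objects that tell Prometheus what to scrape
   kubectl get servicemonitors -n monitoring
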
2. Configure and Deploy vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a ``values.yaml`` file to deploy vLLM. Note that we'll enable KEDA autoscaling in a later step after KEDA is installed:

.. code-block:: yaml

   servingEngineSpec:
     enableEngine: true
     modelSpec:
       - name: "llama3"
         repository: "lmcache/vllm-openai"
         tag: "latest"
         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
         replicaCount: 1
         requestCPU: 10
         requestMemory: "64Gi"
         requestGPU: 1

Deploy the chart:

.. code-block:: bash

   helm install vllm vllm/vllm-stack -f values.yaml
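
If the install fails because the ``vllm`` chart repository is not configured on your machine, add it first and retry. The repository URL below is the one used in the quickstart; adjust it if your environment mirrors charts elsewhere:

.. code-block:: bash

   # Add and refresh the production-stack Helm repository (skip if already added)
   helm repo add vllm https://vllm-project.github.io/production-stack
   helm repo update
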
Wait for the vLLM deployment to be ready and verify that metrics are being exposed:

.. code-block:: bash

   kubectl wait --for=condition=ready pod -l model=llama3 --timeout=300s

Verify Prometheus is scraping the vLLM metrics:

.. code-block:: bash

   kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090

In a separate terminal:

.. code-block:: bash

   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'

Example output:

.. code-block:: json

   {
     "status": "success",
     "data": {
       "result": [
         {
           "metric": {
             "__name__": "vllm:num_requests_waiting",
             "pod": "vllm-llama3-deployment-vllm-xxxxx"
           },
           "value": [ 1749077215.034, "0" ]
         }
       ]
     }
   }

This means that at the given timestamp, there were 0 pending requests in the queue.

3. Install KEDA
~~~~~~~~~~~~~~~

Now that vLLM is running and exposing metrics, install KEDA to enable autoscaling:

.. code-block:: bash

   kubectl create namespace keda
   helm repo add kedacore https://kedacore.github.io/charts
   helm repo update
   helm install keda kedacore/keda --namespace keda

Verify KEDA is running:

.. code-block:: bash

   kubectl get pods -n keda
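
To block until every KEDA component is up, you can wait on the pods directly. The pod names in the comment are illustrative and vary by chart version:

.. code-block:: bash

   # Expect keda-operator, keda-operator-metrics-apiserver, and
   # keda-admission-webhooks pods to reach the Ready state
   kubectl wait --for=condition=ready pod --all -n keda --timeout=120s
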
4. Enable KEDA Autoscaling for vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update your ``values.yaml`` file to enable KEDA autoscaling:

.. code-block:: yaml

   servingEngineSpec:
     enableEngine: true
     modelSpec:
       - name: "llama3"
         repository: "lmcache/vllm-openai"
         tag: "latest"
         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
         replicaCount: 1
         requestCPU: 10
         requestMemory: "64Gi"
         requestGPU: 1
         # Enable KEDA autoscaling
         keda:
           enabled: true
           minReplicaCount: 1
           maxReplicaCount: 3
           pollingInterval: 15
           cooldownPeriod: 360
           triggers:
             - type: prometheus
               metadata:
                 serverAddress: http://prometheus-operated.monitoring.svc:9090
                 metricName: vllm:num_requests_waiting
                 query: vllm:num_requests_waiting
                 threshold: '5'

.. note::

   The query above is not model-specific. In a multi-model deployment it would scale on the aggregate queue length of all models; filter it by model name (for example, ``vllm:num_requests_waiting{model="llama3"}``) so each model scales on its own queue.
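
Before upgrading, you can optionally render the chart locally to inspect the ``ScaledObject`` it will generate. This is only a sanity check; the ``grep`` context window is arbitrary and the output depends on your release and model names:

.. code-block:: bash

   # Render the chart and show the generated ScaledObject manifest
   helm template vllm vllm/vllm-stack -f values.yaml | grep -A 30 "kind: ScaledObject"
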
Upgrade the chart to enable KEDA autoscaling:

.. code-block:: bash

   helm upgrade vllm vllm/vllm-stack -f values.yaml

This configuration tells KEDA to:

- Monitor the ``vllm:num_requests_waiting`` metric from Prometheus
- Maintain between 1 and 3 replicas
- Scale up when the queue exceeds 5 pending requests
- Check metrics every 15 seconds
- Wait 360 seconds (the ``cooldownPeriod``) after load subsides before scaling back down

5. Verify KEDA ScaledObject Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check that the Helm chart created the ScaledObject resource:

.. code-block:: bash

   kubectl get scaledobjects

You should see:

.. code-block:: text

   NAME                       SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
   vllm-llama3-scaledobject   apps/v1.Deployment   vllm-llama3-deployment-vllm   1     3     prometheus                    True    False    Unknown    Unknown   30s
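
For more detail on the configured triggers and recent scaling events, describe the object (the name is taken from the output above):

.. code-block:: bash

   kubectl describe scaledobject vllm-llama3-scaledobject
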
View the created HPA:

.. code-block:: bash

   kubectl get hpa

Expected output:

.. code-block:: text

   NAME                                REFERENCE                                TARGETS     MINPODS   MAXPODS   REPLICAS
   keda-hpa-vllm-llama3-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         3         1

``TARGETS`` shows the current metric value vs. the target threshold: ``0/5 (avg)`` means the current value of ``vllm:num_requests_waiting`` is 0 and the threshold is 5.

6. Test Autoscaling
~~~~~~~~~~~~~~~~~~~

Watch the HPA in real time:

.. code-block:: bash

   kubectl get hpa -n default -w

Generate load to trigger autoscaling. Port-forward to the router service:

.. code-block:: bash

   kubectl port-forward svc/vllm-router-service 30080:80

In a separate terminal, run a load generator:

.. code-block:: bash

   python3 tutorials/assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000

Within a few minutes, you should see the ``REPLICAS`` value increase as KEDA scales up to handle the load.
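
You can also watch the pods directly, or re-run the Prometheus query from step 2 (with the port-forward to Prometheus still active) to see the queue length climb. The label selector matches the one used earlier in this tutorial:

.. code-block:: bash

   # In one terminal: watch vLLM pods being added and removed
   kubectl get pods -l model=llama3 -w

   # In another: check the queue length that KEDA is acting on
   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'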

7. Advanced Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

Scale-to-Zero
^^^^^^^^^^^^^

Enable scale-to-zero by setting ``minReplicaCount: 0`` and adding a traffic-based keepalive trigger:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 0   # Allow scaling to zero
     maxReplicaCount: 5
     triggers:
       # Queue-based scaling
       - type: prometheus
         metadata:
           serverAddress: http://prometheus-operated.monitoring.svc:9090
           metricName: vllm:num_requests_waiting
           query: vllm:num_requests_waiting
           threshold: '5'
       # Traffic-based keepalive (prevents scale-to-zero when traffic exists)
       - type: prometheus
         metadata:
           serverAddress: http://prometheus-operated.monitoring.svc:9090
           metricName: vllm:incoming_keepalive
           query: sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)
           threshold: "1"
Custom HPA Behavior
^^^^^^^^^^^^^^^^^^^

Control scaling behavior with custom HPA policies:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 1
     maxReplicaCount: 5
     advanced:
       horizontalPodAutoscalerConfig:
         behavior:
           scaleDown:
             stabilizationWindowSeconds: 300
             policies:
               - type: Percent
                 value: 50
                 periodSeconds: 60
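
With this behavior block, the HPA waits for five minutes of consistently low load (the stabilization window) before scaling down, and then removes at most 50% of the current replicas per 60-second period, which helps avoid replica thrashing under bursty traffic.
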
Fallback Configuration
^^^^^^^^^^^^^^^^^^^^^^

Configure fallback behavior when metrics are unavailable:

.. code-block:: yaml

   keda:
     enabled: true
     fallback:
       failureThreshold: 3
       replicas: 2
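
With this fallback, if the Prometheus trigger fails three consecutive polls, KEDA temporarily holds the deployment at 2 replicas until the metric can be read again.
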
For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_.

8. Cleanup
~~~~~~~~~~

To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:

.. code-block:: bash

   helm upgrade vllm vllm/vllm-stack -f values.yaml

To completely remove KEDA from the cluster:

.. code-block:: bash

   helm uninstall keda -n keda
   kubectl delete namespace keda

To remove the observability stack:

.. code-block:: bash

   cd observability
   bash uninstall.sh

Additional Resources
--------------------

- `KEDA Documentation <https://keda.sh/docs/>`_
- `KEDA ScaledObject Specification <https://keda.sh/docs/2.18/reference/scaledobject-spec/>`_
- `Helm Chart KEDA Configuration <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_