Commit be0c250
add scaledobject support
Signed-off-by: eladmotola <eladmotola95@gmail.com>
1 parent 63c056e commit be0c250

File tree: 5 files changed, +324 -90 lines changed
Lines changed: 171 additions & 89 deletions
@@ -1,64 +1,63 @@
Autoscaling with KEDA
=====================

This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. With the vLLM Production Stack Helm chart (v0.1.9+), KEDA autoscaling is integrated directly into the chart, allowing you to enable it through simple ``values.yaml`` configuration.

Table of Contents
-----------------

- Prerequisites_
- Steps_

  - `1. Install KEDA`_
  - `2. Deploy the Observability Stack`_
  - `3. Configure and Deploy vLLM with KEDA`_
  - `4. Verify KEDA ScaledObject Creation`_
  - `5. Test Autoscaling`_
  - `6. Advanced Configuration`_
  - `7. Cleanup`_

- `Additional Resources`_

Prerequisites
-------------

- Access to a Kubernetes cluster with at least 2 GPUs
- ``kubectl`` and ``helm`` installed (v3.0+)
- Basic understanding of Kubernetes and Prometheus metrics

Steps
-----

1. Install KEDA
~~~~~~~~~~~~~~~

KEDA must be installed in your cluster before enabling autoscaling in the vLLM chart.

.. code-block:: bash

   kubectl create namespace keda
   helm repo add kedacore https://kedacore.github.io/charts
   helm repo update
   helm install keda kedacore/keda --namespace keda

Verify KEDA is running:

.. code-block:: bash

   kubectl get pods -n keda
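
If you prefer a one-shot readiness check before proceeding, you can wait on the operator deployment (a convenience sketch; the deployment name ``keda-operator`` assumes the kedacore chart's default):

.. code-block:: bash

   # Block until the KEDA operator deployment reports Available
   # (the name "keda-operator" is the kedacore chart's default)
   kubectl wait --for=condition=Available deployment/keda-operator \
     --namespace keda --timeout=120s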

2. Deploy the Observability Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The observability stack (Prometheus, Grafana) is required for KEDA to query metrics.

.. code-block:: bash

   cd observability
   bash install.sh

Verify Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``:

.. code-block:: bash

@@ -70,115 +69,198 @@ In a separate terminal:
   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
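
Example output (illustrative; your timestamp and pod name will differ):

.. code-block:: json

   {
     "status": "success",
     "data": {
       "result": [
         {
           "metric": {
             "__name__": "vllm:num_requests_waiting",
             "pod": "vllm-llama3-deployment-vllm-xxxxx"
           },
           "value": [ 1749077215.034, "0" ]
         }
       ]
     }
   }

A value of ``"0"`` means there were no pending requests in the queue at the sampled timestamp.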

3. Configure and Deploy vLLM with KEDA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a ``values.yaml`` file with KEDA autoscaling enabled for your model:

.. code-block:: yaml

   servingEngineSpec:
     enableEngine: true
     modelSpec:
       - name: "llama3"
         repository: "lmcache/vllm-openai"
         tag: "latest"
         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
         replicaCount: 1
         requestCPU: 10
         requestMemory: "64Gi"
         requestGPU: 1

         # Enable KEDA autoscaling
         keda:
           enabled: true
           minReplicaCount: 1
           maxReplicaCount: 3
           pollingInterval: 15
           cooldownPeriod: 360
           triggers:
             - type: prometheus
               metadata:
                 serverAddress: http://prometheus-operated.monitoring.svc:9090
                 metricName: vllm:num_requests_waiting
                 query: vllm:num_requests_waiting
                 threshold: '5'

Deploy or upgrade the chart:

.. code-block:: bash

   helm upgrade --install vllm vllm/vllm-stack -f values.yaml

This configuration tells KEDA to:

- Monitor the ``vllm:num_requests_waiting`` metric from Prometheus
- Maintain between 1 and 3 replicas
- Scale up when the queue exceeds 5 pending requests
- Check metrics every 15 seconds
- Wait 360 seconds before scaling down after scaling up
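
Before (or after) deploying, you can render the chart locally to confirm that a ``ScaledObject`` manifest is generated from these values. A quick sketch, assuming the same release name and chart as above:

.. code-block:: bash

   # Render the manifests offline and show the generated ScaledObject
   helm template vllm vllm/vllm-stack -f values.yaml \
     | grep -B 2 -A 20 'kind: ScaledObject'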

4. Verify KEDA ScaledObject Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check that the Helm chart created the ScaledObject resource:

.. code-block:: bash

   kubectl get scaledobjects

You should see:

.. code-block:: text

   NAME                       SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
   vllm-llama3-scaledobject   apps/v1.Deployment   vllm-llama3-deployment-vllm   1     3     prometheus                    True    False    Unknown    Unknown   30s

View the created HPA:

.. code-block:: bash

   kubectl get hpa

Expected output:

.. code-block:: text

   NAME                                REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
   keda-hpa-vllm-llama3-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         3         1
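
If ``READY`` shows ``False``, describing the ScaledObject usually reveals the cause (for example, an unreachable Prometheus ``serverAddress``). Using the resource name from the output above:

.. code-block:: bash

   # Inspect trigger status and recent events for the ScaledObject
   kubectl describe scaledobject vllm-llama3-scaledobject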

5. Test Autoscaling
~~~~~~~~~~~~~~~~~~~

Watch the HPA in real time:

.. code-block:: bash

   kubectl get hpa -n default -w

Generate load to trigger autoscaling. Port-forward to the router service:

.. code-block:: bash

   kubectl port-forward svc/vllm-router-service 30080:80

In a separate terminal, run a load generator:

.. code-block:: bash

   python3 tutorials/assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000

Within a few minutes, you should see the ``REPLICAS`` value increase as KEDA scales up to handle the load.
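
You can also watch the trigger metric itself while the load generator runs, to see it cross the threshold of 5. A small sketch, assuming the Prometheus port-forward from step 2 is still active on ``localhost:9090``:

.. code-block:: bash

   # Poll the aggregate queue depth once per second (Ctrl-C to stop)
   while true; do
     curl -sG 'http://localhost:9090/api/v1/query' \
       --data-urlencode 'query=sum(vllm:num_requests_waiting)' \
       | grep -o '"value":\[[^]]*\]'
     sleep 1
   done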

6. Advanced Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

Scale-to-Zero
^^^^^^^^^^^^^

Enable scale-to-zero by setting ``minReplicaCount: 0`` and adding a traffic-based keepalive trigger:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 0  # Allow scaling to zero
     maxReplicaCount: 5
     triggers:
       # Queue-based scaling
       - type: prometheus
         metadata:
           serverAddress: http://prometheus-operated.monitoring.svc:9090
           metricName: vllm:num_requests_waiting
           query: vllm:num_requests_waiting
           threshold: '5'
       # Traffic-based keepalive (prevents scale-to-zero when traffic exists)
       - type: prometheus
         metadata:
           serverAddress: http://prometheus-operated.monitoring.svc:9090
           metricName: vllm:incoming_keepalive
           query: sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)
           threshold: "1"
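
The keepalive query counts the series whose request rate over the last minute is nonzero, so it evaluates to 0 only when the deployment is completely idle. You can check it by hand against Prometheus (again assuming the port-forward from step 2):

.. code-block:: bash

   # Returns 0 when idle; any nonzero value keeps at least one replica alive
   curl -sG 'http://localhost:9090/api/v1/query' \
     --data-urlencode 'query=sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)'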

Custom HPA Behavior
^^^^^^^^^^^^^^^^^^^

Control scaling behavior with custom HPA policies:

.. code-block:: yaml

   keda:
     enabled: true
     minReplicaCount: 1
     maxReplicaCount: 5
     advanced:
       horizontalPodAutoscalerConfig:
         behavior:
           scaleDown:
             stabilizationWindowSeconds: 300
             policies:
               - type: Percent
                 value: 50
                 periodSeconds: 60

This example allows the HPA to remove at most 50% of the current replicas per 60-second period, and applies a 300-second stabilization window so brief dips in load do not trigger scale-down.

Fallback Configuration
^^^^^^^^^^^^^^^^^^^^^^

Configure fallback behavior when metrics are unavailable:

.. code-block:: yaml

   keda:
     enabled: true
     fallback:
       failureThreshold: 3
       replicas: 2

With these settings, if the Prometheus trigger fails 3 consecutive polls, KEDA falls back to 2 replicas until the metric becomes available again.

For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_.

7. Cleanup
~~~~~~~~~~

To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:

.. code-block:: bash

   helm upgrade vllm vllm/vllm-stack -f values.yaml
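
After the upgrade, the chart should no longer render the autoscaling resources; you can confirm that the ScaledObject and its HPA are gone:

.. code-block:: bash

   # Neither list should include the llama3 ScaledObject or its HPA anymore
   kubectl get scaledobjects
   kubectl get hpa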

To completely remove KEDA from the cluster:

.. code-block:: bash

   helm uninstall keda -n keda
   kubectl delete namespace keda

To remove the observability stack:

.. code-block:: bash

   cd observability
   bash uninstall.sh

Additional Resources
--------------------

- `KEDA Documentation <https://keda.sh/docs/>`_
- `KEDA ScaledObject Specification <https://keda.sh/docs/2.18/reference/scaledobject-spec/>`_
- `Helm Chart KEDA Configuration <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_

helm/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.8
+version: 0.1.9

 maintainers:
   - name: apostac
