diff --git a/docs/source/use_cases/autoscaling-keda.rst b/docs/source/use_cases/autoscaling-keda.rst
index 1515acf17..624327cc5 100644
--- a/docs/source/use_cases/autoscaling-keda.rst
+++ b/docs/source/use_cases/autoscaling-keda.rst
@@ -1,7 +1,7 @@
Autoscaling with KEDA
=====================
-This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. You'll configure KEDA to monitor queue length and dynamically adjust the number of replicas based on load.
+This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. As of v0.1.9, the vLLM Production Stack Helm chart integrates KEDA autoscaling directly, so you can enable it with a few lines of ``values.yaml`` configuration.
Table of Contents
-----------------
@@ -9,56 +9,81 @@ Table of Contents
- Prerequisites_
- Steps_
- - `1. Install the vLLM Production Stack`_
- - `2. Deploy the Observability Stack`_
+ - `1. Deploy the Observability Stack`_
+ - `2. Configure and Deploy vLLM`_
- `3. Install KEDA`_
- - `4. Verify Metric Export`_
- - `5. Configure the ScaledObject`_
+ - `4. Enable KEDA Autoscaling for vLLM`_
+ - `5. Verify KEDA ScaledObject Creation`_
- `6. Test Autoscaling`_
- - `7. Cleanup`_
+ - `7. Advanced Configuration`_
+ - `8. Cleanup`_
- `Additional Resources`_
Prerequisites
-------------
-- A working vLLM deployment on Kubernetes (see :doc:`../getting_started/quickstart`)
- Access to a Kubernetes cluster with at least 2 GPUs
-- ``kubectl`` and ``helm`` installed
+- ``kubectl`` and ``helm`` (v3.0+) installed
- Basic understanding of Kubernetes and Prometheus metrics
Steps
-----
-1. Install the vLLM Production Stack
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Install the production stack using a single pod by following the instructions in :doc:`../deployment/helm`.
-
-2. Deploy the Observability Stack
+1. Deploy the Observability Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This stack includes Prometheus, Grafana, and necessary exporters.
+The observability stack (Prometheus, Grafana) is required for KEDA to query metrics.
.. code-block:: bash
cd observability
bash install.sh
-3. Install KEDA
-~~~~~~~~~~~~~~~
+Confirm that the Prometheus query API is reachable. KEDA will later use it to read the queue length metric ``vllm:num_requests_waiting`` (the query below returns an empty result until vLLM is deployed in the next step):
.. code-block:: bash
- kubectl create namespace keda
- helm repo add kedacore https://kedacore.github.io/charts
- helm repo update
- helm install keda kedacore/keda --namespace keda
+ kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
+
+In a separate terminal:
+
+.. code-block:: bash
+
+ curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
+
+2. Configure and Deploy vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Create a ``values.yaml`` file to deploy vLLM. KEDA autoscaling is enabled in a later step, once KEDA itself is installed:
+
+.. code-block:: yaml
+
+ servingEngineSpec:
+ enableEngine: true
+ modelSpec:
+ - name: "llama3"
+ repository: "lmcache/vllm-openai"
+ tag: "latest"
+ modelURL: "meta-llama/Llama-3.1-8B-Instruct"
+ replicaCount: 1
+ requestCPU: 10
+ requestMemory: "64Gi"
+ requestGPU: 1
+
+Add the vLLM production-stack Helm repository (if you have not already) and deploy the chart:
+
+.. code-block:: bash
+
+ helm repo add vllm https://vllm-project.github.io/production-stack
+ helm install vllm vllm/vllm-stack -f values.yaml
-4. Verify Metric Export
-~~~~~~~~~~~~~~~~~~~~~~~
+Wait for the vLLM deployment to become ready:
-Check that Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``.
+.. code-block:: bash
+
+ kubectl wait --for=condition=ready pod -l model=llama3 --timeout=300s
+
+Verify Prometheus is scraping the vLLM metrics:
.. code-block:: bash
@@ -70,115 +95,216 @@ In a separate terminal:
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
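+
+Example output:
+
+.. code-block:: json
+
+ {
+   "status": "success",
+   "data": {
+     "result": [
+       {
+         "metric": {
+           "__name__": "vllm:num_requests_waiting",
+           "pod": "vllm-llama3-deployment-vllm-xxxxx"
+         },
+         "value": [ 1749077215.034, "0" ]
+       }
+     ]
+   }
+ }
+
+This means that at the given timestamp, there were 0 pending requests in the queue.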
-Example output:
+3. Install KEDA
+~~~~~~~~~~~~~~~
-.. code-block:: json
+Now that vLLM is running and exposing metrics, install KEDA to enable autoscaling:
- {
- "status": "success",
- "data": {
- "result": [
- {
- "metric": {
- "__name__": "vllm:num_requests_waiting",
- "pod": "vllm-llama3-deployment-vllm-xxxxx"
- },
- "value": [ 1749077215.034, "0" ]
- }
- ]
- }
- }
+.. code-block:: bash
-This means that at the given timestamp, there were 0 pending requests in the queue.
+ kubectl create namespace keda
+ helm repo add kedacore https://kedacore.github.io/charts
+ helm repo update
+ helm install keda kedacore/keda --namespace keda
-5. Configure the ScaledObject
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Verify KEDA is running:
+
+.. code-block:: bash
-The following ``ScaledObject`` configuration is provided in ``tutorials/assets/values-19-keda.yaml``. Review its contents:
+ kubectl get pods -n keda
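+
+You should see the KEDA operator pods in ``Running`` state; with recent KEDA versions these are typically ``keda-operator``, ``keda-operator-metrics-apiserver``, and ``keda-admission-webhooks``.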
+
+4. Enable KEDA Autoscaling for vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Update your ``values.yaml`` file to enable KEDA autoscaling:
.. code-block:: yaml
- apiVersion: keda.sh/v1alpha1
- kind: ScaledObject
- metadata:
- name: vllm-scaledobject
- namespace: default
- spec:
- scaleTargetRef:
- name: vllm-llama3-deployment-vllm
- minReplicaCount: 1
- maxReplicaCount: 2
- pollingInterval: 15
- cooldownPeriod: 30
- triggers:
- - type: prometheus
- metadata:
- serverAddress: http://prometheus-operated.monitoring.svc:9090
- metricName: vllm:num_requests_waiting
- query: vllm:num_requests_waiting
- threshold: '5'
+ servingEngineSpec:
+ enableEngine: true
+ modelSpec:
+ - name: "llama3"
+ repository: "lmcache/vllm-openai"
+ tag: "latest"
+ modelURL: "meta-llama/Llama-3.1-8B-Instruct"
+ replicaCount: 1
+ requestCPU: 10
+ requestMemory: "64Gi"
+ requestGPU: 1
+
+ # Enable KEDA autoscaling
+ keda:
+ enabled: true
+ minReplicaCount: 1
+ maxReplicaCount: 3
+ pollingInterval: 15
+ cooldownPeriod: 360
+ triggers:
+ - type: prometheus
+ metadata:
+ serverAddress: http://prometheus-operated.monitoring.svc:9090
+ metricName: vllm:num_requests_waiting
+ query: vllm:num_requests_waiting
+ threshold: '5'
+
+Upgrade the chart to enable KEDA autoscaling:
+
+.. code-block:: bash
-Apply the ScaledObject:
+ helm upgrade vllm vllm/vllm-stack -f values.yaml
+
+This configuration tells KEDA to:
+
+- Monitor the ``vllm:num_requests_waiting`` metric from Prometheus
+- Maintain between 1 and 3 replicas
+- Scale up when the queue exceeds 5 pending requests
+- Check metrics every 15 seconds
+- Wait 360 seconds after the last trigger activity before scaling down (the ``cooldownPeriod``)
+
+5. Verify KEDA ScaledObject Creation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check that the Helm chart created the ScaledObject resource:
.. code-block:: bash
- cd ../tutorials
- kubectl apply -f assets/values-19-keda.yaml
+ kubectl get scaledobjects
-This tells KEDA to:
+You should see:
-- Monitor ``vllm:num_requests_waiting``
-- Scale between 1 and 2 replicas
-- Scale up when the queue exceeds 5 requests
+.. code-block:: text
-6. Test Autoscaling
-~~~~~~~~~~~~~~~~~~~
+ NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
+ vllm-llama3-scaledobject apps/v1.Deployment vllm-llama3-deployment-vllm 1 3 prometheus True False Unknown Unknown 30s
-Watch the deployment:
+View the created HPA:
.. code-block:: bash
- kubectl get hpa -n default -w
+ kubectl get hpa
-You should initially see:
+Expected output:
.. code-block:: text
- NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
- keda-hpa-vllm-scaledobject Deployment/vllm-llama3-deployment-vllm 0/5 (avg) 1 2 1
+ NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
+ keda-hpa-vllm-llama3-scaledobject Deployment/vllm-llama3-deployment-vllm 0/5 (avg) 1 3 1
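+
+``TARGETS`` shows the current metric value vs. the target threshold: ``0/5 (avg)`` means the current value of ``vllm:num_requests_waiting`` is 0 and the threshold is 5.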
+
+6. Test Autoscaling
+~~~~~~~~~~~~~~~~~~~
+
+Watch the HPA in real time:
-``TARGETS`` shows the current metric value vs. the target threshold.
-``0/5 (avg)`` means the current value of ``vllm:num_requests_waiting`` is 0, and the threshold is 5.
+.. code-block:: bash
+
+ kubectl get hpa -n default -w
-Generate load:
+Generate load to trigger autoscaling. Port-forward to the router service:
.. code-block:: bash
kubectl port-forward svc/vllm-router-service 30080:80
-In a separate terminal:
+In a separate terminal, run a load generator:
.. code-block:: bash
- python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
+ python3 tutorials/assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
+
+Within a few minutes, you should see the ``REPLICAS`` value increase as KEDA scales up to handle the load.
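+
+You can also watch the pods directly. The label selector below assumes the ``llama3`` model name from the example ``values.yaml``:
+
+.. code-block:: bash
+
+ kubectl get pods -l model=llama3 -w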
+
+7. Advanced Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Scale-to-Zero
+^^^^^^^^^^^^^
-Within a few minutes, the ``REPLICAS`` value should increase to 2.
+Enable scale-to-zero by setting ``minReplicaCount: 0`` and adding a traffic-based keepalive trigger:
-7. Cleanup
+.. code-block:: yaml
+
+ keda:
+ enabled: true
+ minReplicaCount: 0 # Allow scaling to zero
+ maxReplicaCount: 5
+ triggers:
+ # Queue-based scaling
+ - type: prometheus
+ metadata:
+ serverAddress: http://prometheus-operated.monitoring.svc:9090
+ metricName: vllm:num_requests_waiting
+ query: vllm:num_requests_waiting
+ threshold: '5'
+ # Traffic-based keepalive (prevents scale-to-zero when traffic exists)
+ - type: prometheus
+ metadata:
+ serverAddress: http://prometheus-operated.monitoring.svc:9090
+ metricName: vllm:incoming_keepalive
+ query: sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)
+ threshold: "1"
+
+Custom HPA Behavior
+^^^^^^^^^^^^^^^^^^^
+
+Control scaling behavior with custom HPA policies. The example below makes the KEDA-managed HPA wait 300 seconds of stable metrics before scaling down, then remove at most 50% of the current replicas per 60-second period:
+
+.. code-block:: yaml
+
+ keda:
+ enabled: true
+ minReplicaCount: 1
+ maxReplicaCount: 5
+ advanced:
+ horizontalPodAutoscalerConfig:
+ behavior:
+ scaleDown:
+ stabilizationWindowSeconds: 300
+ policies:
+ - type: Percent
+ value: 50
+ periodSeconds: 60
+
+Fallback Configuration
+^^^^^^^^^^^^^^^^^^^^^^
+
+Configure fallback behavior for when metrics are unavailable. If the scaler fails ``failureThreshold`` consecutive times, KEDA holds the deployment at the specified ``replicas`` count until the metric source recovers:
+
+.. code-block:: yaml
+
+ keda:
+ enabled: true
+ fallback:
+ failureThreshold: 3
+ replicas: 2
+
+For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md>`_.
+
+8. Cleanup
~~~~~~~~~~
-To remove KEDA configuration and observability components:
+To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:
+
+.. code-block:: bash
+
+ helm upgrade vllm vllm/vllm-stack -f values.yaml
+
+To completely remove KEDA from the cluster:
.. code-block:: bash
- kubectl delete -f assets/values-19-keda.yaml
helm uninstall keda -n keda
kubectl delete namespace keda
- cd ../observability
+To remove the observability stack:
+
+.. code-block:: bash
+
+ cd observability
bash uninstall.sh
Additional Resources
--------------------
- `KEDA Documentation <https://keda.sh/docs/>`_
+- `KEDA ScaledObject Specification <https://keda.sh/docs/latest/reference/scaledobject-spec/>`_
+- `Helm Chart KEDA Configuration <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 937ab7782..9592df114 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -15,7 +15,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.8
+version: 0.1.9
maintainers:
- name: apostac
diff --git a/helm/README.md b/helm/README.md
index 20188ba66..798c61ee3 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -152,6 +152,39 @@ This table documents all available configuration values for the Production Stack
| `servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerPort` | string | `"55555"` | NIXL peer port for KV transfer |
| `servingEngineSpec.modelSpec[].lmcacheConfig.nixlBufferSize` | string | `"1073741824"` | NIXL buffer size for KV transfer |
+#### KEDA Autoscaling Configuration
+
+> **Note**: Unless explicitly set, KEDA's default values will apply. The defaults shown below are KEDA's defaults, not values enforced by this Helm chart.
+
+| Field | Type | KEDA Default | Description |
+|-------|------|--------------|-------------|
+| `servingEngineSpec.modelSpec[].keda.enabled` | boolean | `false` | Enable KEDA autoscaling for this model deployment (requires KEDA installed in cluster) |
+| `servingEngineSpec.modelSpec[].keda.minReplicaCount` | integer | `0` | Minimum number of replicas (supports 0 for scale-to-zero) |
+| `servingEngineSpec.modelSpec[].keda.maxReplicaCount` | integer | `100` | Maximum number of replicas |
+| `servingEngineSpec.modelSpec[].keda.pollingInterval` | integer | `30` | How often KEDA checks metrics (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.cooldownPeriod` | integer | `300` | Period to wait after the last trigger reported active before scaling down (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.idleReplicaCount` | integer | - | Number of replicas when no triggers are active (KEDA currently supports only `0`) |
+| `servingEngineSpec.modelSpec[].keda.initialCooldownPeriod` | integer | - | Initial cooldown period before scaling down after creation (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.fallback` | map | - | Fallback configuration when scaler fails |
+| `servingEngineSpec.modelSpec[].keda.fallback.failureThreshold` | integer | - | Number of consecutive failures before fallback |
+| `servingEngineSpec.modelSpec[].keda.fallback.replicas` | integer | - | Number of replicas to scale to in fallback |
+| `servingEngineSpec.modelSpec[].keda.triggers` | list | See below | List of KEDA trigger configurations (Prometheus-based) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].type` | string | - | Trigger type (e.g., `"prometheus"`) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.serverAddress` | string | - | Prometheus server URL (e.g., `http://prometheus-operated.monitoring.svc:9090`) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.metricName` | string | - | Name of the metric to monitor |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.query` | string | - | PromQL query to fetch the metric |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.threshold` | string | - | Threshold value that triggers scaling |
+| `servingEngineSpec.modelSpec[].keda.advanced` | map | - | Advanced KEDA configuration options |
+| `servingEngineSpec.modelSpec[].keda.advanced.restoreToOriginalReplicaCount` | boolean | `false` | Restore original replica count when ScaledObject is deleted |
+| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig` | map | - | HPA-specific configuration |
+| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.name` | string | `keda-hpa-{scaled-object-name}` | Custom name for HPA resource |
+| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.behavior` | map | - | HPA scaling behavior configuration (see [K8s docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)) |
+| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers` | map | - | Scaling modifiers for composite metrics |
+| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.target` | string | - | Target value for the composed metric |
+| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.activationTarget` | string | - | Activation target for the composed metric |
+| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.metricType` | string | `"AverageValue"` | Metric type (AverageValue or Value) |
+| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.formula` | string | - | Formula to compose metrics together |
+
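+A minimal example that enables KEDA autoscaling for one model (illustrative values; see the rows above for the full set of options):
+
+```yaml
+servingEngineSpec:
+  modelSpec:
+    - name: "llama3"
+      # ... model settings (repository, modelURL, resources) ...
+      keda:
+        enabled: true
+        minReplicaCount: 1
+        maxReplicaCount: 3
+```
+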
### Router Configuration
| Field | Type | Default | Description |
diff --git a/helm/templates/scaledobject-vllm.yaml b/helm/templates/scaledobject-vllm.yaml
new file mode 100644
index 000000000..be813c03c
--- /dev/null
+++ b/helm/templates/scaledobject-vllm.yaml
@@ -0,0 +1,74 @@
+{{- if .Values.servingEngineSpec.enableEngine -}}
+{{- range $modelSpec := .Values.servingEngineSpec.modelSpec }}
+{{- if and (hasKey $modelSpec "keda") $modelSpec.keda.enabled }}
+{{- if not (hasKey $modelSpec "raySpec") }}
+{{- with $ -}}
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+ name: "{{ .Release.Name }}-{{$modelSpec.name}}-scaledobject"
+ namespace: {{ .Release.Namespace }}
+ labels:
+ model: {{ $modelSpec.name }}
+ helm-release-name: {{ .Release.Name }}
+ {{- include "chart.engineLabels" . | nindent 4 }}
+spec:
+ scaleTargetRef:
+ apiVersion: apps/v1
+ kind: Deployment
+ name: "{{ .Release.Name }}-{{$modelSpec.name}}-deployment-vllm"
+ {{- if hasKey $modelSpec.keda "minReplicaCount" }}
+ minReplicaCount: {{ $modelSpec.keda.minReplicaCount }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "maxReplicaCount" }}
+ maxReplicaCount: {{ $modelSpec.keda.maxReplicaCount }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "pollingInterval" }}
+ pollingInterval: {{ $modelSpec.keda.pollingInterval }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "cooldownPeriod" }}
+ cooldownPeriod: {{ $modelSpec.keda.cooldownPeriod }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "idleReplicaCount" }}
+ idleReplicaCount: {{ $modelSpec.keda.idleReplicaCount }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "initialCooldownPeriod" }}
+ initialCooldownPeriod: {{ $modelSpec.keda.initialCooldownPeriod }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "fallback" }}
+ fallback:
+ {{- toYaml $modelSpec.keda.fallback | nindent 4 }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "advanced" }}
+ advanced:
+ {{- if hasKey $modelSpec.keda.advanced "restoreToOriginalReplicaCount" }}
+ restoreToOriginalReplicaCount: {{ $modelSpec.keda.advanced.restoreToOriginalReplicaCount }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda.advanced "horizontalPodAutoscalerConfig" }}
+ horizontalPodAutoscalerConfig:
+ {{- toYaml $modelSpec.keda.advanced.horizontalPodAutoscalerConfig | nindent 6 }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda.advanced "scalingModifiers" }}
+ scalingModifiers:
+ {{- toYaml $modelSpec.keda.advanced.scalingModifiers | nindent 6 }}
+ {{- end }}
+ {{- end }}
+ {{- if hasKey $modelSpec.keda "triggers" }}
+ triggers:
+ {{- toYaml $modelSpec.keda.triggers | nindent 4 }}
+ {{- else }}
+ # Default trigger configuration - monitors vllm queue length
+ triggers:
+ - type: prometheus
+ metadata:
+ serverAddress: http://prometheus-operated.monitoring.svc:9090
+ metricName: vllm:num_requests_waiting
+ query: vllm:num_requests_waiting{model="{{ $modelSpec.name }}"}
+ threshold: '5'
+ {{- end }}
+---
+{{- end }}
+{{- end }}
+{{- end }}
+{{- end }}
+{{- end }}
diff --git a/helm/tests/keda_test.yaml b/helm/tests/keda_test.yaml
new file mode 100644
index 000000000..402e1809d
--- /dev/null
+++ b/helm/tests/keda_test.yaml
@@ -0,0 +1,160 @@
+suite: test KEDA autoscaling configuration
+templates:
+ - scaledobject-vllm.yaml
+ - deployment-vllm-multi.yaml
+tests:
+ - it: should not create ScaledObject when keda is not enabled
+ set:
+ servingEngineSpec:
+ enableEngine: true
+ modelSpec:
+ - name: "test-model"
+ repository: "vllm/vllm-openai"
+ tag: "latest"
+ modelURL: "facebook/opt-125m"
+ replicaCount: 1
+ requestCPU: 1
+ requestMemory: "1Gi"
+ requestGPU: 1
+ keda:
+ enabled: false
+ asserts:
+ - template: scaledobject-vllm.yaml
+ hasDocuments:
+ count: 0
+
+ - it: should create ScaledObject when keda is enabled
+ set:
+ servingEngineSpec:
+ enableEngine: true
+ modelSpec:
+ - name: "test-model"
+ repository: "vllm/vllm-openai"
+ tag: "latest"
+ modelURL: "facebook/opt-125m"
+ replicaCount: 1
+ requestCPU: 1
+ requestMemory: "1Gi"
+ requestGPU: 1
+ keda:
+ enabled: true
+ minReplicaCount: 1
+ maxReplicaCount: 3
+ triggers:
+ - type: prometheus
+ metadata:
+ serverAddress: http://prometheus-operated.monitoring.svc:9090
+ metricName: vllm:num_requests_waiting
+ query: vllm:num_requests_waiting
+ threshold: '5'
+ asserts:
+ - template: scaledobject-vllm.yaml
+ hasDocuments:
+ count: 1
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: metadata.name
+ value: RELEASE-NAME-test-model-scaledobject
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.minReplicaCount
+ value: 1
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.maxReplicaCount
+ value: 3
+
+ - it: should use default trigger when triggers are not provided
+ set:
+ servingEngineSpec:
+ enableEngine: true
+ modelSpec:
+ - name: "test-model-default"
+ repository: "vllm/vllm-openai"
+ tag: "latest"
+ modelURL: "facebook/opt-125m"
+ replicaCount: 1
+ requestCPU: 1
+ requestMemory: "1Gi"
+ requestGPU: 1
+ keda:
+ enabled: true
+ minReplicaCount: 1
+ maxReplicaCount: 5
+ asserts:
+ - template: scaledobject-vllm.yaml
+ hasDocuments:
+ count: 1
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.triggers[0].type
+ value: prometheus
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.triggers[0].metadata.query
+ value: vllm:num_requests_waiting{model="test-model-default"}
+
+ - it: should support advanced KEDA configuration
+ set:
+ servingEngineSpec:
+ enableEngine: true
+ modelSpec:
+ - name: "test-model-advanced"
+ repository: "vllm/vllm-openai"
+ tag: "latest"
+ modelURL: "facebook/opt-125m"
+ replicaCount: 1
+ requestCPU: 1
+ requestMemory: "1Gi"
+ requestGPU: 1
+ keda:
+ enabled: true
+ minReplicaCount: 0
+ maxReplicaCount: 10
+ pollingInterval: 10
+ cooldownPeriod: 60
+ idleReplicaCount: 0
+ initialCooldownPeriod: 30
+ fallback:
+ failureThreshold: 5
+ replicas: 2
+ advanced:
+ restoreToOriginalReplicaCount: true
+ horizontalPodAutoscalerConfig:
+ name: my-hpa
+ behavior:
+ scaleDown:
+ stabilizationWindowSeconds: 300
+ asserts:
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.pollingInterval
+ value: 10
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.cooldownPeriod
+ value: 60
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.idleReplicaCount
+ value: 0
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.initialCooldownPeriod
+ value: 30
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.fallback.failureThreshold
+ value: 5
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.fallback.replicas
+ value: 2
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.advanced.restoreToOriginalReplicaCount
+ value: true
+ - template: scaledobject-vllm.yaml
+ equal:
+ path: spec.advanced.horizontalPodAutoscalerConfig.name
+ value: my-hpa
diff --git a/helm/values.schema.json b/helm/values.schema.json
index ef618fa5a..954d90e7c 100644
--- a/helm/values.schema.json
+++ b/helm/values.schema.json
@@ -306,6 +306,111 @@
}
}
}
+ },
+ "keda": {
+ "type": "object",
+ "description": "KEDA autoscaling configuration for this model deployment",
+ "properties": {
+ "enabled": {
+ "type": "boolean",
+ "description": "Whether to enable KEDA autoscaling for this model"
+ },
+ "minReplicaCount": {
+ "type": "integer",
+ "description": "Minimum number of replicas (supports 0 for scale-to-zero)",
+ "minimum": 0
+ },
+ "maxReplicaCount": {
+ "type": "integer",
+ "description": "Maximum number of replicas",
+ "minimum": 1
+ },
+ "pollingInterval": {
+ "type": "integer",
+ "description": "How often KEDA should check the metrics (in seconds)",
+ "minimum": 1
+ },
+ "cooldownPeriod": {
+ "type": "integer",
+ "description": "How long to wait before scaling down after scaling up (in seconds)",
+ "minimum": 0
+ },
+ "idleReplicaCount": {
+ "type": "integer",
+ "description": "Number of replicas to scale to when no triggers are active",
+ "minimum": 0
+ },
+ "initialCooldownPeriod": {
+ "type": "integer",
+ "description": "Initial cooldown period in seconds before scaling down after creation",
+ "minimum": 0
+ },
+ "fallback": {
+ "type": "object",
+ "description": "Fallback configuration when scaler fails",
+ "properties": {
+ "failureThreshold": {
+ "type": "integer",
+ "description": "Number of consecutive failures before fallback",
+ "minimum": 1
+ },
+ "replicas": {
+ "type": "integer",
+ "description": "Number of replicas to scale to in fallback",
+ "minimum": 0
+ }
+ },
+ "required": [
+ "failureThreshold",
+ "replicas"
+ ]
+ },
+ "triggers": {
+ "type": "array",
+ "description": "List of KEDA trigger configurations",
+ "items": {
+ "type": "object",
+ "properties": {
+ "type": {
+ "type": "string",
+ "description": "Trigger type (e.g., prometheus, cpu, memory)"
+ },
+ "metadata": {
+ "type": "object",
+ "description": "Trigger-specific metadata",
+ "additionalProperties": true
+ }
+ },
+ "required": [
+ "type",
+ "metadata"
+ ]
+ }
+ },
+ "advanced": {
+ "type": "object",
+ "description": "Advanced KEDA configuration",
+ "properties": {
+ "restoreToOriginalReplicaCount": {
+ "type": "boolean",
+ "description": "Restore original replica count when ScaledObject is deleted"
+ },
+ "horizontalPodAutoscalerConfig": {
+ "type": "object",
+ "description": "HPA-specific configuration",
+ "additionalProperties": true
+ },
+ "scalingModifiers": {
+ "type": "object",
+ "description": "Scaling modifiers for composite metrics",
+ "additionalProperties": true
+ }
+ }
+ }
+ },
+ "required": [
+ "enabled"
+ ]
}
},
"required": [
@@ -492,16 +597,26 @@
"type": "string"
}
},
- "required": ["name"]
+ "required": [
+ "name"
+ ]
}
},
"autoscaling": {
"type": "object",
"properties": {
- "enabled": {"type": "boolean"},
- "minReplicas": {"type": "integer"},
- "maxReplicas": {"type": "integer"},
- "targetCPUUtilizationPercentage": {"type": "integer"}
+ "enabled": {
+ "type": "boolean"
+ },
+ "minReplicas": {
+ "type": "integer"
+ },
+ "maxReplicas": {
+ "type": "integer"
+ },
+ "targetCPUUtilizationPercentage": {
+ "type": "integer"
+ }
}
},
"containerPort": {
@@ -517,19 +632,24 @@
"description": "Container-level security context configuration",
"additionalProperties": true
},
-
"servicePort": {
"type": "integer"
},
"serviceDiscovery": {
"type": "string",
"description": "Service discovery mode. Available values: k8s or static.",
- "enum": ["k8s", "static"]
+ "enum": [
+ "k8s",
+ "static"
+ ]
},
"k8sServiceDiscoveryType": {
"type": "string",
"description": "Kubernetes service discovery type. Available values: pod-ip and service-name.",
- "enum": ["pod-ip", "service-name"]
+ "enum": [
+ "pod-ip",
+ "service-name"
+ ]
},
"routingLogic": {
"type": "string"
diff --git a/helm/values.yaml b/helm/values.yaml
index f011ac552..ecd2c0607 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -119,6 +119,35 @@ servingEngineSpec:
# - shmSize: (optional, string) The size of the shared memory, e.g., "20Gi"
# - enableLoRA: (optional, bool) Whether to enable LoRA, e.g., true
#
+ # - keda: (optional, map) KEDA autoscaling configuration for this model deployment. Requires KEDA to be installed in the cluster.
+ # - enabled: (optional, bool) Whether to enable KEDA autoscaling for this model, e.g., true
+ # - minReplicaCount: (optional, int) Minimum number of replicas (supports 0 for scale-to-zero), e.g., 1
+ # - maxReplicaCount: (optional, int) Maximum number of replicas, e.g., 5
+ # - pollingInterval: (optional, int) How often KEDA should check the metrics (in seconds), e.g., 15
+ # - cooldownPeriod: (optional, int) Period to wait after the last trigger reported active before scaling down (in seconds), e.g., 360
+ # - idleReplicaCount: (optional, int) Number of replicas to scale to when no triggers are active, e.g., 0
+ # - initialCooldownPeriod: (optional, int) Initial cooldown period in seconds before scaling down after creation, e.g., 60
+ # - fallback: (optional, map) Fallback configuration when scaler fails
+ # - failureThreshold: (int) Number of consecutive failures before fallback, e.g., 3
+ # - replicas: (int) Number of replicas to scale to in fallback, e.g., 2
+ # - triggers: (optional, list) List of KEDA trigger configurations
+ # - type: (string) Trigger type, e.g., "prometheus"
+ # - metadata: (map) Trigger metadata
+ # - serverAddress: (string) Prometheus server address, e.g., "http://prometheus-operated.monitoring.svc:9090"
+ # - metricName: (string) Name of the metric to monitor, e.g., "vllm:num_requests_waiting"
+ # - query: (string) Prometheus query to fetch the metric, e.g., "vllm:num_requests_waiting"
+ # - threshold: (string) Threshold value that triggers scaling, e.g., "5"
+ # - advanced: (optional, map) Advanced KEDA configuration
+ # - restoreToOriginalReplicaCount: (optional, bool) Restore original replica count when ScaledObject is deleted, e.g., false
+ # - horizontalPodAutoscalerConfig: (optional, map) HPA-specific configuration
+ # - name: (optional, string) Custom name for the HPA resource, default: "keda-hpa-{scaled-object-name}"
+ # - behavior: (optional, map) HPA scaling behavior configuration, see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
+ # - scalingModifiers: (optional, map) Scaling modifiers for composite metrics
+ # - target: (string) Target value for the composed metric
+ # - activationTarget: (optional, string) Activation target for the composed metric
+ # - metricType: (optional, string) Metric type (AverageValue or Value), default: "AverageValue"
+ # - formula: (string) Formula to compose metrics together
+ #
# Example:
# vllmApiKey: "vllm_xxxxxxxxxxxxx"
# modelSpec:
@@ -177,6 +206,21 @@ servingEngineSpec:
# operator: "In"
# values:
# - "NVIDIA-RTX-A6000"
+ #
+ # keda:
+ # enabled: true
+ # minReplicaCount: 1
+ # maxReplicaCount: 3
+ # pollingInterval: 15
+ # cooldownPeriod: 360
+ # triggers:
+ # - type: prometheus
+ # metadata:
+ # serverAddress: http://prometheus-operated.monitoring.svc:9090
+ # metricName: vllm:num_requests_waiting
+ # query: vllm:num_requests_waiting
+ # threshold: '5'
+
# extraVolumes:
# - name: dev-fuse