diff --git a/docs/source/use_cases/autoscaling-keda.rst b/docs/source/use_cases/autoscaling-keda.rst
index 1515acf17..624327cc5 100644
--- a/docs/source/use_cases/autoscaling-keda.rst
+++ b/docs/source/use_cases/autoscaling-keda.rst
@@ -1,7 +1,7 @@
 Autoscaling with KEDA
 =====================

-This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. You'll configure KEDA to monitor queue length and dynamically adjust the number of replicas based on load.
+This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. With the vLLM Production Stack Helm chart (v0.1.9+), KEDA autoscaling is integrated directly into the chart, allowing you to enable it through simple ``values.yaml`` configuration.

 Table of Contents
 -----------------
@@ -9,56 +9,81 @@ Table of Contents

 - Prerequisites_
 - Steps_

-  - `1. Install the vLLM Production Stack`_
-  - `2. Deploy the Observability Stack`_
+  - `1. Deploy the Observability Stack`_
+  - `2. Configure and Deploy vLLM`_
   - `3. Install KEDA`_
-  - `4. Verify Metric Export`_
-  - `5. Configure the ScaledObject`_
+  - `4. Enable KEDA Autoscaling for vLLM`_
+  - `5. Verify KEDA ScaledObject Creation`_
   - `6. Test Autoscaling`_
-  - `7. Cleanup`_
+  - `7. Advanced Configuration`_
+  - `8. Cleanup`_

 - `Additional Resources`_

 Prerequisites
 -------------

-- A working vLLM deployment on Kubernetes (see :doc:`../getting_started/quickstart`)
 - Access to a Kubernetes cluster with at least 2 GPUs
-- ``kubectl`` and ``helm`` installed
+- ``kubectl`` and ``helm`` installed (v3.0+)
 - Basic understanding of Kubernetes and Prometheus metrics

 Steps
 -----

-1. Install the vLLM Production Stack
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Install the production stack using a single pod by following the instructions in :doc:`../deployment/helm`.
-
-2. Deploy the Observability Stack
+1. Deploy the Observability Stack
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-This stack includes Prometheus, Grafana, and necessary exporters.
+The observability stack (Prometheus, Grafana) is required for KEDA to query metrics.

 .. code-block:: bash

    cd observability
    bash install.sh

-3. Install KEDA
-~~~~~~~~~~~~~~~
+Verify Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``:

 .. code-block:: bash

-   kubectl create namespace keda
-   helm repo add kedacore https://kedacore.github.io/charts
-   helm repo update
-   helm install keda kedacore/keda --namespace keda
+   kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
+
+In a separate terminal:
+
+.. code-block:: bash
+
+   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
+
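+If Prometheus is scraping the metric, the response should look similar to the following (the timestamp and pod name will differ):
+
+.. code-block:: json
+
+   {
+     "status": "success",
+     "data": {
+       "result": [
+         {
+           "metric": {
+             "__name__": "vllm:num_requests_waiting",
+             "pod": "vllm-llama3-deployment-vllm-xxxxx"
+           },
+           "value": [ 1749077215.034, "0" ]
+         }
+       ]
+     }
+   }
+
+Here ``"0"`` means there were no pending requests in the queue at the given timestamp.
+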
+2. Configure and Deploy vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Create a ``values.yaml`` file to deploy vLLM. Note that we'll enable KEDA autoscaling in a later step after KEDA is installed:
+
+.. code-block:: yaml
+
+   servingEngineSpec:
+     enableEngine: true
+     modelSpec:
+       - name: "llama3"
+         repository: "lmcache/vllm-openai"
+         tag: "latest"
+         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
+         replicaCount: 1
+         requestCPU: 10
+         requestMemory: "64Gi"
+         requestGPU: 1
+
+Deploy the chart:
+
+.. code-block:: bash
+
+   helm install vllm vllm/vllm-stack -f values.yaml

-4. Verify Metric Export
-~~~~~~~~~~~~~~~~~~~~~~~
+Wait for the vLLM deployment to be ready and verify that metrics are being exposed:

-Check that Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``.

 .. code-block:: bash
+
+   kubectl wait --for=condition=ready pod -l model=llama3 --timeout=300s
+
+Verify Prometheus is scraping the vLLM metrics:

 .. code-block:: bash
@@ -70,115 +95,216 @@ In a separate terminal:

    curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'

-Example output:
+3. Install KEDA
+~~~~~~~~~~~~~~~

-.. code-block:: json
+Now that vLLM is running and exposing metrics, install KEDA to enable autoscaling:

-   {
-     "status": "success",
-     "data": {
-       "result": [
-         {
-           "metric": {
-             "__name__": "vllm:num_requests_waiting",
-             "pod": "vllm-llama3-deployment-vllm-xxxxx"
-           },
-           "value": [ 1749077215.034, "0" ]
-         }
-       ]
-     }
-   }
+.. code-block:: bash

-This means that at the given timestamp, there were 0 pending requests in the queue.
+   kubectl create namespace keda
+   helm repo add kedacore https://kedacore.github.io/charts
+   helm repo update
+   helm install keda kedacore/keda --namespace keda

-5. Configure the ScaledObject
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Verify KEDA is running:
+
+.. code-block:: bash

-The following ``ScaledObject`` configuration is provided in ``tutorials/assets/values-19-keda.yaml``. Review its contents:
+   kubectl get pods -n keda
+
+4. Enable KEDA Autoscaling for vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Update your ``values.yaml`` file to enable KEDA autoscaling:

 .. code-block:: yaml

-   apiVersion: keda.sh/v1alpha1
-   kind: ScaledObject
-   metadata:
-     name: vllm-scaledobject
-     namespace: default
-   spec:
-     scaleTargetRef:
-       name: vllm-llama3-deployment-vllm
-     minReplicaCount: 1
-     maxReplicaCount: 2
-     pollingInterval: 15
-     cooldownPeriod: 30
-     triggers:
-       - type: prometheus
-         metadata:
-           serverAddress: http://prometheus-operated.monitoring.svc:9090
-           metricName: vllm:num_requests_waiting
-           query: vllm:num_requests_waiting
-           threshold: '5'
+   servingEngineSpec:
+     enableEngine: true
+     modelSpec:
+       - name: "llama3"
+         repository: "lmcache/vllm-openai"
+         tag: "latest"
+         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
+         replicaCount: 1
+         requestCPU: 10
+         requestMemory: "64Gi"
+         requestGPU: 1
+
+         # Enable KEDA autoscaling
+         keda:
+           enabled: true
+           minReplicaCount: 1
+           maxReplicaCount: 3
+           pollingInterval: 15
+           cooldownPeriod: 360
+           triggers:
+             - type: prometheus
+               metadata:
+                 serverAddress: http://prometheus-operated.monitoring.svc:9090
+                 metricName: vllm:num_requests_waiting
+                 query: vllm:num_requests_waiting
+                 threshold: '5'
+
+Upgrade the chart to enable KEDA autoscaling:
+
+.. code-block:: bash

-Apply the ScaledObject:
+   helm upgrade vllm vllm/vllm-stack -f values.yaml
+
+This configuration tells KEDA to:
+
+- Monitor the ``vllm:num_requests_waiting`` metric from Prometheus
+- Maintain between 1 and 3 replicas
+- Scale up when the queue exceeds 5 pending requests
+- Check metrics every 15 seconds
+- Wait 360 seconds after the last trigger reports inactive before scaling back down (the ``cooldownPeriod``)
+
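+A quick way to sanity-check these values is to render the ScaledObject manifest locally before (or after) the upgrade. This sketch assumes the release name ``vllm`` used throughout this tutorial; ``-s`` (short for ``--show-only``) restricts the output to the chart's ScaledObject template:
+
+.. code-block:: bash
+
+   helm template vllm vllm/vllm-stack -f values.yaml -s templates/scaledobject-vllm.yaml
+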
+5. Verify KEDA ScaledObject Creation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check that the Helm chart created the ScaledObject resource:

 .. code-block:: bash

-   cd ../tutorials
-   kubectl apply -f assets/values-19-keda.yaml
+   kubectl get scaledobjects

-This tells KEDA to:
+You should see:

-- Monitor ``vllm:num_requests_waiting``
-- Scale between 1 and 2 replicas
-- Scale up when the queue exceeds 5 requests
+.. code-block:: text

-6. Test Autoscaling
-~~~~~~~~~~~~~~~~~~~
+   NAME                       SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
+   vllm-llama3-scaledobject   apps/v1.Deployment   vllm-llama3-deployment-vllm   1     3     prometheus                    True    False    Unknown    Unknown   30s

-Watch the deployment:
+View the created HPA:

 .. code-block:: bash

-   kubectl get hpa -n default -w
+   kubectl get hpa

-You should initially see:
+Expected output:

 .. code-block:: text

-   NAME                         REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
-   keda-hpa-vllm-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         2         1
+   NAME                                REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
+   keda-hpa-vllm-llama3-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         3         1
+
+6. Test Autoscaling
+~~~~~~~~~~~~~~~~~~~
+
+Watch the HPA in real-time:

-``TARGETS`` shows the current metric value vs. the target threshold.
-``0/5 (avg)`` means the current value of ``vllm:num_requests_waiting`` is 0, and the threshold is 5.
+.. code-block:: bash
+
+   kubectl get hpa -n default -w

-Generate load:
+Generate load to trigger autoscaling. Port-forward to the router service:

 .. code-block:: bash

    kubectl port-forward svc/vllm-router-service 30080:80

-In a separate terminal:
+In a separate terminal, run a load generator:

 .. code-block:: bash

-   python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
+   python3 tutorials/assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
+
+Within a few minutes, you should see the ``REPLICAS`` value increase as KEDA scales up to handle the load.
+
+7. Advanced Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Scale-to-Zero
+^^^^^^^^^^^^^

-Within a few minutes, the ``REPLICAS`` value should increase to 2.
+Enable scale-to-zero by setting ``minReplicaCount: 0`` and adding a traffic-based keepalive trigger:

-7. Cleanup
+.. code-block:: yaml
+
+   keda:
+     enabled: true
+     minReplicaCount: 0  # Allow scaling to zero
+     maxReplicaCount: 5
+     triggers:
+       # Queue-based scaling
+       - type: prometheus
+         metadata:
+           serverAddress: http://prometheus-operated.monitoring.svc:9090
+           metricName: vllm:num_requests_waiting
+           query: vllm:num_requests_waiting
+           threshold: '5'
+       # Traffic-based keepalive (prevents scale-to-zero when traffic exists)
+       - type: prometheus
+         metadata:
+           serverAddress: http://prometheus-operated.monitoring.svc:9090
+           metricName: vllm:incoming_keepalive
+           query: sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)
+           threshold: "1"
+
+Custom HPA Behavior
+^^^^^^^^^^^^^^^^^^^
+
+Control scaling behavior with custom HPA policies:
+
+.. code-block:: yaml
+
+   keda:
+     enabled: true
+     minReplicaCount: 1
+     maxReplicaCount: 5
+     advanced:
+       horizontalPodAutoscalerConfig:
+         behavior:
+           scaleDown:
+             stabilizationWindowSeconds: 300
+             policies:
+               - type: Percent
+                 value: 50
+                 periodSeconds: 60
+
+Fallback Configuration
+^^^^^^^^^^^^^^^^^^^^^^
+
+Configure fallback behavior when metrics are unavailable:
+
+.. code-block:: yaml
+
+   keda:
+     enabled: true
+     fallback:
+       failureThreshold: 3
+       replicas: 2
+
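+With these values, if the Prometheus query fails 3 consecutive times, KEDA falls back to 2 replicas until the metric source recovers. To check whether fallback is currently active, inspect the ScaledObject status; the resource name below assumes the ``llama3`` model spec used in this tutorial:
+
+.. code-block:: bash
+
+   kubectl describe scaledobject vllm-llama3-scaledobject
+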
+For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md>`_.
+
+8. Cleanup
 ~~~~~~~~~~

-To remove KEDA configuration and observability components:
+To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:
+
+.. code-block:: bash
+
+   helm upgrade vllm vllm/vllm-stack -f values.yaml
+
+To completely remove KEDA from the cluster:

 .. code-block:: bash

-   kubectl delete -f assets/values-19-keda.yaml
    helm uninstall keda -n keda
    kubectl delete namespace keda

-   cd ../observability
+To remove the observability stack:
+
+.. code-block:: bash
+
+   cd observability
    bash uninstall.sh

 Additional Resources
 --------------------

 - `KEDA Documentation <https://keda.sh/docs/>`_
+- `KEDA ScaledObject Specification <https://keda.sh/docs/latest/reference/scaledobject-spec/>`_
+- `Helm Chart KEDA Configuration <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 937ab7782..9592df114 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.8
+version: 0.1.9

 maintainers:
   - name: apostac
diff --git a/helm/README.md b/helm/README.md
index 20188ba66..798c61ee3 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -152,6 +152,39 @@ This table documents all available configuration values for the Production Stack
 | `servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerPort` | string | `"55555"` | NIXL peer port for KV transfer |
 | `servingEngineSpec.modelSpec[].lmcacheConfig.nixlBufferSize` | string | `"1073741824"` | NIXL buffer size for KV transfer |

+#### KEDA Autoscaling Configuration
+
+> **Note**: Unless explicitly set, KEDA's default values will apply. The defaults shown below are KEDA's defaults, not values enforced by this Helm chart.
+
+| Field | Type | KEDA Default | Description |
+|-------|------|--------------|-------------|
+| `servingEngineSpec.modelSpec[].keda.enabled` | boolean | `false` | Enable KEDA autoscaling for this model deployment (requires KEDA installed in cluster) |
+| `servingEngineSpec.modelSpec[].keda.minReplicaCount` | integer | - | Minimum number of replicas (supports 0 for scale-to-zero); if not set, HPA minReplicas default applies |
+| `servingEngineSpec.modelSpec[].keda.maxReplicaCount` | integer | - | Maximum number of replicas; if not set, HPA maxReplicas default applies |
+| `servingEngineSpec.modelSpec[].keda.pollingInterval` | integer | `30` | How often KEDA checks metrics (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.cooldownPeriod` | integer | `300` | How long to wait after the last trigger reports inactive before scaling down (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.idleReplicaCount` | integer | - | Number of replicas when no triggers are active |
+| `servingEngineSpec.modelSpec[].keda.initialCooldownPeriod` | integer | - | Initial cooldown period before scaling down after creation (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.fallback` | map | - | Fallback configuration when scaler fails |
+| `servingEngineSpec.modelSpec[].keda.fallback.failureThreshold` | integer | - | Number of consecutive failures before fallback |
+| `servingEngineSpec.modelSpec[].keda.fallback.replicas` | integer | - | Number of replicas to scale to in fallback |
+| `servingEngineSpec.modelSpec[].keda.triggers` | list | See below | List of KEDA trigger configurations (Prometheus-based) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].type` | string | - | Trigger type (e.g., `"prometheus"`) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.serverAddress` | string | - | Prometheus server URL (e.g., `http://prometheus-operated.monitoring.svc:9090`) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.metricName` | string | - | Name of the metric to monitor |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.query` | string | - | PromQL query to fetch the 
metric | +| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.threshold` | string | - | Threshold value that triggers scaling | +| `servingEngineSpec.modelSpec[].keda.advanced` | map | - | Advanced KEDA configuration options | +| `servingEngineSpec.modelSpec[].keda.advanced.restoreToOriginalReplicaCount` | boolean | `false` | Restore original replica count when ScaledObject is deleted | +| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig` | map | - | HPA-specific configuration | +| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.name` | string | `keda-hpa-{scaled-object-name}` | Custom name for HPA resource | +| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.behavior` | map | - | HPA scaling behavior configuration (see [K8s docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)) | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers` | map | - | Scaling modifiers for composite metrics | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.target` | string | - | Target value for the composed metric | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.activationTarget` | string | - | Activation target for the composed metric | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.metricType` | string | `"AverageValue"` | Metric type (AverageValue or Value) | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.formula` | string | - | Formula to compose metrics together | + ### Router Configuration | Field | Type | Default | Description | diff --git a/helm/templates/scaledobject-vllm.yaml b/helm/templates/scaledobject-vllm.yaml new file mode 100644 index 000000000..be813c03c --- /dev/null +++ b/helm/templates/scaledobject-vllm.yaml @@ -0,0 +1,74 @@ +{{- if .Values.servingEngineSpec.enableEngine -}} +{{- range $modelSpec := .Values.servingEngineSpec.modelSpec }} +{{- if and (hasKey $modelSpec "keda") $modelSpec.keda.enabled }} +{{- if not (hasKey $modelSpec "raySpec") }} +{{- with $ -}} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: "{{ .Release.Name }}-{{$modelSpec.name}}-scaledobject" + namespace: {{ .Release.Namespace }} + labels: + model: {{ $modelSpec.name }} + helm-release-name: {{ .Release.Name }} + {{- include "chart.engineLabels" . 
| nindent 4 }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: "{{ .Release.Name }}-{{$modelSpec.name}}-deployment-vllm" + {{- if hasKey $modelSpec.keda "minReplicaCount" }} + minReplicaCount: {{ $modelSpec.keda.minReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda "maxReplicaCount" }} + maxReplicaCount: {{ $modelSpec.keda.maxReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda "pollingInterval" }} + pollingInterval: {{ $modelSpec.keda.pollingInterval }} + {{- end }} + {{- if hasKey $modelSpec.keda "cooldownPeriod" }} + cooldownPeriod: {{ $modelSpec.keda.cooldownPeriod }} + {{- end }} + {{- if hasKey $modelSpec.keda "idleReplicaCount" }} + idleReplicaCount: {{ $modelSpec.keda.idleReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda "initialCooldownPeriod" }} + initialCooldownPeriod: {{ $modelSpec.keda.initialCooldownPeriod }} + {{- end }} + {{- if hasKey $modelSpec.keda "fallback" }} + fallback: + {{- toYaml $modelSpec.keda.fallback | nindent 4 }} + {{- end }} + {{- if hasKey $modelSpec.keda "advanced" }} + advanced: + {{- if hasKey $modelSpec.keda.advanced "restoreToOriginalReplicaCount" }} + restoreToOriginalReplicaCount: {{ $modelSpec.keda.advanced.restoreToOriginalReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda.advanced "horizontalPodAutoscalerConfig" }} + horizontalPodAutoscalerConfig: + {{- toYaml $modelSpec.keda.advanced.horizontalPodAutoscalerConfig | nindent 6 }} + {{- end }} + {{- if hasKey $modelSpec.keda.advanced "scalingModifiers" }} + scalingModifiers: + {{- toYaml $modelSpec.keda.advanced.scalingModifiers | nindent 6 }} + {{- end }} + {{- end }} + {{- if hasKey $modelSpec.keda "triggers" }} + triggers: + {{- toYaml $modelSpec.keda.triggers | nindent 4 }} + {{- else }} + # Default trigger configuration - monitors vllm queue length + triggers: + - type: prometheus + metadata: + serverAddress: http://prometheus-operated.monitoring.svc:9090 + metricName: vllm:num_requests_waiting + query: vllm:num_requests_waiting{model="{{ $modelSpec.name }}"} + threshold: '5' + {{- end }} +--- +{{- end }} +{{- end }} +{{- end }} +{{- end }} +{{- end }} diff --git a/helm/tests/keda_test.yaml b/helm/tests/keda_test.yaml new file mode 100644 index 000000000..402e1809d --- /dev/null +++ b/helm/tests/keda_test.yaml @@ -0,0 +1,160 @@ +suite: test KEDA autoscaling configuration +templates: + - scaledobject-vllm.yaml + - deployment-vllm-multi.yaml +tests: + - it: should not create ScaledObject when keda is not enabled + set: + servingEngineSpec: + enableEngine: true + modelSpec: + - name: "test-model" + repository: "vllm/vllm-openai" + tag: "latest" + modelURL: "facebook/opt-125m" + replicaCount: 1 + requestCPU: 1 + requestMemory: "1Gi" + requestGPU: 1 + keda: + enabled: false + asserts: + - template: scaledobject-vllm.yaml + hasDocuments: + count: 0 + + - it: should create ScaledObject when keda is enabled + set: + servingEngineSpec: + enableEngine: true + modelSpec: + - name: "test-model" + repository: "vllm/vllm-openai" + tag: "latest" + modelURL: "facebook/opt-125m" + replicaCount: 1 + requestCPU: 1 + requestMemory: "1Gi" + requestGPU: 1 + keda: + enabled: true + minReplicaCount: 1 + maxReplicaCount: 3 + triggers: + - type: prometheus + metadata: + serverAddress: http://prometheus-operated.monitoring.svc:9090 + metricName: vllm:num_requests_waiting + query: vllm:num_requests_waiting + threshold: '5' + asserts: + - template: scaledobject-vllm.yaml + hasDocuments: + count: 1 + - template: scaledobject-vllm.yaml + equal: + path: 
metadata.name
+          value: RELEASE-NAME-test-model-scaledobject
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.minReplicaCount
+          value: 1
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.maxReplicaCount
+          value: 3
+
+  - it: should use default trigger when no triggers are provided
+    set:
+      servingEngineSpec:
+        enableEngine: true
+        modelSpec:
+          - name: "test-model-default"
+            repository: "vllm/vllm-openai"
+            tag: "latest"
+            modelURL: "facebook/opt-125m"
+            replicaCount: 1
+            requestCPU: 1
+            requestMemory: "1Gi"
+            requestGPU: 1
+            keda:
+              enabled: true
+              minReplicaCount: 1
+              maxReplicaCount: 5
+    asserts:
+      - template: scaledobject-vllm.yaml
+        hasDocuments:
+          count: 1
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.triggers[0].type
+          value: prometheus
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.triggers[0].metadata.query
+          value: vllm:num_requests_waiting{model="test-model-default"}
+
+  - it: should support advanced KEDA configuration
+    set:
+      servingEngineSpec:
+        enableEngine: true
+        modelSpec:
+          - name: "test-model-advanced"
+            repository: "vllm/vllm-openai"
+            tag: "latest"
+            modelURL: "facebook/opt-125m"
+            replicaCount: 1
+            requestCPU: 1
+            requestMemory: "1Gi"
+            requestGPU: 1
+            keda:
+              enabled: true
+              minReplicaCount: 0
+              maxReplicaCount: 10
+              pollingInterval: 10
+              cooldownPeriod: 60
+              idleReplicaCount: 0
+              initialCooldownPeriod: 30
+              fallback:
+                failureThreshold: 5
+                replicas: 2
+              advanced:
+                restoreToOriginalReplicaCount: true
+                horizontalPodAutoscalerConfig:
+                  name: my-hpa
+                  behavior:
+                    scaleDown:
+                      stabilizationWindowSeconds: 300
+    asserts:
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.pollingInterval
+          value: 10
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.cooldownPeriod
+          value: 60
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.idleReplicaCount
+          value: 0
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.initialCooldownPeriod
+          value: 30
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.fallback.failureThreshold
+          value: 5
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.fallback.replicas
+          value: 2
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.advanced.restoreToOriginalReplicaCount
+          value: true
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.advanced.horizontalPodAutoscalerConfig.name
+          value: my-hpa
diff --git a/helm/values.schema.json b/helm/values.schema.json
index ef618fa5a..954d90e7c 100644
--- a/helm/values.schema.json
+++ b/helm/values.schema.json
@@ -306,6 +306,111 @@
           }
         }
       }
+    },
+    "keda": {
+      "type": "object",
+      "description": "KEDA autoscaling configuration for this model deployment",
+      "properties": {
+        "enabled": {
+          "type": "boolean",
+          "description": "Whether to enable KEDA autoscaling for this model"
+        },
+        "minReplicaCount": {
+          "type": "integer",
+          "description": "Minimum number of replicas (supports 0 for scale-to-zero)",
+          "minimum": 0
+        },
+        "maxReplicaCount": {
+          "type": "integer",
+          "description": "Maximum number of replicas",
+          "minimum": 1
+        },
+        "pollingInterval": {
+          "type": "integer",
+          "description": "How often KEDA should check the metrics (in seconds)",
+          "minimum": 1
+        },
+        "cooldownPeriod": {
+          "type": "integer",
+          "description": "How long to wait after the last trigger reports inactive before scaling down (in seconds)",
+          "minimum": 0
+        },
+        "idleReplicaCount": {
+          "type": "integer",
+          "description": "Number of replicas to scale to when no triggers are active",
+          "minimum": 0
+        },
+        "initialCooldownPeriod": {
+          "type": "integer",
+          "description": 
"Initial cooldown period in seconds before scaling down after creation", + "minimum": 0 + }, + "fallback": { + "type": "object", + "description": "Fallback configuration when scaler fails", + "properties": { + "failureThreshold": { + "type": "integer", + "description": "Number of consecutive failures before fallback", + "minimum": 1 + }, + "replicas": { + "type": "integer", + "description": "Number of replicas to scale to in fallback", + "minimum": 0 + } + }, + "required": [ + "failureThreshold", + "replicas" + ] + }, + "triggers": { + "type": "array", + "description": "List of KEDA trigger configurations", + "items": { + "type": "object", + "properties": { + "type": { + "type": "string", + "description": "Trigger type (e.g., prometheus, cpu, memory)" + }, + "metadata": { + "type": "object", + "description": "Trigger-specific metadata", + "additionalProperties": true + } + }, + "required": [ + "type", + "metadata" + ] + } + }, + "advanced": { + "type": "object", + "description": "Advanced KEDA configuration", + "properties": { + "restoreToOriginalReplicaCount": { + "type": "boolean", + "description": "Restore original replica count when ScaledObject is deleted" + }, + "horizontalPodAutoscalerConfig": { + "type": "object", + "description": "HPA-specific configuration", + "additionalProperties": true + }, + "scalingModifiers": { + "type": "object", + "description": "Scaling modifiers for composite metrics", + "additionalProperties": true + } + } + } + }, + "required": [ + "enabled" + ] } }, "required": [ @@ -492,16 +597,26 @@ "type": "string" } }, - "required": ["name"] + "required": [ + "name" + ] } }, "autoscaling": { "type": "object", "properties": { - "enabled": {"type": "boolean"}, - "minReplicas": {"type": "integer"}, - "maxReplicas": {"type": "integer"}, - "targetCPUUtilizationPercentage": {"type": "integer"} + "enabled": { + "type": "boolean" + }, + "minReplicas": { + "type": "integer" + }, + "maxReplicas": { + "type": "integer" + }, + "targetCPUUtilizationPercentage": { + "type": "integer" + } } }, "containerPort": { @@ -517,19 +632,24 @@ "description": "Container-level security context configuration", "additionalProperties": true }, - "servicePort": { "type": "integer" }, "serviceDiscovery": { "type": "string", "description": "Service discovery mode. Available values: k8s or static.", - "enum": ["k8s", "static"] + "enum": [ + "k8s", + "static" + ] }, "k8sServiceDiscoveryType": { "type": "string", "description": "Kubernetes service discovery type. Available values: pod-ip and service-name.", - "enum": ["pod-ip", "service-name"] + "enum": [ + "pod-ip", + "service-name" + ] }, "routingLogic": { "type": "string" diff --git a/helm/values.yaml b/helm/values.yaml index f011ac552..ecd2c0607 100644 --- a/helm/values.yaml +++ b/helm/values.yaml @@ -119,6 +119,35 @@ servingEngineSpec: # - shmSize: (optional, string) The size of the shared memory, e.g., "20Gi" # - enableLoRA: (optional, bool) Whether to enable LoRA, e.g., true # + # - keda: (optional, map) KEDA autoscaling configuration for this model deployment. Requires KEDA to be installed in the cluster. 
+  #   - enabled: (optional, bool) Whether to enable KEDA autoscaling for this model, e.g., true
+  #   - minReplicaCount: (optional, int) Minimum number of replicas (supports 0 for scale-to-zero), e.g., 1
+  #   - maxReplicaCount: (optional, int) Maximum number of replicas, e.g., 5
+  #   - pollingInterval: (optional, int) How often KEDA should check the metrics (in seconds), e.g., 15
+  #   - cooldownPeriod: (optional, int) How long to wait after the last trigger reports inactive before scaling down (in seconds), e.g., 360
+  #   - idleReplicaCount: (optional, int) Number of replicas to scale to when no triggers are active, e.g., 0
+  #   - initialCooldownPeriod: (optional, int) Initial cooldown period in seconds before scaling down after creation, e.g., 60
+  #   - fallback: (optional, map) Fallback configuration when scaler fails
+  #     - failureThreshold: (int) Number of consecutive failures before fallback, e.g., 3
+  #     - replicas: (int) Number of replicas to scale to in fallback, e.g., 2
+  #   - triggers: (optional, list) List of KEDA trigger configurations
+  #     - type: (string) Trigger type, e.g., "prometheus"
+  #     - metadata: (map) Trigger metadata
+  #       - serverAddress: (string) Prometheus server address, e.g., "http://prometheus-operated.monitoring.svc:9090"
+  #       - metricName: (string) Name of the metric to monitor, e.g., "vllm:num_requests_waiting"
+  #       - query: (string) Prometheus query to fetch the metric, e.g., "vllm:num_requests_waiting"
+  #       - threshold: (string) Threshold value that triggers scaling, e.g., "5"
+  #   - advanced: (optional, map) Advanced KEDA configuration
+  #     - restoreToOriginalReplicaCount: (optional, bool) Restore original replica count when ScaledObject is deleted, e.g., false
+  #     - horizontalPodAutoscalerConfig: (optional, map) HPA-specific configuration
+  #       - name: (optional, string) Custom name for the HPA resource, default: "keda-hpa-{scaled-object-name}"
+  #       - behavior: (optional, map) HPA scaling behavior configuration, see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
+  #     - scalingModifiers: (optional, map) Scaling modifiers for composite metrics
+  #       - target: (string) Target value for the composed metric
+  #       - activationTarget: (optional, string) Activation target for the composed metric
+  #       - metricType: (optional, string) Metric type (AverageValue or Value), default: "AverageValue"
+  #       - formula: (string) Formula to compose metrics together
+  #
   # Example:
   #   vllmApiKey: "vllm_xxxxxxxxxxxxx"
   #   modelSpec:
@@ -177,6 +206,21 @@ servingEngineSpec:
   #           operator: "In"
   #           values:
   #             - "NVIDIA-RTX-A6000"
+  #
+  #     keda:
+  #       enabled: true
+  #       minReplicaCount: 1
+  #       maxReplicaCount: 3
+  #       pollingInterval: 15
+  #       cooldownPeriod: 360
+  #       triggers:
+  #         - type: prometheus
+  #           metadata:
+  #             serverAddress: http://prometheus-operated.monitoring.svc:9090
+  #             metricName: vllm:num_requests_waiting
+  #             query: vllm:num_requests_waiting
+  #             threshold: '5'
+
   #   extraVolumes:
   #     - name: dev-fuse