adding KEDA feature

syaseen-rh · syaseen-rh · commit 0a7915c73fae · 2025-08-08T09:17:49.000-04:00
diff --git a/assemblies/serving-large-models.adoc b/assemblies/serving-large-models.adoc
@@ -45,7 +45,7 @@ include::modules/viewing-performance-metrics-for-deployed-model.adoc[leveloffset
 include::modules/deploying-a-grafana-metrics-dashboard.adoc[leveloffset=+2]
 include::modules/deploying-vllm-gpu-metrics-dashboard-grafana.adoc[leveloffset=+2]
 include::modules/ref-grafana-metrics.adoc[leveloffset=+2]
-
+include::modules/configuring-metric-based-autoscaling.adoc[leveloffset=+2]
 == Optimizing model-serving runtimes
 
 You can optionally enhance the preinstalled model-serving runtimes available in {productname-short} to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation. 
diff --git a/modules/configuring-metric-based-autoscaling.adoc b/modules/configuring-metric-based-autoscaling.adoc
@@ -0,0 +1,58 @@
+:_module-type: PROCEDURE
+
+[id="configuring-metric-based-autoscaling_{context}"]
+= Configuring metric-based autoscaling
+
+[role="_abstract"]
+While Knative-based autoscaling features are not available in standard deployment mode, you can enable metric-based autoscaling for an inference service in standard deployments. This capability helps you manage accelerator resources efficiently, lower operational costs, and ensure that your inference services meet performance requirements.
+
+To set up autoscaling for your inference service in standard deployments, you must install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then use model runtime metrics available in OpenShift Monitoring, such as KVCache utilization, Time to First Token (TTFT), and concurrency, to trigger autoscaling of your inference service.
+
+.Prerequisites
+* You have cluster administrator privileges for your {openshift-platform} cluster.
+* You have installed the CMA Operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/nodes/automatically-scaling-pods-with-the-custom-metrics-autoscaler-operator#nodes-cma-autoscaling-custom-install[Installing the custom metrics autoscaler].
++
+[NOTE]
+====
+The `odh-controller` automatically creates the `TriggerAuthentication`, `ServiceAccount`, `Role`, `RoleBinding`, and `Secret` resources to allow CMA access to OpenShift Monitoring metrics.
+====
+* You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/monitoring/configuring-user-workload-monitoring[Configuring user workload monitoring].
+* You have deployed a model on the single-model serving platform in standard deployment mode.
+
+.Procedure
+
+. Log in to the OpenShift console as a cluster administrator.
+. In the *Administrator* perspective, click *Home* -> *Search*.
+. Select the project where you have deployed your model.
+. From the *Resources* dropdown menu, select *InferenceService*.
+. Click the `InferenceService` for your deployed model and then click *YAML*.
+. Under `spec.predictor`, define a metric-based autoscaling policy similar to the following example:
++
+[source,yaml]
+----
+spec:
+  predictor:
+    # …
+    minReplicas: 1
+    maxReplicas: 5
+    autoScaling:
+      metrics:
+        - type: External
+          external:
+            metric:
+              backend: "prometheus"
+              serverAddress: "http://<thanos-service>.<monitoring-namespace>.svc.cluster.local:9092"
+              query: vllm:num_requests_waiting
+            authenticationRef:
+              name: openshift-monitoring-metrics-auth
+            target:
+              type: Value
+              value: "2"
+----
++
+The example configures the inference service to scale between 1 and 5 replicas based on the number of requests waiting to be processed, as reported by the `vllm:num_requests_waiting` metric.
+. Click *Save*.
+
+//[role="_additional-resources"]
+//.Additional resources
+// link:https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/monitoring/index[Monitoring]
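Review note (not part of the patch): the effect of `target.type: Value` with `value: "2"` may be easier to follow with a worked calculation. The following is a minimal sketch, assuming the policy is ultimately evaluated with the standard Kubernetes Horizontal Pod Autoscaler formula for a `Value` target, `ceil(currentReplicas * metricValue / targetValue)`, and then clamped to `minReplicas` and `maxReplicas`. The function name `desired_replicas` is hypothetical, and the sketch ignores the HPA tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, metric_value: float,
                     target_value: float = 2.0,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Approximate the HPA replica calculation for a `Value`-type target.

    Assumption: desired = ceil(current * metric / target), clamped to the
    configured bounds. Tolerance and stabilization behavior are not modeled.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * metric_value / target_value)
    # Clamp to the bounds taken from minReplicas and maxReplicas in the example.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # One replica with six waiting requests against a target of 2.
    print(desired_replicas(current_replicas=1, metric_value=6))
    # Two replicas with eight waiting requests: capped by maxReplicas.
    print(desired_replicas(current_replicas=2, metric_value=8))
    # No waiting requests: floored by minReplicas.
    print(desired_replicas(current_replicas=3, metric_value=0))
```

Under these assumptions, a sustained queue above the target value adds replicas until the metric falls back to the target or the `maxReplicas` ceiling is reached, and an empty queue scales the service back down to `minReplicas`.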