Commit f8b9aee

addressing SME feedback
1 parent ba94039 commit f8b9aee

1 file changed: +15 −6 lines changed

modules/configuring-metric-based-autoscaling.adoc

@@ -6,17 +6,18 @@
 [role="_abstract"]
 While Knative-based autoscaling features are not available in standard deployment modes, you can enable metrics-based autoscaling for an inference service in these deployments. This capability helps you efficiently manage accelerator resources, lower operational costs, and ensure that your inference services meet performance requirements.
 
-To set up autoscaling for your inference service in standard deployments, you must install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then utilize various model runtime metrics available in OpenShift Monitoring, such as KVCache utilization, Time to First Token (TTFT), and concurrency, to trigger autoscaling of your inference service.
+To set up autoscaling for your inference service in standard deployments, you must install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then utilize various model runtime metrics available in OpenShift Monitoring, such as KVCache utilization, Time to First Token (TTFT), and Concurrency, to trigger autoscaling of your inference service.
 
 .Prerequisites
 * You have cluster administrator privileges for your {openshift-platform} cluster.
 * You have installed the CMA operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/nodes/automatically-scaling-pods-with-the-custom-metrics-autoscaler-operator#nodes-cma-autoscaling-custom-install[Installing the custom metrics autoscaler].
 +
 [NOTE]
 ====
-The `odh-controller` automatically creates the `TriggerAuthentication`, `ServiceAccount`, `Role`, `RoleBinding`, and `Secret` resources to allow CMA access to OpenShift Monitoring metrics.
+* You must configure the `KedaController` resource after installing the CMA operator.
+* The `odh-controller` automatically creates the `TriggerAuthentication`, `ServiceAccount`, `Role`, `RoleBinding`, and `Secret` resources to allow CMA access to OpenShift Monitoring metrics.
 ====
-* You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/monitoring/configuring-user-workload-monitoring[Configuring user workload monitoring].
+* You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/monitoring/configuring-user-workload-monitoring[Configuring user workload monitoring].
 * You have deployed a model on the single-model serving platform in standard deployment mode.
 
 .Procedure
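
The new `KedaController` prerequisite in this hunk is named but not shown. For orientation, a minimal sketch of that resource, assuming the CMA operator's default `openshift-keda` namespace and default component settings:

[source,yaml]
----
apiVersion: keda.sh/v1alpha1
kind: KedaController
metadata:
  name: keda                  # CMA expects a single instance with this name
  namespace: openshift-keda   # namespace used by the CMA operator
spec:
  watchNamespace: ''          # empty value watches all namespaces
  operator: {}
  metricsServer: {}
  serviceAccount: {}
----

The UWM prerequisite likewise maps to the standard cluster monitoring ConfigMap toggle:

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
----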
@@ -30,6 +31,12 @@ The `odh-controller` automatically creates the `TriggerAuthentication`, `Service
 +
 [source,yaml]
 ----
+kind: InferenceService
+metadata:
+  # ...
+  annotations:
+    # ...
+    serving.kserve.io/autoscalerClass: keda
 spec:
   predictor:
     # …
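
The hunk header above carries the note that the `odh-controller` generates the `TriggerAuthentication` and related resources. As an illustration only (the generated resource is not part of this commit), a KEDA `TriggerAuthentication` for bearer-token access to Prometheus typically has this shape; the namespace and Secret name are placeholders:

[source,yaml]
----
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: inference-prometheus-auth   # name referenced later in this diff
  namespace: <project-namespace>    # placeholder
spec:
  secretTargetRef:
  - parameter: bearerToken          # consumed by the Prometheus scaler's bearer auth mode
    name: <token-secret>            # placeholder for the Secret holding the token
    key: token
----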
@@ -41,16 +48,18 @@ spec:
         external:
           metric:
             backend: "prometheus"
-            serverAddress: "https://<thanos-service>.<monitoring-namespace>.svc.cluster.local:9091"
+            serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
             query: vllm:num_requests_waiting
           authenticationRef:
-            name: openshift-monitoring-metrics-auth
+            authModes: bearer
+            authenticationRef:
+              name: inference-prometheus-auth
           target:
             type: Value
             value: 2
 ----
 +
-The example configures the inference service to autoscale between 1-5 replicas based on the number of requests waiting to be processed, as determined by the `vllm:num_requests_waiting` metric.
+The example configuration sets up the inference service to autoscale between 1 and 5 replicas based on the number of requests waiting to be processed, as indicated by the `vllm:num_requests_waiting` metric.
 . Click *Save*.
 
 //[role="_additional-resources"]
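
Read together, the changed fragments imply an `InferenceService` along the following lines. This reconstruction is for orientation only: the `apiVersion`, resource name, the `minReplicas`/`maxReplicas` values (taken from the "1 and 5 replicas" sentence), and the `autoScaling.metrics` wrapper elided by the `# ...` lines are assumptions, not content of this commit:

[source,yaml]
----
apiVersion: serving.kserve.io/v1beta1     # assumed
kind: InferenceService
metadata:
  name: <inference-service-name>          # placeholder
  annotations:
    serving.kserve.io/autoscalerClass: keda
spec:
  predictor:
    minReplicas: 1                        # assumed from the 1-5 replica description
    maxReplicas: 5                        # assumed from the 1-5 replica description
    autoScaling:                          # assumed; elided by `# ...` in the diff
      metrics:
      - type: External
        external:
          metric:
            backend: "prometheus"
            serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
            query: vllm:num_requests_waiting
          authenticationRef:
            authModes: bearer
            authenticationRef:
              name: inference-prometheus-auth
          target:
            type: Value
            value: 2
----

With `autoscalerClass: keda`, KServe hands scaling off to KEDA, so once the service is saved you would expect a generated `ScaledObject` in the project; `oc get scaledobject` is one way to confirm it exists (verification step assumed, not shown in this commit).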
