Commit 45de92a

Merge pull request #893 from syaseen-rh/RHOAIENG-25115
RHOAIENG-25115: Adding KEDA feature
2 parents c4284c5 + ce3ed04 commit 45de92a

File tree

2 files changed: +74 -1 lines changed

assemblies/serving-large-models.adoc

Lines changed: 1 addition & 1 deletion

@@ -45,7 +45,7 @@ include::modules/viewing-performance-metrics-for-deployed-model.adoc[leveloffset
 include::modules/deploying-a-grafana-metrics-dashboard.adoc[leveloffset=+2]
 include::modules/deploying-vllm-gpu-metrics-dashboard-grafana.adoc[leveloffset=+2]
 include::modules/ref-grafana-metrics.adoc[leveloffset=+2]
-
+include::modules/configuring-metric-based-autoscaling.adoc[leveloffset=+2]
 == Optimizing model-serving runtimes
 
 You can optionally enhance the preinstalled model-serving runtimes available in {productname-short} to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.
modules/configuring-metric-based-autoscaling.adoc

Lines changed: 73 additions & 0 deletions

@@ -0,0 +1,73 @@
:_module-type: PROCEDURE

[id="configuring-metrics-based-autoscaling_{context}"]
= Configuring metrics-based autoscaling

[role="_abstract"]
Knative-based autoscaling is not available in standard deployment mode. However, you can enable metrics-based autoscaling for an inference service in standard deployment mode. Metrics-based autoscaling helps you efficiently manage accelerator resources, lower operational costs, and ensure that your inference services meet performance requirements.

To set up autoscaling for your inference service in standard deployments, install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then use various model runtime metrics available in OpenShift Monitoring, such as KV cache utilization, time to first token (TTFT), and concurrency, to trigger autoscaling of your inference service.

.Prerequisites
* You have cluster administrator privileges for your {openshift-platform} cluster.
* You have installed the CMA operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/nodes/automatically-scaling-pods-with-the-custom-metrics-autoscaler-operator#nodes-cma-autoscaling-custom-install[Installing the custom metrics autoscaler].
+
[NOTE]
====
* You must configure the `KedaController` resource after installing the CMA operator. A minimal example is shown after this list.
* The `odh-controller` automatically creates the `TriggerAuthentication`, `ServiceAccount`, `Role`, `RoleBinding`, and `Secret` resources to allow CMA access to OpenShift Monitoring metrics.
====
* You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/monitoring/configuring-user-workload-monitoring[Configuring user workload monitoring].
* You have deployed a model on the single-model serving platform in standard deployment mode.
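
For reference, a minimal `KedaController` resource similar to the following sketch enables CMA after the operator is installed. The resource name `keda` and the `openshift-keda` namespace follow the CMA operator defaults; the empty `watchNamespace` value, which lets KEDA watch all namespaces, is an assumption that you can scope down for your cluster:

[source,yaml]
----
apiVersion: keda.sh/v1alpha1
kind: KedaController
metadata:
  name: keda                  # CMA expects this exact resource name
  namespace: openshift-keda   # namespace created by the CMA operator
spec:
  watchNamespace: ""          # watch all namespaces (assumption; restrict as needed)
----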

.Procedure

. Log in to the {openshift-platform} console as a cluster administrator.
. In the *Administrator* perspective, click *Home* -> *Search*.
. Select the project where you have deployed your model.
. From the *Resources* dropdown menu, select *InferenceService*.
. Click the `InferenceService` for your deployed model and then click *YAML*.
. Under `spec.predictor`, define a metric-based autoscaling policy similar to the following example:
+
[source,yaml]
----
kind: InferenceService
metadata:
  name: my-inference-service
  namespace: my-namespace
  annotations:
    serving.kserve.io/autoscalerClass: keda
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoscaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
              query: vllm:num_requests_waiting
            authenticationRef:
              name: inference-prometheus-auth
            authModes: bearer
            target:
              type: Value
              value: 2
----
+
The example configuration sets up the inference service to autoscale between 1 and 5 replicas based on the number of requests waiting to be processed, as indicated by the `vllm:num_requests_waiting` metric. A sketch of an alternative trigger based on KV cache utilization follows this procedure.
. Click *Save*.
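
The following sketch shows the same autoscaling policy rewritten to scale on KV cache utilization instead of queued requests. The `vllm:gpu_cache_usage_perc` metric name and the `0.8` threshold are assumptions based on the Prometheus metrics that vLLM typically exposes; verify the metric names that your model-serving runtime reports in OpenShift Monitoring before using them:

[source,yaml]
----
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoscaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
              query: vllm:gpu_cache_usage_perc   # assumed vLLM KV cache usage gauge (0-1)
            authenticationRef:
              name: inference-prometheus-auth
            authModes: bearer
            target:
              type: Value
              value: 0.8                         # assumed threshold: scale out above 80% KV cache usage
----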

.Verification

* Confirm that the KEDA `ScaledObject` resource is created:
+
[source,console]
----
oc get scaledobject -n <namespace>
----

//[role="_additional-resources"]
//.Additional resources
