Commit bc60175

Merge pull request #100573 from openshift-cherrypick-robot/cherry-pick-100332-to-enterprise-4.20
[enterprise-4.20] OSDOCS 16489 Add GPU usage with CMA docs -- NEEDED for 4.20!
2 parents 1c5c6f8 + adfba1e

File tree: 2 files changed, +41 -0 lines changed

modules/nodes-cma-autoscaling-custom-trigger-prom-gpu.adoc

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
// Module included in the following assemblies:
//
// * nodes/cma/nodes-cma-autoscaling-custom-trigger.adoc

:_mod-docs-content-type: CONCEPT
[id="nodes-cma-autoscaling-custom-trigger-prom-gpu_{context}"]
= Configuring GPU-based autoscaling with Prometheus and DCGM metrics

You can use the Custom Metrics Autoscaler with NVIDIA Data Center GPU Manager (DCGM) metrics to scale workloads based on GPU utilization. This is particularly useful for AI and machine learning workloads that require GPU resources.

.Example scaled object with a Prometheus target for GPU-based autoscaling
[source,yaml,options="nowrap"]
----
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: gpu-deployment
  minReplicaCount: 1 <1>
  maxReplicaCount: 5 <2>
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: my-namespace
      metricName: gpu_utilization
      threshold: '90' <3>
      query: SUM(DCGM_FI_DEV_GPU_UTIL{instance=~".+", gpu=~".+"}) <4>
      authModes: bearer
    authenticationRef:
      name: keda-trigger-auth-prometheus
----
<1> Specifies the minimum number of replicas to maintain. For GPU workloads, do not set this value to `0`, so that metrics continue to be collected.
<2> Specifies the maximum number of replicas allowed during scale-up operations.
<3> Specifies the GPU utilization threshold that triggers scaling. When the value returned by the query exceeds `90`, the autoscaler scales up the deployment.
<4> Specifies a Prometheus query that uses NVIDIA DCGM metrics to monitor GPU utilization across all GPU devices. The `DCGM_FI_DEV_GPU_UTIL` metric reports GPU utilization as a percentage.
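
After you create the scaled object, you can check that it was accepted and that scaling is wired up. The following is a minimal verification sketch, assuming the example above is saved as `gpu-scaledobject.yaml` (a hypothetical file name) and the Custom Metrics Autoscaler Operator is already installed in the cluster:

[source,terminal]
----
# Create the scaled object from the example manifest (hypothetical file name)
$ oc apply -f gpu-scaledobject.yaml

# Confirm that the scaled object exists and reports READY
$ oc get scaledobject gpu-scaledobject -n my-namespace

# The autoscaler manages scaling through a horizontal pod autoscaler (HPA),
# so an HPA for the scaled object should appear in the same namespace
$ oc get hpa -n my-namespace
----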

nodes/cma/nodes-cma-autoscaling-custom-trigger.adoc

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ You can configure a certificate authority xref:../../nodes/cma/nodes-cma-autosca
 // assemblies.
 
 include::modules/nodes-cma-autoscaling-custom-trigger-prom.adoc[leveloffset=+1]
+include::modules/nodes-cma-autoscaling-custom-trigger-prom-gpu.adoc[leveloffset=+2]
 include::modules/nodes-cma-autoscaling-custom-prometheus-config.adoc[leveloffset=+2]
 
 [role="_additional-resources"]
