Commit 72102e1

Merge pull request #52233 from StephenJamesSmith/TELCODOCS-620-GPU-CONSOLE
Telcodocs 620 gpu console: QE edits
2 parents 7b78452 + 27388e0 commit 72102e1

5 files changed: 224 additions & 0 deletions

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -2275,6 +2275,8 @@ Topics:
     File: managing-alerts
 - Name: Reviewing monitoring dashboards
     File: reviewing-monitoring-dashboards
+- Name: The NVIDIA GPU administration dashboard
+    File: nvidia-gpu-admin-dashboard
 - Name: Monitoring bare-metal events
     File: using-rfhe
 - Name: Accessing third-party monitoring APIs
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
// Module included in the following assemblies:
//
// * monitoring/nvidia-gpu-admin-dashboard.adoc

:_content-type: PROCEDURE
[id="nvidia-gpu-admin-dashboard-installing_{context}"]
= Installing the NVIDIA GPU administration dashboard

Install the OpenShift Console NVIDIA GPU plug-in by using Helm to add GPU capabilities to the OpenShift Container Platform (OCP) console.

The OpenShift Console NVIDIA GPU plug-in works as a remote bundle for the OCP console. To run the OpenShift Console NVIDIA GPU plug-in, an instance of the OCP console must be running.

.Prerequisites

* Red Hat OpenShift 4.11+
* NVIDIA GPU operator
* link:https://helm.sh/docs/intro/install/[Helm]
.Procedure

Use the following procedure to install the OpenShift Console NVIDIA GPU plug-in.

. Add the Helm repository:
+
[source,terminal]
----
$ helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
----
+
[source,terminal]
----
$ helm repo update
----
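+
Optionally, you can confirm that the repository was added before continuing, for example:
+
[source,terminal]
----
$ helm search repo rh-ecosystem-edge
----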

. Install the Helm chart in the default NVIDIA GPU operator namespace:
+
[source,terminal]
----
$ helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
----
+
.Example output
+
[source,terminal]
----
NAME: console-plugin-nvidia-gpu
LAST DEPLOYED: Tue Aug 23 15:37:35 2022
NAMESPACE: nvidia-gpu-operator
STATUS: deployed
REVISION: 1
NOTES:
View the Console Plugin NVIDIA GPU deployed resources by running the following command:

$ oc -n {{ .Release.Namespace }} get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu

Enable the plugin by running the following command:

# Check if a plugins field is specified
$ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"

# if not, then run the following command to enable the plugin
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge

# if yes, then run the following command to enable the plugin
$ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json

# add the required DCGM Exporter metrics ConfigMap to the existing NVIDIA operator ClusterPolicy CR:
oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge

----
+
The dashboard relies mostly on Prometheus metrics exposed by the NVIDIA DCGM Exporter, but the default exposed metrics are not enough for the dashboard to render the required gauges. Therefore, the DCGM exporter is configured to expose a custom set of metrics, as shown here.
+
[source,yaml]
----
apiVersion: v1
data:
  dcgm-metrics.csv: |
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
    DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
    DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
    DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
    DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
    DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
    DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
    DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
    DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
    DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: console-plugin-nvidia-gpu
    meta.helm.sh/release-namespace: nvidia-gpu-operator
  creationTimestamp: "2022-10-26T19:46:41Z"
  labels:
    app.kubernetes.io/component: console-plugin-nvidia-gpu
    app.kubernetes.io/instance: console-plugin-nvidia-gpu
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: console-plugin-nvidia-gpu
    app.kubernetes.io/part-of: console-plugin-nvidia-gpu
    app.kubernetes.io/version: latest
    helm.sh/chart: console-plugin-nvidia-gpu-0.2.3
  name: console-plugin-nvidia-gpu
  namespace: nvidia-gpu-operator
  resourceVersion: "19096623"
  uid: 96cdf700-dd27-437b-897d-5cbb1c255068
----
+
The ConfigMap is installed by the new version of the Console Plugin NVIDIA GPU Helm chart, but you must edit the NVIDIA operator ClusterPolicy CR yourself to reference the ConfigMap in the DCGM exporter configuration.
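+
For example, based on the patch command printed in the chart notes, you can update the ClusterPolicy CR from the command line. This assumes the default ClusterPolicy name `gpu-cluster-policy` shown in that output:
+
[source,terminal]
----
$ oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge
----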

. View the deployed resources:
+
[source,terminal]
----
$ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
----
+
.Example output
[source,terminal]
----
NAME                                              READY   STATUS    RESTARTS   AGE
pod/console-plugin-nvidia-gpu-7dc9cfb5df-ztksx    1/1     Running   0          2m6s

NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/console-plugin-nvidia-gpu    ClusterIP   172.30.240.138   <none>        9443/TCP   2m6s

NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console-plugin-nvidia-gpu    1/1     1            1           2m6s

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/console-plugin-nvidia-gpu-7dc9cfb5df    1         1         1       2m6s
----
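+
To confirm that the plug-in is enabled, you can also re-run the console query from the chart notes; the output should include `console-plugin-nvidia-gpu`:
+
[source,terminal]
----
$ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
----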
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
// Module included in the following assemblies:
//
// * monitoring/nvidia-gpu-admin-dashboard.adoc

:_content-type: CONCEPT
[id="nvidia-gpu-admin-dashboard-introduction_{context}"]
= Introduction

The OpenShift Console NVIDIA GPU plug-in is a dedicated administration dashboard for NVIDIA GPU usage visualization in the OpenShift Container Platform (OCP) Console. The visualizations in the administration dashboard provide guidance on how to best optimize GPU resources in clusters, such as when a GPU is under- or over-utilized.

The OpenShift Console NVIDIA GPU plug-in works as a remote bundle for the OCP console. To run the plug-in, the OCP console must be running.
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
// Module included in the following assemblies:
//
// * monitoring/nvidia-gpu-admin-dashboard.adoc

:_content-type: PROCEDURE
[id="nvidia-gpu-admin-dashboard-using_{context}"]
= Using the NVIDIA GPU administration dashboard

After deploying the OpenShift Console NVIDIA GPU plug-in, log in to the OpenShift Container Platform web console using your login credentials to access the *Administrator* perspective.

To view the changes, refresh the console to see the *GPUs* tab under *Compute*.

== Viewing the cluster GPU overview

You can view the status of your cluster GPUs in the Overview page by selecting *Overview* in the *Home* section.

The Overview page provides information about the cluster GPUs, including:

* Details about the GPU providers
* Status of the GPUs
* Cluster utilization of the GPUs

== Viewing the GPUs dashboard

You can view the NVIDIA GPU administration dashboard by selecting *GPUs* in the *Compute* section of the OpenShift Console.

Charts on the GPUs dashboard include:

* *GPU utilization*: Shows the ratio of time the graphics engine is active and is based on the ``DCGM_FI_PROF_GR_ENGINE_ACTIVE`` metric.

* *Memory utilization*: Shows the memory being used by the GPU and is based on the ``DCGM_FI_DEV_MEM_COPY_UTIL`` metric.

* *Encoder utilization*: Shows the video encoder rate of utilization and is based on the ``DCGM_FI_DEV_ENC_UTIL`` metric.

* *Decoder utilization*: Shows the video decoder rate of utilization and is based on the ``DCGM_FI_DEV_DEC_UTIL`` metric.

* *Power consumption*: Shows the average power usage of the GPU in Watts and is based on the ``DCGM_FI_DEV_POWER_USAGE`` metric.

* *GPU temperature*: Shows the current GPU temperature and is based on the ``DCGM_FI_DEV_GPU_TEMP`` metric. The maximum is set to ``110``, which is an empirical number, as the actual number is not exposed via a metric.

* *GPU clock speed*: Shows the average clock speed utilized by the GPU and is based on the ``DCGM_FI_DEV_SM_CLOCK`` metric.

* *Memory clock speed*: Shows the average clock speed utilized by memory and is based on the ``DCGM_FI_DEV_MEM_CLOCK`` metric.

== Viewing the GPU metrics

You can view the metrics for the GPUs by selecting the metric at the bottom of each GPU chart, which opens the Metrics page.

On the Metrics page, you can:

* Specify a refresh rate for the metrics
* Add, run, disable, and delete queries
* Insert metrics
* Reset the zoom view
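
For example, entering one of the metric names listed in the previous section as a query on the Metrics page, such as the graphics engine activity metric behind the GPU utilization chart, plots that metric for the cluster GPUs:

[source,text]
----
DCGM_FI_PROF_GR_ENGINE_ACTIVE
----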
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
:_content-type: ASSEMBLY
[id="nvidia-gpu-admin-dashboard"]
= The NVIDIA GPU administration dashboard
include::_attributes/common-attributes.adoc[]
:context: nvidia-gpu-admin-dashboard

toc::[]

include::modules/nvidia-gpu-admin-dashboard-introduction.adoc[leveloffset=+1]

include::modules/nvidia-gpu-admin-dashboard-installing.adoc[leveloffset=+1]

include::modules/nvidia-gpu-admin-dashboard-using.adoc[leveloffset=+1]
