|
| 1 | +// Module included in the following assemblies: |
| 2 | +// |
| 3 | +// * monitoring/nvidia-gpu-admin-dashboard.adoc |
| 4 | + |
| 5 | +:_content-type: PROCEDURE |
| 6 | +[id="nvidia-gpu-admin-dashboard-installing_{context}"] |
| 7 | += Installing the NVIDIA GPU administration dashboard |
| 8 | + |
| 9 | +Install the NVIDIA GPU plug-in by using Helm on the OpenShift Container Platform (OCP) Console to add GPU capabilities. |
| 10 | + |
| 11 | +The OpenShift Console NVIDIA GPU plug-in works as a remote bundle for the OCP console. To run the plug-in, |
| 12 | +an instance of the OCP console must be running. |
| 13 | + |
| 14 | + |
| 15 | +.Prerequisites |
| 16 | + |
| 17 | +* Red Hat OpenShift 4.11+ |
| 18 | +* NVIDIA GPU operator |
| 19 | +* link:https://helm.sh/docs/intro/install/[Helm] |
| 20 | +
|
| 21 | +
|
| 22 | +.Procedure |
| 23 | + |
| 24 | +Use the following procedure to install the OpenShift Console NVIDIA GPU plug-in. |
| 25 | + |
| 26 | +. Add the Helm repository: |
| 27 | ++ |
| 28 | +[source,terminal] |
| 29 | +---- |
| 30 | +$ helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu |
| 31 | +---- |
| 32 | ++ |
| 33 | +[source,terminal] |
| 34 | +---- |
| 35 | +$ helm repo update |
| 36 | +---- |
| 37 | + |
| 38 | +. Install the Helm chart in the default NVIDIA GPU operator namespace: |
| 39 | ++ |
| 40 | +[source,terminal] |
| 41 | +---- |
| 42 | +$ helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu |
| 43 | +---- |
| 44 | ++ |
| 45 | +.Example output |
| 46 | ++ |
| 47 | +[source,terminal] |
| 48 | +---- |
| 49 | +NAME: console-plugin-nvidia-gpu |
| 50 | +LAST DEPLOYED: Tue Aug 23 15:37:35 2022 |
| 51 | +NAMESPACE: nvidia-gpu-operator |
| 52 | +STATUS: deployed |
| 53 | +REVISION: 1 |
| 54 | +NOTES: |
| 55 | +View the Console Plugin NVIDIA GPU deployed resources by running the following command: |
| 56 | +
|
| 57 | +$ oc -n {{ .Release.Namespace }} get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu |
| 58 | +
|
| 59 | +Enable the plugin by running the following command: |
| 60 | +
|
| 61 | +# Check if a plugins field is specified |
| 62 | +$ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}" |
| 63 | +
|
| 64 | +# if not, then run the following command to enable the plugin |
| 65 | +$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge |
| 66 | +
|
| 67 | +# if yes, then run the following command to enable the plugin |
| 68 | +$ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json |
| 69 | +
|
| 70 | +# add the required DCGM Exporter metrics ConfigMap to the existing NVIDIA operator ClusterPolicy CR: |
| 71 | +oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge |
| 72 | +
|
| 73 | +---- |
| 74 | ++ |
| 75 | +The dashboard relies mostly on Prometheus metrics exposed by the NVIDIA DCGM Exporter, but the default set of exposed metrics is not sufficient for the dashboard to render the required gauges. Therefore, the DCGM Exporter is configured to expose a custom set of metrics, as shown here. |
| 76 | ++ |
| 77 | +[source,yaml] |
| 78 | +---- |
| 79 | +apiVersion: v1 |
| 80 | +data: |
| 81 | + dcgm-metrics.csv: | |
| 82 | + DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization. |
| 83 | + DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization. |
| 84 | + DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization. |
| 85 | + DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization. |
| 86 | + DCGM_FI_DEV_POWER_USAGE, gauge, power usage. |
| 87 | + DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit. |
| 88 | + DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp. |
| 89 | + DCGM_FI_DEV_SM_CLOCK, gauge, sm clock. |
| 90 | + DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock. |
| 91 | + DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock. |
| 92 | + DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock. |
| 93 | +kind: ConfigMap |
| 94 | +metadata: |
| 95 | + annotations: |
| 96 | + meta.helm.sh/release-name: console-plugin-nvidia-gpu |
| 97 | + meta.helm.sh/release-namespace: nvidia-gpu-operator |
| 98 | + creationTimestamp: "2022-10-26T19:46:41Z" |
| 99 | + labels: |
| 100 | + app.kubernetes.io/component: console-plugin-nvidia-gpu |
| 101 | + app.kubernetes.io/instance: console-plugin-nvidia-gpu |
| 102 | + app.kubernetes.io/managed-by: Helm |
| 103 | + app.kubernetes.io/name: console-plugin-nvidia-gpu |
| 104 | + app.kubernetes.io/part-of: console-plugin-nvidia-gpu |
| 105 | + app.kubernetes.io/version: latest |
| 106 | + helm.sh/chart: console-plugin-nvidia-gpu-0.2.3 |
| 107 | + name: console-plugin-nvidia-gpu |
| 108 | + namespace: nvidia-gpu-operator |
| 109 | + resourceVersion: "19096623" |
| 110 | + uid: 96cdf700-dd27-437b-897d-5cbb1c255068 |
| 111 | +---- |
| 112 | ++ |
| 113 | +The new version of the Console Plugin NVIDIA GPU Helm chart installs this ConfigMap. You must still edit the NVIDIA GPU Operator ClusterPolicy CR yourself to reference the ConfigMap in the DCGM Exporter configuration. |
| 114 | + |
| 115 | +. View the deployed resources: |
| 116 | ++ |
| 117 | +[source,terminal] |
| 118 | +---- |
| 119 | +$ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu |
| 120 | +---- |
| 121 | ++ |
| 122 | +.Example output |
| 123 | +[source,terminal] |
| 124 | +---- |
| 125 | +NAME READY STATUS RESTARTS AGE |
| 126 | +pod/console-plugin-nvidia-gpu-7dc9cfb5df-ztksx 1/1 Running 0 2m6s |
| 127 | +
|
| 128 | +NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE |
| 129 | +service/console-plugin-nvidia-gpu ClusterIP 172.30.240.138 <none> 9443/TCP 2m6s |
| 130 | +
|
| 131 | +NAME READY UP-TO-DATE AVAILABLE AGE |
| 132 | +deployment.apps/console-plugin-nvidia-gpu 1/1 1 1 2m6s |
| 133 | +
|
| 134 | +NAME DESIRED CURRENT READY AGE |
| 135 | +replicaset.apps/console-plugin-nvidia-gpu-7dc9cfb5df 1 1 1 2m6s |
| 136 | +---- |
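The Helm notes in the example output describe a manual decision: use a merge patch when `.spec.plugins` is not set, and a JSON patch when it is. The following is a minimal sketch of that decision as a script, not part of the chart or of the documented procedure. The function name `choose_patch_mode` and the `APPLY` opt-in variable are hypothetical, and the "already enabled" check is a plain substring match that is assumed to be good enough for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: pick the patch type for enabling the console plugin.
PLUGIN="console-plugin-nvidia-gpu"

# Decide how to enable the plugin from the current value of .spec.plugins.
# $1: output of the jsonpath query (empty when the field is not set).
# Prints "merge" when the field is unset, "json" when the field exists
# without the plugin, and "none" when the plugin name already appears in it.
choose_patch_mode() {
  local current="$1"
  if [ -z "$current" ]; then
    echo "merge"
  else
    case "$current" in
      *"$PLUGIN"*) echo "none" ;;   # rough substring check (assumption)
      *) echo "json" ;;
    esac
  fi
}

# Only touch the cluster when APPLY=1 is set (hypothetical opt-in switch).
if [ "${APPLY:-0}" = "1" ]; then
  current="$(oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}")"
  case "$(choose_patch_mode "$current")" in
    merge)
      oc patch consoles.operator.openshift.io cluster --type=merge \
        --patch "{ \"spec\": { \"plugins\": [\"${PLUGIN}\"] } }" ;;
    json)
      oc patch consoles.operator.openshift.io cluster --type=json \
        --patch "[{\"op\": \"add\", \"path\": \"/spec/plugins/-\", \"value\": \"${PLUGIN}\" }]" ;;
    none)
      echo "${PLUGIN} is already enabled" ;;
  esac
fi
```

Running the script without `APPLY=1` only defines the helper, so the decision logic can be checked before any change is made to the cluster. The two `oc patch` commands are the same ones shown in the Helm notes.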