openshift/gpu-operator-with-precompiled-drivers.rst
20 additions & 11 deletions
@@ -17,7 +17,7 @@ About Precompiled Driver Containers
 ***********************************

 By default, NVIDIA GPU drivers are built on the cluster nodes when you deploy the GPU Operator.
-Driver compilation and packaging is done on every Kubernetes node, which leads to bursts of compute demand, waste of resources, and long provisioning times.
+Driver compilation and packaging is done on every Kubernetes node, leading to bursts of compute demand, waste of resources, and long provisioning times.
 In contrast, using container images with precompiled drivers makes the drivers immediately available on all nodes, resulting in faster provisioning and cost savings in public cloud deployments.

 ***********************************
@@ -43,19 +43,19 @@ Perform the following steps to build a custom driver image for use with Red Hat
 .. rubric:: Prerequisites

-* You have access to a container registry, such as NVIDIA NGC Private Registry, Red Hat Quay, or the OpenShift internal container registry, and can push container images to the registry.
+* You have access to a container registry such as NVIDIA NGC Private Registry, Red Hat Quay, or the OpenShift internal container registry and can push container images to the registry.

 * You have a valid Red Hat subscription with an activation key.

 * You have a Red Hat OpenShift pull secret.

 * Your build machine has access to the internet to download operating system packages.

-* You know a CUDA version, such as ``12.1.0``, that you want to use.
+* You know a CUDA version such as ``12.1.0`` that you want to use.

-  One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry at `CUDA | NVIDIA NGC <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags>`_ and view the tags. Use the search field to filter the tags, such as ``base-ubi8`` for RHEL 8 and ``base-ubi9`` for RHEL 9. The filtered results show the CUDA versions, such as ``12.1.0``, ``12.0.1``, ``12.0.0``, and so on.
+  One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry at `CUDA | NVIDIA NGC <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags>`_ and view the tags. Use the search field to filter the tags such as ``base-ubi8`` for RHEL 8 and ``base-ubi9`` for RHEL 9. The filtered results show the CUDA versions such as ``12.1.0``, ``12.0.1``, and ``12.0.0``.

-* You know the GPU driver version, such as ``525.105.17``, that you want to use.
+* You know the GPU driver version such as ``525.105.17`` that you want to use.
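One illustrative way to list these tags from the command line instead of the web catalog (a sketch, assuming ``skopeo`` is installed on your workstation):

.. code-block:: console

   $ skopeo list-tags docker://nvcr.io/nvidia/cuda | grep base-ubi8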

 .. rubric:: Procedure
@@ -65,26 +65,26 @@ Perform the following steps to build a custom driver image for use with Red Hat
-#. Change the directory to ``rhel8/precompiled`` under the cloned repository. You can build precompiled driver images for versions 8 and 9 of RHEL from this directory:
+#. Change to the ``rhel8/precompiled`` directory under the cloned repository. You can build precompiled driver images for versions 8 and 9 of RHEL from this directory:

 .. code-block:: console

    $ cd driver/rhel8/precompiled

-#. Create a Red Hat Customer Portal Activation Key and note your Red Hat Subscription Management (RHSM) organization ID. These are to install packages during a build. Save the values to files, for example, ``$HOME/rhsm_org`` and ``$HOME/rhsm_activationkey``:
+#. Create a Red Hat Customer Portal Activation Key and note your Red Hat Subscription Management (RHSM) organization ID. These values are used to install packages during a build. Save the values to files such as ``$HOME/rhsm_org`` and ``$HOME/rhsm_activationkey``:
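A minimal sketch of saving these values to the files named above (the organization ID and activation key shown are placeholders):

.. code-block:: console

   $ echo "1234567" > $HOME/rhsm_org
   $ echo "my-activation-key" > $HOME/rhsm_activationkey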
-#. Download your Red Hat OpenShift pull secret and store it in a file, for example, ``${HOME}/pull-secret``:
+#. Download your Red Hat OpenShift pull secret and store it in a file such as ``${HOME}/pull-secret``:

 .. code-block:: console

    export PULL_SECRET_FILE=$HOME/pull-secret.txt

-#. Set the Red Hat OpenShift version and target architecture of your cluster, for example, ``x86_64``:
+#. Set the Red Hat OpenShift version and target architecture of your cluster such as ``x86_64``:

 .. code-block:: console
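   # Illustrative sketch only; the variable names expected by the build are
   # defined in the Makefile, so confirm them there before exporting.
   export OPENSHIFT_VERSION=4.12
   export TARGET_ARCH=x86_64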
@@ -121,15 +121,24 @@ Perform the following steps to build a custom driver image for use with Red Hat
    export DRIVER_VERSION=525.105.17
    export OS_TAG=rhcos4.12

+.. note:: The driver container image tag for OpenShift changed starting with the OCP 4.19 release.
+
+   - Before OCP 4.19: The driver image tag is formed with the suffix ``-rhcos4.17`` (such as with OCP 4.17).
+   - OCP 4.19 and later: The driver image tag is formed with the suffix ``-rhel9.6`` (such as with OCP 4.19).
+
+   Refer to `RHEL Versions Utilized by RHEL CoreOS and OCP <https://access.redhat.com/articles/6907891>`_
+   and `Split RHCOS into layers: /etc/os-release <https://github.com/openshift/enhancements/blob/master/enhancements/rhcos/split-rhcos-into-layers.md#etcos-release>`_
+   for more information.
+
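Based on the note above, the value of ``OS_TAG`` therefore differs by OpenShift release; the following exports are illustrative only:

.. code-block:: console

   # OCP releases before 4.19 use an RHCOS-style tag suffix, for example:
   export OS_TAG=rhcos4.17

   # OCP 4.19 and later use an RHEL-style tag suffix, for example:
   export OS_TAG=rhel9.6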
 #. Build and push the image:

 .. code-block:: console

    make image image-push

-Optionally, override the ``IMAGE_REGISTRY``, ``IMAGE_NAME``, and ``CONTAINER_TOOL``. You can also override ``BUILDER_USER`` and ``BUILDER_EMAIL`` if you want, otherwise your Git username and email are used. See the Makefile for all available variables.
+Optionally, override the ``IMAGE_REGISTRY``, ``IMAGE_NAME``, and ``CONTAINER_TOOL``. You can also override ``BUILDER_USER`` and ``BUILDER_EMAIL`` if you want. Otherwise, your Git username and email are used. Refer to the Makefile for all available variables.
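For example, a sketch of passing overrides on the ``make`` command line (the registry and image name shown are placeholders):

.. code-block:: console

   $ make image image-push \
       IMAGE_REGISTRY=quay.io/example-org \
       IMAGE_NAME=nvidia-gpu-driver \
       CONTAINER_TOOL=podman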

-.. note:: Do not set the ``DRIVER_TYPE``. The only supported value is currently ``passthrough``, which is set by default.
+.. note:: Do not set the ``DRIVER_TYPE``. The only supported value is currently ``passthrough``, and this is set by default.
openshift/install-gpu-ocp.rst
17 additions & 16 deletions
@@ -13,18 +13,19 @@ Installing the NVIDIA GPU Operator by using the web console
 #. In the OpenShift Container Platform web console, from the side menu, navigate to **Operators** > **OperatorHub** and select **All Projects**.

+#. In **Operators** > **OperatorHub**, search for the **NVIDIA GPU Operator**. For additional information, refer to the `Red Hat OpenShift Container Platform documentation <https://docs.openshift.com/container-platform/latest/operators/admin/olm-adding-operators-to-cluster.html>`_.

 #. Select the **NVIDIA GPU Operator**, click **Install**. In the following screen, click **Install**.

    .. note:: Here, you can select the namespace where you want to deploy the GPU Operator. The suggested namespace to use is the ``nvidia-gpu-operator``. You can choose any existing namespace or create a new namespace under **Select a Namespace**.

-   If you install in any other namespace other than ``nvidia-gpu-operator``, the GPU Operator will **not** automatically enable namespace monitoring, and metrics and alerts will **not** be collected by Prometheus.
-   If only trusted operators are installed in this namespace, you can manually enable namespace monitoring with this command:
+   If you install in a namespace other than ``nvidia-gpu-operator``, the GPU Operator does **not** automatically enable namespace monitoring, and metrics and alerts are **not** collected by Prometheus.
+   If only trusted operators are installed in this namespace, you can manually enable namespace monitoring with this command:
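A label command along the following lines enables monitoring for the namespace (illustrative sketch; ``$NAMESPACE`` stands for the namespace where you installed the Operator):

.. code-block:: console

   $ oc label ns/$NAMESPACE openshift.io/cluster-monitoring=true --overwrite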
 Proceed to :ref:`Create the cluster policy for the NVIDIA GPU Operator <create-cluster-policy>`.
@@ -198,7 +199,7 @@ When you install the **NVIDIA GPU Operator** in the OpenShift Container Platform
 .. note:: If you create a ClusterPolicy that contains an empty specification such as ``spec{}``, the ClusterPolicy fails to deploy.

 As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform CLI or the web console. Also, these steps differ
-when using **NVIDIA vGPU**. Refer to the appropriate sections that follow.
+when using **NVIDIA vGPU**. Refer to the appropriate sections below.

 .. _create-cluster-policy-web-console:
@@ -209,7 +210,7 @@ Create the cluster policy using the web console
 #. Select the **ClusterPolicy** tab, then click **Create ClusterPolicy**. The platform assigns the default name *gpu-cluster-policy*.

-   .. note:: You can use this screen to customize the ClusterPolicy; although, the default values are sufficient to get the GPU configured and running in most cases.
+   .. note:: You can use this screen to customize the ClusterPolicy. However, the default values are sufficient to get the GPU configured and running in most cases.

    .. note:: For OpenShift 4.12 with GPU Operator 25.3.1 or later, you must expand the **Driver** section and set the following fields:
@@ -219,7 +220,7 @@ Create the cluster policy using the web console
 #. Click **Create**.

-   At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10-20 minutes before digging deeper into any form of troubleshooting because this may take a period of time to finish.
+   At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10 to 20 minutes before troubleshooting because this process can take some time to finish.

 #. The status of the newly deployed ClusterPolicy *gpu-cluster-policy* for the NVIDIA GPU Operator changes to ``State:ready`` when the installation succeeds.
@@ -237,7 +238,7 @@ Create the cluster policy using the CLI
-.. note:: For OpenShift 4.12 with GPU Operator 25.3.1 or later, modify the clusterpolicy.json file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version``, and ``driver.imagePullSecrets`` (optional). The following snippet is shown as an example. Change values accordingly. Refer to :ref:`operator-release-notes` for recommended driver versions.
+.. note:: For OpenShift 4.12 with GPU Operator 25.3.1 or later, modify the ``clusterpolicy.json`` file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version``, and ``driver.imagePullSecrets`` (optional). The following snippet is shown as an example. Change values accordingly. Refer to :ref:`operator-release-notes` for recommended driver versions.

 .. code-block:: json
@@ -275,13 +276,13 @@ Create the cluster policy using the web console
 .. image:: graphics/cluster_policy_vgpu_1.png

-#. Specify ``repository`` path, ``image`` name and NVIDIA vGPU driver ``version`` bundled under **Driver** section. If the registry is not public, please specify the ``imagePullSecret`` created during pre-requisite step under **Driver** advanced configurations section.
+#. Specify the ``repository`` path, ``image`` name, and NVIDIA vGPU driver ``version`` bundled under the **Driver** section. If the registry is not public, specify the ``imagePullSecret`` created during the prerequisite step under the **Driver** advanced configurations section.

 .. image:: graphics/cluster_policy_vgpu_2.png

 #. Click **Create**.

-   At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10-20 minutes before digging deeper into any form of troubleshooting because this may take a period of time to finish.
+   At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10 to 20 minutes before troubleshooting because this process can take some time to finish.

 #. The status of the newly deployed ClusterPolicy *gpu-cluster-policy* for the NVIDIA GPU Operator changes to ``State:ready`` when the installation succeeds.
@@ -297,7 +298,7 @@ Create the cluster policy using the CLI
-Modify clusterpolicy.json file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version`` and ``driver.imagePullSecrets`` created during pre-requiste steps. Below snippet is shown as an example, please change values accordingly.
+Modify the ``clusterpolicy.json`` file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version``, and ``driver.imagePullSecrets`` created during the prerequisite steps. The following snippet is shown as an example. Change values accordingly.

 .. code-block:: json
@@ -372,7 +373,7 @@ The GPU Operator generates GPU performance metrics (DCGM-export), status metrics
 When the GPU Operator is installed in the suggested ``nvidia-gpu-operator`` namespace, the GPU Operator automatically enables monitoring if the ``openshift.io/cluster-monitoring`` label is not defined.
 If the label is defined, the GPU Operator will not change its value.

-Disable cluster monitoring in the ``nvidia-gpu-operator`` namespace by setting ``openshift.io/cluster-monitoring=false`` as shown:
+Disable cluster monitoring in the ``nvidia-gpu-operator`` namespace by setting ``openshift.io/cluster-monitoring=false``:

 .. code-block:: console
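   # Illustrative sketch only; verify the exact command against the published guide.
   $ oc label ns/nvidia-gpu-operator openshift.io/cluster-monitoring=false --overwrite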
@@ -459,7 +460,7 @@ Run a simple CUDA VectorAdd sample that adds two vectors together to ensure the
-The ``nvidia-smi`` shows memory usage, GPU utilization, and the temperature of the GPU. Test the GPU access by running the popular ``nvidia-smi`` command within the pod.
+The ``nvidia-smi`` command shows memory usage, GPU utilization, and the temperature of the GPU. Test the GPU access by running the popular ``nvidia-smi`` command within the pod.

 To view GPU utilization, run ``nvidia-smi`` from a pod in the GPU Operator daemonset.
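One illustrative way to do this from the CLI (a sketch; the pod label ``app=nvidia-driver-daemonset`` and the namespace are assumptions, so adjust them for your cluster):

.. code-block:: console

   $ POD=$(oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o name | head -n 1)
   $ oc exec -it -n nvidia-gpu-operator $POD -- nvidia-smi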
@@ -481,7 +482,7 @@ To view GPU utilization, run ``nvidia-smi`` from a pod in the GPU Operator daemo
-Two tables are generated. The first table reflects the information about all available GPUs (the example shows one GPU). The second table provides details on the processes using the GPUs.
+Two tables are generated. The first table reflects the information about all available GPUs (the example shows one GPU). The second table provides details about the processes using the GPUs.

-For more information describing the contents of the tables see the man page for ``nvidia-smi``.
+For more information describing the contents of the tables, refer to the man page for ``nvidia-smi``.
-#. When the Node Feature Discovery is installed, create an instance of Node Feature Discovery using the **NodeFeatureDiscovery** tab.
+#. When the Node Feature Discovery is installed, create an instance of Node Feature Discovery using the **NodeFeatureDiscovery** tab:

 #. Click **Operators** > **Installed Operators** from the side menu.
@@ -38,7 +38,7 @@ The Node Feature Discovery (NFD) Operator is a prerequisite for the **NVIDIA GPU
 #. In the following screen, click **Create**. This starts the Node Feature Discovery Operator that proceeds to label the nodes in the cluster that have GPUs.

-   .. note:: The values prepopulated by the OperatorHub are valid for the GPU Operator.
+   .. note:: The values prepopulated by the OperatorHub are valid for the GPU Operator.
openshift/introduction.rst
1 addition & 1 deletion
@@ -17,7 +17,7 @@ Red Hat OpenShift Container Platform includes enhancements to Kubernetes so user
 The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA),
 Kubernetes device plugin for GPUs, the `NVIDIA Container Toolkit <https://github.com/NVIDIA/nvidia-container-toolkit>`_,
-automatic node labeling using `GFD <https://github.com/NVIDIA/gpu-feature-discovery>`_, `DCGM <https://developer.nvidia.com/dcgm>`_-based monitoring and others.
+automatic node labeling using `GFD <https://github.com/NVIDIA/gpu-feature-discovery>`_, `DCGM <https://developer.nvidia.com/dcgm>`_-based monitoring, and others.

 For guidance on the specific NVIDIA support entitlement needs,
 refer |essug|_ if you have an NVIDIA AI Enterprise entitlement.