Commit cc01e96

Andrew Chen authored
Add driver container image tag note for OCP 4.19+ (#245)
* add driver container image tag note for OCP 4.19+
* formatting and style guide fixes
* add reference links to RH docs
* accept suggestions

Signed-off-by: Andrew Chen <[email protected]>
1 parent 8be04f1 commit cc01e96

File tree

6 files changed: +46, -36 lines


openshift/appendix-ocp.rst

Lines changed: 1 addition & 1 deletion
@@ -63,5 +63,5 @@ For additional troubleshooting resources:
  * `Node Feature Discovery documentation <https://kubernetes-sigs.github.io/node-feature-discovery/>`_.
  * `Red Hat Node Feature Discovery Operator documentation <https://docs.openshift.com/container-platform/latest/hardware_enablement/psap-node-feature-discovery-operator.html>`_
  * `OpenShift Driver Toolkit documentation <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/specialized_hardware_and_driver_enablement/driver-toolkit>`_
- * `OpenShift Driver Toolkit GihHub repository <https://github.com/openshift/driver-toolkit/>`_
+ * `OpenShift Driver Toolkit GitHub repository <https://github.com/openshift/driver-toolkit/>`_
  * `OpenShift troubleshooting guide <https://docs.openshift.com/container-platform/latest/support/troubleshooting/>`_

openshift/gpu-operator-with-precompiled-drivers.rst

Lines changed: 20 additions & 11 deletions
@@ -17,7 +17,7 @@ About Precompiled Driver Containers
  ***********************************

  By default, NVIDIA GPU drivers are built on the cluster nodes when you deploy the GPU Operator.
- Driver compilation and packaging is done on every Kubernetes node, which leads to bursts of compute demand, waste of resources, and long provisioning times.
+ Driver compilation and packaging is done on every Kubernetes node, leading to bursts of compute demand, waste of resources, and long provisioning times.
  In contrast, using container images with precompiled drivers makes the drivers immediately available on all nodes, resulting in faster provisioning and cost savings in public cloud deployments.

  ***********************************
@@ -43,19 +43,19 @@ Perform the following steps to build a custom driver image for use with Red Hat
  .. rubric:: Prerequisites

- * You have access to a container registry, such as NVIDIA NGC Private Registry, Red Hat Quay, or the OpenShift internal container registry, and can push container images to the registry.
+ * You have access to a container registry such as NVIDIA NGC Private Registry, Red Hat Quay, or the OpenShift internal container registry and can push container images to the registry.

  * You have a valid Red Hat subscription with an activation key.

  * You have a Red Hat OpenShift pull secret.

  * Your build machine has access to the internet to download operating system packages.

- * You know a CUDA version, such as ``12.1.0``, that you want to use.
+ * You know a CUDA version such as ``12.1.0`` that you want to use.

-   One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry at `CUDA | NVIDIA NGC <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags>`_ and view the tags. Use the search field to filter the tags, such as ``base-ubi8`` for RHEL 8 and ``base-ubi9`` for RHEL 9. The filtered results show the CUDA versions, such as ``12.1.0``, ``12.0.1``, ``12.0.0``, and so on.
+   One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry at `CUDA | NVIDIA NGC <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags>`_ and view the tags. Use the search field to filter the tags such as ``base-ubi8`` for RHEL 8 and ``base-ubi9`` for RHEL 9. The filtered results show the CUDA versions such as ``12.1.0``, ``12.0.1``, and ``12.0.0``.
- * You know the GPU driver version, such as ``525.105.17``, that you want to use.
+ * You know the GPU driver version such as ``525.105.17`` that you want to use.
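Regarding the CUDA tag lookup mentioned above, the tags can also be listed from a terminal. A sketch, assuming ``skopeo`` is installed:

.. code-block:: console

   # Sketch: lists available CUDA image tags from the NGC registry,
   # filtered to the RHEL 8 base images.
   $ skopeo list-tags docker://nvcr.io/nvidia/cuda | grep base-ubi8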
  .. rubric:: Procedure

@@ -65,26 +65,26 @@ Perform the following steps to build a custom driver image for use with Red Hat
        $ git clone https://gitlab.com/nvidia/container-images/driver

- #. Change the directory to ``rhel8/precompiled`` under the cloned repository. You can build precompiled driver images for versions 8 and 9 of RHEL from this directory:
+ #. Change to the ``rhel8/precompiled`` directory under the cloned repository. You can build precompiled driver images for versions 8 and 9 of RHEL from this directory:

     .. code-block:: console

        $ cd driver/rhel8/precompiled

- #. Create a Red Hat Customer Portal Activation Key and note your Red Hat Subscription Management (RHSM) organization ID. These are to install packages during a build. Save the values to files, for example, ``$HOME/rhsm_org`` and ``$HOME/rhsm_activationkey``:
+ #. Create a Red Hat Customer Portal Activation Key and note your Red Hat Subscription Management (RHSM) organization ID. These are used to install packages during a build. Save the values to files such as ``$HOME/rhsm_org`` and ``$HOME/rhsm_activationkey``:

     .. code-block:: console

        export RHSM_ORG_FILE=$HOME/rhsm_org
        export RHSM_ACTIVATIONKEY_FILE=$HOME/rhsm_activationkey
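For context, populating those files might look like the following sketch; the organization ID and activation key shown are hypothetical placeholders:

.. code-block:: console

   # Hypothetical values; substitute your own RHSM organization ID and activation key.
   $ echo "0123456" > $HOME/rhsm_org
   $ echo "example-activation-key" > $HOME/rhsm_activationkey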
- #. Download your Red Hat OpenShift pull secret and store it in a file, for example, ``${HOME}/pull-secret``:
+ #. Download your Red Hat OpenShift pull secret and store it in a file such as ``${HOME}/pull-secret.txt``:

     .. code-block:: console

        export PULL_SECRET_FILE=$HOME/pull-secret.txt

- #. Set the Red Hat OpenShift version and target architecture of your cluster, for example, ``x86_64``:
+ #. Set the Red Hat OpenShift version and target architecture of your cluster such as ``x86_64``:

     .. code-block:: console
@@ -121,15 +121,24 @@ Perform the following steps to build a custom driver image for use with Red Hat
        export DRIVER_VERSION=525.105.17
        export OS_TAG=rhcos4.12

+    .. note:: The driver container image tag for OpenShift changed with the OCP 4.19 release.
+
+       - Before OCP 4.19: The driver image tag is formed with the suffix ``-rhcos4.17`` (such as with OCP 4.17).
+       - Starting with OCP 4.19: The driver image tag is formed with the suffix ``-rhel9.6`` (such as with OCP 4.19).
+
+       Refer to `RHEL Versions Utilized by RHEL CoreOS and OCP <https://access.redhat.com/articles/6907891>`_
+       and `Split RHCOS into layers: /etc/os-release <https://github.com/openshift/enhancements/blob/master/enhancements/rhcos/split-rhcos-into-layers.md#etcos-release>`_
+       for more information.
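Assuming the tag suffix corresponds to the ``OS_TAG`` variable set above (an inference from this hunk, not stated explicitly), the setting would change along these lines:

.. code-block:: console

   # Sketch; version values are illustrative only.
   # Before OCP 4.19 (for example, OCP 4.17):
   export OS_TAG=rhcos4.17
   # OCP 4.19 and later:
   export OS_TAG=rhel9.6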
  #. Build and push the image:

     .. code-block:: console

        make image image-push

- Optionally, override the ``IMAGE_REGISTRY``, ``IMAGE_NAME``, and ``CONTAINER_TOOL``. You can also override ``BUILDER_USER`` and ``BUILDER_EMAIL`` if you want, otherwise your Git username and email are used. See the Makefile for all available variables.
+ Optionally, override the ``IMAGE_REGISTRY``, ``IMAGE_NAME``, and ``CONTAINER_TOOL``. You can also override ``BUILDER_USER`` and ``BUILDER_EMAIL`` if you want. Otherwise, your Git username and email are used. Refer to the Makefile for all available variables.
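For illustration, a hypothetical invocation overriding those variables; the registry, image name, and identity values are placeholders:

.. code-block:: console

   # Placeholder values for illustration only.
   make image image-push \
       IMAGE_REGISTRY=quay.io/example-org \
       IMAGE_NAME=nvidia-gpu-driver \
       CONTAINER_TOOL=podman \
       BUILDER_USER="Jane Doe" \
       BUILDER_EMAIL="jane.doe@example.com"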

- .. note:: Do not set the ``DRIVER_TYPE``. The only supported value is currently ``passthrough``, which is set by default.
+ .. note:: Do not set the ``DRIVER_TYPE``. The only supported value is currently ``passthrough``, and this is set by default.

  *********************************************
  Enabling Precompiled Driver Container Support

openshift/install-gpu-ocp.rst

Lines changed: 17 additions & 16 deletions
@@ -13,18 +13,19 @@ Installing the NVIDIA GPU Operator by using the web console
  #. In the OpenShift Container Platform web console, from the side menu, navigate to **Operators** > **OperatorHub** and select **All Projects**.

+ #. In **Operators** > **OperatorHub**, search for the **NVIDIA GPU Operator**. For additional information, refer to the `Red Hat OpenShift Container Platform documentation <https://docs.openshift.com/container-platform/latest/operators/admin/olm-adding-operators-to-cluster.html>`_.

  #. Select the **NVIDIA GPU Operator**, click **Install**. In the following screen, click **Install**.

     .. note:: Here, you can select the namespace where you want to deploy the GPU Operator. The suggested namespace to use is the ``nvidia-gpu-operator``. You can choose any existing namespace or create a new namespace under **Select a Namespace**.

-       If you install in any other namespace other than ``nvidia-gpu-operator``, the GPU Operator will **not** automatically enable namespace monitoring, and metrics and alerts will **not** be collected by Prometheus.
-       If only trusted operators are installed in this namespace, you can manually enable namespace monitoring with this command:
+       If you install in any namespace other than ``nvidia-gpu-operator``, the GPU Operator does **not** automatically enable namespace monitoring, and metrics and alerts are **not** collected by Prometheus.
+       If only trusted operators are installed in this namespace, you can manually enable namespace monitoring with this command:

-       .. code-block:: console
+       .. code-block:: console

-          $ oc label ns/$NAMESPACE_NAME openshift.io/cluster-monitoring=true
+          $ oc label ns/$NAMESPACE_NAME openshift.io/cluster-monitoring=true

  Proceed to :ref:`Create the cluster policy for the NVIDIA GPU Operator <create-cluster-policy>`.

@@ -198,7 +199,7 @@ When you install the **NVIDIA GPU Operator** in the OpenShift Container Platform
  .. note:: If you create a ClusterPolicy that contains an empty specification such as ``spec{}``, the ClusterPolicy fails to deploy.

  As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform CLI or the web console. Also, these steps differ
- when using **NVIDIA vGPU**. Refer to the appropriate sections that follow.
+ when using **NVIDIA vGPU**. Refer to the appropriate sections below.

  .. _create-cluster-policy-web-console:

@@ -209,7 +210,7 @@ Create the cluster policy using the web console
  #. Select the **ClusterPolicy** tab, then click **Create ClusterPolicy**. The platform assigns the default name *gpu-cluster-policy*.

-    .. note:: You can use this screen to customize the ClusterPolicy; although, the default values are sufficient to get the GPU configured and running in most cases.
+    .. note:: You can use this screen to customize the ClusterPolicy. However, the default values are sufficient to get the GPU configured and running in most cases.

     .. note:: For OpenShift 4.12 with GPU Operator 25.3.1 or later, you must expand the **Driver** section and set the following fields:

@@ -219,7 +220,7 @@ Create the cluster policy using the web console
  #. Click **Create**.

-    At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10-20 minutes before digging deeper into any form of troubleshooting because this may take a period of time to finish.
+    At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10 to 20 minutes before troubleshooting because this process can take some time to finish.

  #. The status of the newly deployed ClusterPolicy *gpu-cluster-policy* for the NVIDIA GPU Operator changes to ``State:ready`` when the installation succeeds.
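One way to confirm the state from the CLI, as a sketch; this assumes the default policy name *gpu-cluster-policy* and that the ClusterPolicy resource reports ``status.state``:

.. code-block:: console

   # Sketch: prints the ClusterPolicy state; "ready" indicates success.
   $ oc get clusterpolicies.nvidia.com gpu-cluster-policy -o jsonpath='{.status.state}'
   ready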

@@ -237,7 +238,7 @@ Create the cluster policy using the CLI
        $ oc get csv -n nvidia-gpu-operator gpu-operator-certified.v22.9.0 -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json

-    .. note:: For OpenShift 4.12 with GPU Operator 25.3.1 or later, modify the clusterpolicy.json file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version``, and ``driver.imagePullSecrets`` (optional). The following snippet is shown as an example. Change values accordingly. Refer to :ref:`operator-release-notes` for recommended driver versions.
+    .. note:: For OpenShift 4.12 with GPU Operator 25.3.1 or later, modify the ``clusterpolicy.json`` file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version``, and ``driver.imagePullSecrets`` (optional). The following snippet is shown as an example. Change values accordingly. Refer to :ref:`operator-release-notes` for recommended driver versions.

     .. code-block:: json
@@ -275,13 +276,13 @@ Create the cluster policy using the web console
     .. image:: graphics/cluster_policy_vgpu_1.png

- #. Specify ``repository`` path, ``image`` name and NVIDIA vGPU driver ``version`` bundled under **Driver** section. If the registry is not public, please specify the ``imagePullSecret`` created during pre-requisite step under **Driver** advanced configurations section.
+ #. Specify the ``repository`` path, ``image`` name, and NVIDIA vGPU driver ``version`` bundled under the **Driver** section. If the registry is not public, specify the ``imagePullSecret`` created during the prerequisite step under the **Driver** advanced configurations section.

     .. image:: graphics/cluster_policy_vgpu_2.png

  #. Click **Create**.

-    At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10-20 minutes before digging deeper into any form of troubleshooting because this may take a period of time to finish.
+    At this point, the GPU Operator proceeds and installs all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. Wait at least 10 to 20 minutes before troubleshooting because this process can take some time to finish.

  #. The status of the newly deployed ClusterPolicy *gpu-cluster-policy* for the NVIDIA GPU Operator changes to ``State:ready`` when the installation succeeds.

@@ -297,7 +298,7 @@ Create the cluster policy using the CLI
        $ oc get csv -n nvidia-gpu-operator gpu-operator-certified.v22.9.0 -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json

- Modify clusterpolicy.json file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version`` and ``driver.imagePullSecrets`` created during pre-requiste steps. Below snippet is shown as an example, please change values accordingly.
+ Modify the ``clusterpolicy.json`` file to specify ``driver.licensingConfig``, ``driver.repository``, ``driver.image``, ``driver.version``, and ``driver.imagePullSecrets`` created during the prerequisite steps. The following snippet is shown as an example. Change values accordingly.

  .. code-block:: json
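The JSON body itself falls outside this hunk. A minimal hypothetical sketch of the fields named above, with placeholder values only (the ``licensingConfig`` sub-fields shown are the commonly documented ``configMapName`` and ``nlsEnabled``), might look like:

.. code-block:: json

   {
     "driver": {
       "repository": "registry.example.com/nvidia",
       "image": "vgpu-guest-driver",
       "version": "525.105.17",
       "imagePullSecrets": ["example-registry-secret"],
       "licensingConfig": {
         "configMapName": "licensing-config",
         "nlsEnabled": true
       }
     }
   }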
@@ -372,7 +373,7 @@ The GPU Operator generates GPU performance metrics (DCGM-export), status metrics
  When the GPU Operator is installed in the suggested ``nvidia-gpu-operator`` namespace, the GPU Operator automatically enables monitoring if the ``openshift.io/cluster-monitoring`` label is not defined.
  If the label is defined, the GPU Operator will not change its value.

- Disable cluster monitoring in the ``nvidia-gpu-operator`` namespace by setting ``openshift.io/cluster-monitoring=false`` as shown:
+ Disable cluster monitoring in the ``nvidia-gpu-operator`` namespace by setting ``openshift.io/cluster-monitoring=false``:

  .. code-block:: console
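The command body lies outside this hunk; based on the analogous labeling command earlier in this file, it presumably resembles this sketch:

.. code-block:: console

   # Sketch modeled on the cluster-monitoring=true command shown earlier.
   $ oc label ns/nvidia-gpu-operator openshift.io/cluster-monitoring=false --overwrite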
@@ -459,7 +460,7 @@ Run a simple CUDA VectorAdd sample that adds two vectors together to ensure the
  Getting information about the GPU
  *************************************************************

- The ``nvidia-smi`` shows memory usage, GPU utilization, and the temperature of the GPU. Test the GPU access by running the popular ``nvidia-smi`` command within the pod.
+ The ``nvidia-smi`` command shows memory usage, GPU utilization, and the temperature of the GPU. Test GPU access by running the ``nvidia-smi`` command within the pod.

  To view GPU utilization, run ``nvidia-smi`` from a pod in the GPU Operator daemonset.

@@ -481,7 +482,7 @@ To view GPU utilization, run ``nvidia-smi`` from a pod in the GPU Operator daemo
        nvidia-driver-daemonset-410.84.202203290245-0-xxgdv   2/2   Running   0   23m   10.130.2.18   ip-10-0-143-147.ec2.internal   <none>   <none>

-    .. note:: With the Pod and node name, run the ``nvidia-smi`` on the correct node.
+    .. note:: With the pod and node name, run the ``nvidia-smi`` command on the correct node.

  #. Run the ``nvidia-smi`` command within the pod:
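The invocation itself is outside this hunk; one plausible form, reusing the pod name from the listing above, is this sketch:

.. code-block:: console

   # Sketch: pod name taken from the example listing above; adjust the namespace if needed.
   $ oc exec -it nvidia-driver-daemonset-410.84.202203290245-0-xxgdv -n nvidia-gpu-operator -- nvidia-smi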

@@ -513,6 +514,6 @@ To view GPU utilization, run ``nvidia-smi`` from a pod in the GPU Operator daemo
        |  No running processes found                                                 |
        +-----------------------------------------------------------------------------+

- Two tables are generated. The first table reflects the information about all available GPUs (the example shows one GPU). The second table provides details on the processes using the GPUs.
+ Two tables are generated. The first table reflects the information about all available GPUs (the example shows one GPU). The second table provides details about the processes using the GPUs.

- For more information describing the contents of the tables see the man page for ``nvidia-smi``.
+ For more information describing the contents of the tables, refer to the man page for ``nvidia-smi``.

openshift/install-nfd.rst

Lines changed: 3 additions & 3 deletions
@@ -26,7 +26,7 @@ The Node Feature Discovery (NFD) Operator is a prerequisite for the **NVIDIA GPU
        NAME                                      READY   STATUS    RESTARTS   AGE
        nfd-controller-manager-7f86ccfb58-nqgxm   2/2     Running   0          11m

- #. When the Node Feature Discovery is installed, create an instance of Node Feature Discovery using the **NodeFeatureDiscovery** tab.
+ #. When the Node Feature Discovery is installed, create an instance of Node Feature Discovery using the **NodeFeatureDiscovery** tab:

  #. Click **Operators** > **Installed Operators** from the side menu.

@@ -38,7 +38,7 @@ The Node Feature Discovery (NFD) Operator is a prerequisite for the **NVIDIA GPU
  #. In the following screen, click **Create**. This starts the Node Feature Discovery Operator that proceeds to label the nodes in the cluster that have GPUs.

- .. note:: The values prepopulated by the OperatorHub are valid for the GPU Operator.
+    .. note:: The values prepopulated by the OperatorHub are valid for the GPU Operator.

  *************************************************************************
  Verify that the Node Feature Discovery Operator is functioning correctly
@@ -61,7 +61,7 @@ The Node Feature Discovery Operator uses vendor PCI IDs to identify hardware in
  .. note:: ``0x10de`` is the PCI vendor ID assigned to NVIDIA.

- #. Verify that the GPU device (``pci-10de``) is discovered on the GPU node.
+ #. Verify that the GPU device (``pci-10de``) is discovered on the GPU node:

     .. code-block:: console
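The verification command is outside this hunk; a typical check, as a sketch (the exact label key NFD applies can vary by version), looks like:

.. code-block:: console

   # Sketch: looks for the NVIDIA PCI device label that NFD applies to GPU nodes.
   $ oc describe node | grep 'feature.node.kubernetes.io/pci-10de'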

openshift/introduction.rst

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ Red Hat OpenShift Container Platform includes enhancements to Kubernetes so user
  The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA),
  Kubernetes device plugin for GPUs, the `NVIDIA Container Toolkit <https://github.com/NVIDIA/nvidia-container-toolkit>`_,
- automatic node labeling using `GFD <https://github.com/NVIDIA/gpu-feature-discovery>`_, `DCGM <https://developer.nvidia.com/dcgm>`_-based monitoring and others.
+ automatic node labeling using `GFD <https://github.com/NVIDIA/gpu-feature-discovery>`_, `DCGM <https://developer.nvidia.com/dcgm>`_-based monitoring, and others.

  For guidance on the specific NVIDIA support entitlement needs,
  refer |essug|_ if you have an NVIDIA AI Enterprise entitlement.
