
Commit fd8ce5f

Merge pull request #209 from empovit/remove-entitled-builds
Remove entitled driver builds support
2 parents a3c48cc + da9fb79 commit fd8ce5f

4 files changed

+42
-224
lines changed

openshift/appendix-ocp.rst

Lines changed: 27 additions & 196 deletions
@@ -9,228 +9,59 @@ Appendix
99

1010
.. _cluster-entitlement:
1111

12-
Enabling a Cluster-wide entitlement
13-
============================================
12+
Entitled NVIDIA Driver Builds No Longer Supported
13+
=================================================
1414

1515
Introduction
1616
-------------
1717

18-
.. note::
18+
.. important::
1919

20-
The Driver Toolkit, which enables entitlement-free deployments of the GPU Operator, is available for certain z-streams on OpenShift
21-
4.8 and all z-streams on OpenShift 4.9. However, some Driver Toolkit images are broken, so we recommend maintaining entitlements for
22-
all OpenShift versions prior to 4.9.9. See :ref:`broken driver toolkit <broken-dtk>` for more information.
20+
**Entitled NVIDIA driver builds are deprecated and not supported starting with Red Hat OpenShift 4.10.**
2321

24-
The **NVIDIA GPU Operator** deploys several pods used to manage and enable GPUs for use in the OpenShift Container Platform.
25-
Some of these Pods require packages that are not available by default in the Universal Base Image (UBI) that OpenShift Container
26-
Platform uses. To make packages available to the NVIDIA GPU driver container, you must enable cluster-wide entitled container builds in OpenShift.
22+
The Driver Toolkit (DTK) enables entitlement-free deployments of the GPU Operator. In the past, entitled builds were used pre-DTK and for some OpenShift versions where Driver Toolkit images were broken.
2723

28-
At a high level, enabling a cluster-wide entitlement involves three steps:
24+
If you encounter the :ref:`"broken driver toolkit detected" <broken-dtk>` warning on OpenShift 4.10 or later, you should :ref:`troubleshoot <broken-dtk-troubleshooting>` to find the root cause instead of falling back to entitled driver builds.
2925

30-
#. Download Red Hat OpenShift Container Platform subscription certificates from the `Red Hat Customer Portal <https://access.redhat.com/>`_ (access requires login credentials).
26+
If the broken DTK warning is encountered on an older version of OpenShift, refer to the documentation for an older version of the NVIDIA GPU operator to enable entitled builds. Keep in mind that older versions of OpenShift might no longer be supported.
3127

32-
#. Create a ``MachineConfig`` that enables the subscription manager and provides a valid subscription certificate. Wait for the ``MachineConfigOperator`` to reboot the node and finish applying the ``MachineConfig``.
28+
.. _broken-dtk-troubleshooting:
3329

34-
#. Validate that cluster-wide entitlement is working properly.
30+
Troubleshooting Broken Driver Toolkit Errors
31+
--------------------------------------------
3532

36-
These instructions assume you downloaded an entitlement encoded in base64 from the `Red Hat Customer Portal <https://access.redhat.com/>`_ or extracted it from an existing node.
33+
The most likely reason for the broken DTK message is Node Feature Discovery (NFD) not working correctly. NFD might be disabled, failing, or not updating the kernel version label for other reasons. Another cause might be a missing or incomplete DTK image stream, e.g. because of broken mirroring.
3734

38-
Creating entitled containers requires that you assign machine configuration that has a valid Red Hat entitlement certificate to your worker nodes. This step is necessary because Red Hat Enterprise Linux (RHEL) CoreOS nodes are not yet automatically entitled.
35+
Follow these steps for initial troubleshooting of Node Feature Discovery:
3936

40-
.. _obtain-entitlement:
41-
42-
Obtaining an entitlement certificate
43-
---------------------------------------
44-
45-
Follow the guidance below to edit obtain the entitlement certificate.
46-
47-
#. Navigate to the `Red Hat Customer Portal systems management page <https://access.redhat.com/management/systems/>`_ and click **New**.
48-
49-
.. image:: graphics/cluster_entitlement_1.png
50-
51-
#. Select **Hypervisor** and populate the **Name** field with the text **OpenShift-Entitlement**.
52-
53-
.. image:: graphics/entitlement_hypervisor.png
54-
55-
#. Click **CREATE**.
56-
57-
#. Select the **Subscriptions** tab and click **Attach Subscriptions**.
58-
59-
.. image:: graphics/cluster_entitlement_3.png
60-
61-
#. Search for **Red Hat Developer Subscription** [content here may vary according to accounts], select one of them and click **Attach Subscriptions**.
62-
63-
.. note::
64-
The **Red Hat Developer Subscription** is choosen here purely for illustrating this example. Choose an appropriate subscription relevant for your your needs.
65-
66-
#. Click **Download Certificates**.
67-
68-
.. image:: graphics/cluster_entitlement_5.png
69-
70-
#. Download and extract the file.
71-
72-
#. Extract the key *<key>.pem* and test it with this command:
73-
74-
.. code-block:: console
75-
76-
$ curl -E <key>.pem -Sfs -k https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml | head -3
77-
78-
.. note::
79-
80-
With a valid key, `curl` downloads the repository entrypoint and shows its `head` shown in the example below.
81-
82-
With an invalid key, `curl` download is refused by the Red Hat package mirror.
83-
84-
.. code-block:: console
85-
86-
<?xml version="1.0" encoding="UTF-8"?>
87-
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
88-
<revision>1631130504</revision>
89-
90-
Add a cluster-wide entitlement
91-
---------------------------------------
92-
93-
Use the following procedure to add a cluster-wide entitlement:
94-
95-
#. Create a local appropriately named directory. Change to this directory.
96-
97-
#. Download the :download:`machine config YAML template <download/0003-cluster-wide-machineconfigs.yaml.template>` for cluster-wide entitlements on OpenShift Container Platform. Save the downloaded file ``0003-cluster-wide-machineconfigs.yaml.template`` to the directory created in step 1.
98-
99-
#. Copy the selected ``pem`` file from your entitlement certificate to a local file named ``nvidia.pem``:
37+
#. **Check Node Feature Discovery (NFD) status:**
10038

10139
.. code-block:: console
10240
103-
$ cp <path/to/pem/file>/<certificate-file-name>.pem nvidia.pem
104-
105-
#. Generate the MachineConfig file by appending the entitlement certificate:
106-
107-
.. code-block:: console
41+
$ oc get pods -n openshift-nfd
10842
109-
$ sed -i -f - 0003-cluster-wide-machineconfigs.yaml.template << EOF
110-
s/BASE64_ENCODED_PEM_FILE/$(base64 -w0 nvidia.pem)/g
111-
EOF
43+
Ensure NFD pods are running and healthy. If NFD is not deployed or is failing, this can cause DTK issues.
11244

113-
#. Apply the machine config to the OpenShift cluster:
45+
#. **Verify kernel version labels are present and correct:**
11446

11547
.. code-block:: console
11648
117-
$ oc apply -f 0003-cluster-wide-machineconfigs.yaml.template
49+
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{":\t"}{.metadata.labels.feature\.node\.kubernetes\.io/kernel-version\.full}{"\n"}{end}'
11850
119-
.. note:: This step triggers an update driven by the OpenShift Machine Config Operator and initiates a restart on all worker nodes one by one.
51+
Ensure that the nodes have kernel version labels and that the labels match the cluster's current OpenShift version.
12052

121-
.. code-block:: console
122-
123-
machineconfig.machineconfiguration.openshift.io/50-rhsm-conf created
124-
machineconfig.machineconfiguration.openshift.io/50-entitlement-pem created
125-
machineconfig.machineconfiguration.openshift.io/50-entitlement-key-pem created
126-
127-
#. Check the ``machineconfig``:
53+
#. **Check Driver Toolkit image stream:**
12854

12955
.. code-block:: console
13056
131-
$ oc get machineconfig | grep entitlement
57+
$ oc get -n openshift is/driver-toolkit
13258
133-
.. code-block:: console
59+
Verify that the ``driver-toolkit`` image stream exists and has tags that correspond to the current OpenShift version.
13460

135-
50-entitlement-key-pem 2.2.0 45s
136-
50-entitlement-pem 2.2.0 45s
137-
138-
#. Monitor the ``MachineConfigPool`` object:
139-
140-
.. code-block:: console
141-
142-
$ oc get mcp/worker
143-
144-
.. code-block:: console
145-
146-
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
147-
worker rendered-worker-5f1eaf24c760fb389d47d3c37ef41c29 True False False 2 2 2 0 7h15m
148-
149-
Here you can see that the MCP is updated, not updating or degraded, so all the ``MachineConfig`` resources have been successfully applied to the nodes and you can proceed to validate the cluster.
150-
151-
Validate the cluster-wide entitlement
152-
---------------------------------------
153-
154-
Validate the cluster-wide entitlement with a test pod that queries a Red Hat subscription repo for the kernel-devel package.
155-
156-
#. Create a test pod:
157-
158-
.. code-block:: console
159-
160-
$ cat << EOF >> mypod.yaml
161-
162-
apiVersion: v1
163-
kind: Pod
164-
metadata:
165-
name: cluster-entitled-build-pod
166-
namespace: default
167-
spec:
168-
containers:
169-
- name: cluster-entitled-build
170-
image: registry.access.redhat.com/ubi8:latest
171-
command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
172-
restartPolicy: Never
173-
EOF
174-
175-
#. Apply the test pod:
176-
177-
.. code-block:: console
178-
179-
$ oc create -f mypod.yaml
180-
181-
.. code-block:: console
182-
183-
pod/cluster-entitled-build-pod created
184-
185-
#. Verify the test pod is created:
186-
187-
.. code-block:: console
188-
189-
$ oc get pods -n default
190-
191-
.. code-block:: console
192-
193-
NAME READY STATUS RESTARTS AGE
194-
cluster-entitled-build-pod 1/1 Completed 0 64m
195-
196-
#. Validate that the pod can locate the necessary kernel-devel packages:
197-
198-
.. code-block:: console
199-
200-
$ oc logs cluster-entitled-build-pod -n default
201-
202-
.. code-block:: console
61+
For additional troubleshooting resources:
20362

204-
Updating Subscription Management repositories.
205-
Unable to read consumer identity
206-
Subscription Manager is operating in container mode.
207-
Red Hat Enterprise Linux 8 for x86_64 - AppStre 15 MB/s | 14 MB 00:00
208-
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 15 MB/s | 13 MB 00:00
209-
Red Hat Universal Base Image 8 (RPMs) - BaseOS 493 kB/s | 760 kB 00:01
210-
Red Hat Universal Base Image 8 (RPMs) - AppStre 2.0 MB/s | 3.1 MB 00:01
211-
Red Hat Universal Base Image 8 (RPMs) - CodeRea 12 kB/s | 9.1 kB 00:00
212-
====================== Name Exactly Matched: kernel-devel ======================
213-
kernel-devel-4.18.0-80.1.2.el8_0.x86_64 : Development package for building
214-
: kernel modules to match the kernel
215-
kernel-devel-4.18.0-80.el8.x86_64 : Development package for building kernel
216-
: modules to match the kernel
217-
kernel-devel-4.18.0-80.4.2.el8_0.x86_64 : Development package for building
218-
: kernel modules to match the kernel
219-
kernel-devel-4.18.0-80.7.1.el8_0.x86_64 : Development package for building
220-
: kernel modules to match the kernel
221-
kernel-devel-4.18.0-80.11.1.el8_0.x86_64 : Development package for building
222-
: kernel modules to match the kernel
223-
kernel-devel-4.18.0-147.el8.x86_64 : Development package for building kernel
224-
: modules to match the kernel
225-
kernel-devel-4.18.0-80.11.2.el8_0.x86_64 : Development package for building
226-
: kernel modules to match the kernel
227-
kernel-devel-4.18.0-80.7.2.el8_0.x86_64 : Development package for building
228-
: kernel modules to match the kernel
229-
kernel-devel-4.18.0-147.0.3.el8_1.x86_64 : Development package for building
230-
: kernel modules to match the kernel
231-
kernel-devel-4.18.0-147.0.2.el8_1.x86_64 : Development package for building
232-
: kernel modules to match the kernel
233-
kernel-devel-4.18.0-147.3.1.el8_1.x86_64 : Development package for building
234-
: kernel modules to match the kernel
235-
236-
Any Pod based on RHEL can now run entitled builds.
63+
* `Node Feature Discovery documentation <https://kubernetes-sigs.github.io/node-feature-discovery/>`_
64+
* `Red Hat Node Feature Discovery Operator documentation <https://docs.openshift.com/container-platform/latest/hardware_enablement/psap-node-feature-discovery-operator.html>`_
65+
* `OpenShift Driver Toolkit documentation <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/specialized_hardware_and_driver_enablement/driver-toolkit>`_
66+
* `OpenShift Driver Toolkit GitHub repository <https://github.com/openshift/driver-toolkit/>`_
67+
* `OpenShift troubleshooting guide <https://docs.openshift.com/container-platform/latest/support/troubleshooting/>`_
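
As a rough sketch of how the kernel version label check in the troubleshooting steps above could be automated, the shell function below filters the output of that command and prints only the nodes whose label value is empty. The function name ``find_unlabeled_nodes`` is hypothetical and is not part of ``oc``, NFD, or the GPU Operator. It assumes the ``<node>:<TAB><label>`` line format produced by the ``jsonpath`` expression in that step.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Reads lines of the form "<node-name>:<TAB><kernel-version>" on stdin
# and prints the name of every node whose kernel-version value is empty,
# that is, nodes where NFD has not set the label.
find_unlabeled_nodes() {
    # Split each line on the ":<TAB>" separator; field 1 is the node name,
    # field 2 is the label value (empty when the label is missing).
    awk -F ':\t' '$2 == "" { print $1 }'
}
```

To use it, pipe the output of the ``oc get nodes -o jsonpath=...`` command from the label verification step into ``find_unlabeled_nodes``. An empty result suggests that every node carries the label; any node names printed are candidates for a closer look at NFD.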

openshift/get-entitlement.rst

Lines changed: 4 additions & 17 deletions
@@ -4,24 +4,11 @@
44
.. _get-entitlement:
55

66
####################################################
7-
Obtaining an entitlement certificate
7+
Entitled Driver Builds No Longer Supported
88
####################################################
99

10-
Follow the guidance below to edit your cluster subscription setting and obtain the entitlement.
10+
.. important::
1111

12-
#. Navigate to `https://access.redhat.com/management/systems/`` and click **New**.
13-
Log in to `access.redhat.com <https://console.redhat.com/>`_ .
12+
**Entitled NVIDIA driver builds are deprecated and not supported.**
1413

15-
#. Fill "Virtual Server", "x86_64", 1 core, RHEL 8, and click Create.
16-
17-
.. image:: graphics/locate-cluster-acm.png
18-
19-
#. Go to the "Subscription" page and click "Attach Subscriptions"r.
20-
21-
#. Search for "Red Hat Developer Subscription" [content here may vary according to accounts], tick one of them and click "Attach Subscriptions".
22-
23-
#. Click "Download Certificates"
24-
25-
#. Download and extract the file.
26-
27-
#. Extract the key from "consumer_export.zip/export/entitlement_certificates/<key>.pem" and test it with this command:
14+
If you encounter issues with the NVIDIA GPU driver build that might require entitlement, please refer to the Driver Toolkit (DTK) troubleshooting section: :ref:`broken-dtk-troubleshooting`

openshift/steps-overview.rst

Lines changed: 8 additions & 6 deletions
@@ -120,13 +120,15 @@ A fix for this issue has been merged in the following releases:
120120
About the Broken Driver Toolkit
121121
*******************************
122122
123-
OpenShift 4.8.19, 4.8.21, 4.9.8 are known to have a broken Driver Toolkit image.
124-
The following messages are recorded in the driver pod containers.
125-
Follow the guidance in :ref:`enabling a Cluster-wide entitlement <cluster-entitlement>`.
126-
Afterward, the ``nvidia-driver-daemonset`` automatically uses an entitlement-based fallback.
123+
.. important::
124+
125+
**Entitled NVIDIA driver builds are deprecated and not supported.**
126+
127+
OpenShift 4.8.19, 4.8.21, and 4.9.8 are known to have a broken Driver Toolkit image. However, on newer OpenShift versions the driver builds rely on the Driver Toolkit (DTK). With these versions, entitled builds are not supported and might not work.
128+
129+
When the DTK image is broken, the following messages are recorded in the driver pod containers. Follow the guidance in :ref:`broken-dtk-troubleshooting` to troubleshoot the underlying issue.
127130
128-
To disable the use of Driver Toolkit image altogether, edit the cluster policy instance and set ``operator.use_ocp_driver_toolkit`` option to ``false``.
129-
Also, we recommend maintaining entitlements for OpenShift versions < 4.9.9.
131+
If you need to force entitled builds, disable the use of the Driver Toolkit image by editing the cluster policy instance and setting the ``operator.use_ocp_driver_toolkit`` option to ``false``.
130132
131133
#. View the logs from the OpenShift Driver Toolkit container:
132134

openshift/troubleshooting-gpu-ocp.rst

Lines changed: 3 additions & 5 deletions
@@ -194,11 +194,9 @@ This is an illustrated example of a situation where the deployment of the Operat
194194
195195
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed
196196
197-
This message maybe associated with the unsuccessful deployment of the driver toolkit. To confirm the driver toolkit is successfully deployed follow the guidance in :ref:`verify_toolkit`.
198-
If you see this message a workaround is to edit the created ``gpu-cluster-policy`` YAML file in the OpenShift Container Platform console and set ``use_ocp_driver_toolkit`` to ``false``.
199-
200-
Set up the entitlement.
201-
Refer to :ref:`cluster-entitlement` for more information.
197+
This message may be associated with the unsuccessful deployment of the driver toolkit. To confirm that the driver toolkit is successfully deployed, follow the guidance in :ref:`verify_toolkit`.
198+
If you see this message, you should troubleshoot the underlying issue instead of relying on RHEL entitlement. Entitled driver builds are deprecated and not supported on recent versions of Red Hat OpenShift.
199+
See :ref:`broken-dtk-troubleshooting` for more information.
202200

203201
.. _verify_toolkit:
204202
