
Commit aeadc60

Authored by: chenopis, shivakunv (Shiva Kumar), tariq1890
Update release-25.3 (#280)
* Update the building of vgpu gpu-operator install doc (#269)

  * Update the building of vgpu gpu-operator install doc
  * Update gpu-operator/install-gpu-operator-vgpu.rst
  * Remove VERSION variable from instructions
  * Remove "and append -grid" from instructions; update Install the Operator driver version variable

* Correct HMM name to Heterogeneous Memory Management
* Fix note RST formatting
* Update HMM wording

Signed-off-by: Shiva Kumar (SW-CLOUD) <[email protected]>
Signed-off-by: chenopis <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>
Co-authored-by: Shiva Kumar <[email protected]>
Co-authored-by: Andrew Chen <[email protected]>
Co-authored-by: Tariq <[email protected]>
1 parent 97f236e commit aeadc60

File tree

3 files changed: +23 additions, -48 deletions

gpu-operator/custom-driver-params.rst

Lines changed: 2 additions & 1 deletion
@@ -49,7 +49,8 @@ To pass custom parameters, execute the following steps.
 Example using ``nvidia-uvm`` module
 -----------------------------------
 
-This example shows the High Memory Mode being disabled in the ``nvidia-uvm`` module.
+This example shows the Heterogeneous Memory Management (HMM) being disabled in the ``nvidia-uvm`` module.
+Refer to `Simplifying GPU Application Development with Heterogeneous Memory Management <https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/>`_ for more information about HMM.
 
 #. Create a configuration file named ``nvidia-uvm.conf``:

gpu-operator/install-gpu-operator-vgpu.rst

Lines changed: 14 additions & 41 deletions
@@ -104,29 +104,25 @@ Perform the following steps to build and push a container image that includes th
 
    .. code-block:: console
 
-      $ git clone https://gitlab.com/nvidia/container-images/driver
+      $ git clone https://github.com/NVIDIA/gpu-driver-container
 
    .. code-block:: console
 
-      $ cd driver
+      $ cd gpu-driver-container
 
-#. Change directory to the operating system name and version under the driver directory:
+#. Copy the NVIDIA vGPU guest driver from your extracted ZIP file and the NVIDIA vGPU driver catalog file to the operating system version you want to build the driver container for:
 
-   .. code-block:: console
-
-      $ cd ubuntu20.04
-
-   For Red Hat OpenShift Container Platform, use a directory that includes ``rhel`` in the directory name.
-
-#. Copy the NVIDIA vGPU guest driver from your extracted ZIP file and the NVIDIA vGPU driver catalog file:
+   Copy ``<local-driver-download-directory>/\*-grid.run`` and ``vgpuDriverCatalog.yaml`` to ``ubuntu22.04/drivers/``.
 
    .. code-block:: console
 
-      $ cp <local-driver-download-directory>/*-grid.run drivers/
+      $ cp <local-driver-download-directory>/*-grid.run ubuntu22.04/drivers/
 
    .. code-block:: console
 
-      $ cp vgpuDriverCatalog.yaml drivers/
+      $ cp vgpuDriverCatalog.yaml ubuntu22.04/drivers/
+
+   For Red Hat OpenShift Container Platform, use a directory that includes ``rhel`` in the directory name.
 
 #. Set environment variables for building the driver container image.

@@ -141,35 +137,17 @@ Perform the following steps to build and push a container image that includes th
 
    .. code-block:: console
 
-      $ export OS_TAG=ubuntu20.04
+      $ export OS_TAG=ubuntu22.04
 
    The value must match the guest operating system version.
    For Red Hat OpenShift Container Platform, specify ``rhcos4.<x>`` where ``x`` is the supported minor OCP version.
    Refer to :ref:`Supported Operating Systems and Kubernetes Platforms` for the list of supported OS distributions.
 
-- Specify the driver container image tag such as ``1.0.0``:
-
-  .. code-block:: console
-
-     $ export VERSION=1.0.0
-
-  The specified value can be any user-defined value.
-  The value is used to install the Operator in a subsequent step.
-
-- Specify the version of the CUDA base image to use when building the driver container:
-
-  .. code-block:: console
-
-     $ export CUDA_VERSION=11.8.0
-
-  The CUDA version only specifies the base image used to build the driver container.
-  The version does not have any correlation to the version of CUDA that is associated with or supported by the resulting driver container.
-
-- Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal and append ``-grid``:
+- Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal:
 
    .. code-block:: console
 
-      $ export VGPU_DRIVER_VERSION=525.60.13-grid
+      $ export VGPU_DRIVER_VERSION=580.95.05
 
    The Operator automatically selects the compatible guest driver version from the drivers bundled with the ``driver`` image.
    If you disable the version check by specifying ``--build-arg DISABLE_VGPU_VERSION_CHECK=true`` when you build the driver image,
@@ -179,12 +157,7 @@ Perform the following steps to build and push a container image that includes th
 
    .. code-block:: console
 
-      $ sudo docker build \
-          --build-arg DRIVER_TYPE=vgpu \
-          --build-arg DRIVER_VERSION=$VGPU_DRIVER_VERSION \
-          --build-arg CUDA_VERSION=$CUDA_VERSION \
-          --build-arg TARGETARCH=amd64 \  # amd64 or arm64
-          -t ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG} .
+      $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make build-vgpuguest-${OS_TAG}
 
 #. Push the driver container image to your private registry.

@@ -200,7 +173,7 @@ Perform the following steps to build and push a container image that includes th
 
    .. code-block:: console
 
-      $ sudo docker push ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG}
+      $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make push-vgpuguest-${OS_TAG}
 
 **************************************************************************************
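
The new build and push steps in the hunks above can be sketched end to end as follows. The registry value is a placeholder, and the ``make`` invocations are assembled as strings here because actually running them requires the cloned ``gpu-driver-container`` repository and Docker.

```shell
# Placeholder values for illustration; substitute your own.
PRIVATE_REGISTRY=registry.example.com/nvidia
OS_TAG=ubuntu22.04
VGPU_DRIVER_VERSION=580.95.05

# With this change, Makefile targets replace the raw docker build/push commands;
# the driver version and image name are passed through environment variables.
BUILD_CMD="VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make build-vgpuguest-${OS_TAG}"
PUSH_CMD="VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make push-vgpuguest-${OS_TAG}"

echo "$BUILD_CMD"
echo "$PUSH_CMD"
```

Note that the image tag is now derived from the OS tag rather than the removed user-chosen ``VERSION`` value.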
@@ -274,7 +247,7 @@ Install the Operator
       -n gpu-operator --create-namespace \
       nvidia/gpu-operator \
       --set driver.repository=${PRIVATE_REGISTRY} \
-      --set driver.version=${VERSION} \
+      --set driver.version=${VGPU_DRIVER_VERSION} \
       --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
       --set driver.licensingConfig.configMapName=licensing-config
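
After this change, the full ``helm install`` command implied by the hunk above would look roughly as follows. It is assembled as a string so it can be shown without a live cluster; the registry and secret names are placeholders.

```shell
# Placeholder values for illustration; substitute your own.
PRIVATE_REGISTRY=registry.example.com/nvidia
VGPU_DRIVER_VERSION=580.95.05
REGISTRY_SECRET_NAME=registry-secret

# driver.version now takes the vGPU driver version directly, replacing the
# removed user-defined VERSION image tag.
HELM_CMD="helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.repository=${PRIVATE_REGISTRY} \
  --set driver.version=${VGPU_DRIVER_VERSION} \
  --set driver.imagePullSecrets={${REGISTRY_SECRET_NAME}} \
  --set driver.licensingConfig.configMapName=licensing-config"

echo "$HELM_CMD"
```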

gpu-operator/platform-support.rst

Lines changed: 7 additions & 6 deletions
@@ -173,12 +173,13 @@ The following NVIDIA data center GPUs are supported on x86 based platforms:
 | NVIDIA T400             | Turing                 |
 +-------------------------+------------------------+
 
-   .. note::
+.. note::
 
    NVIDIA RTX PRO 6000 Blackwell Server Edition notes:
-      * Driver versions 575.57.08 or later is required.
-      * MIG is not supported on the 575.57.08 driver release.
-      * You must disable High Memory Mode (HMM) in UVM by :ref:`Customizing NVIDIA GPU Driver Parameters during Installation`.
+
+   * Driver versions 575.57.08 or later is required.
+   * MIG is not supported on the 575.57.08 driver release.
+   * In cases where CUDA init fails, you may need to disable Heterogeneous Memory Management (HMM) in UVM by :ref:`Customizing NVIDIA GPU Driver Parameters during Installation`.
 
 .. tab-item:: B-series Products
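
The customization the note refers to is typically wired up by handing the GPU Operator a ConfigMap of module parameters. This is a sketch only: the ConfigMap name ``kernel-module-params``, the Helm value ``driver.kernelModuleConfig.name``, and the ``uvm_disable_hmm`` parameter are assumptions drawn from GPU Operator conventions, not from this diff.

```shell
# Write the hypothetical nvidia-uvm parameter file (parameter name is an assumption).
printf 'uvm_disable_hmm=1\n' > nvidia-uvm.conf

# The following are illustrative only; they need a live cluster to run:
#   kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia-uvm.conf
#   helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
#     --set driver.kernelModuleConfig.name=kernel-module-params

cat nvidia-uvm.conf
```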

@@ -192,9 +193,9 @@ The following NVIDIA data center GPUs are supported on x86 based platforms:
 | NVIDIA HGX GB200 NVL72  | NVIDIA Blackwell       |
 +-------------------------+------------------------+
 
-   .. note::
+.. note::
 
-      * HGX B200 requires a driver container version of 570.133.20 or later.
+   * HGX B200 requires a driver container version of 570.133.20 or later.
 
 .. _gpu-operator-arm-platforms:
