16 changes: 14 additions & 2 deletions gpu-operator/getting-started.rst
@@ -173,6 +173,17 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
Set this value to ``false`` when using the Operator on systems with pre-installed drivers.
- ``true``

* - ``driver.kernelModuleType``
- Specifies the type of NVIDIA GPU kernel modules to use.
Valid values are ``auto`` (default), ``proprietary``, and ``open``.

``auto`` means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch in use.
Note that ``auto`` is supported only with the 570.86.15 and 570.124.06 or later driver containers; the 550 and 535 branch drivers do not yet support this mode.
``open`` means the open kernel modules are used, and ``proprietary`` means the proprietary kernel modules are used.
- ``auto``

* - ``driver.repository``
- The images are downloaded from NGC. Specify another image repository when using
custom driver images.
@@ -197,8 +208,9 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
runs slowly in your cluster.
- ``60s``

* - ``driver.useOpenKernelModules``
- When set to ``true``, the driver containers install the NVIDIA Open GPU Kernel module driver.
* - ``driver.useOpenKernelModules`` (deprecated)
- This option is deprecated as of v25.3.0 and will be ignored; use ``driver.kernelModuleType`` instead (a migration sketch follows this file's diff).
When set to ``true``, the driver containers install the NVIDIA Open GPU Kernel module driver.
- ``false``

* - ``driver.usePrecompiled``
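A minimal migration sketch for the deprecated Helm option (the release name and namespace are illustrative, not from this PR):

.. code-block:: console

   # Before v25.3.0 (deprecated):
   $ helm upgrade gpu-operator nvidia/gpu-operator \
       -n gpu-operator \
       --set driver.useOpenKernelModules=true

   # v25.3.0 and later: select the module type explicitly,
   # or omit the option to accept the default, auto.
   $ helm upgrade gpu-operator nvidia/gpu-operator \
       -n gpu-operator \
       --set driver.kernelModuleType=open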
12 changes: 10 additions & 2 deletions gpu-operator/gpu-driver-configuration.rst
@@ -195,6 +195,13 @@ The following table describes some of the fields in the custom resource.
- Specifies the credentials to provide to the registry if the registry is secured.
- None

* - ``kernelModuleType``
- Specifies the type of NVIDIA GPU kernel modules to use.
Valid values are ``auto`` (default), ``proprietary``, and ``open``.

``auto`` means that the recommended kernel module type is chosen based on the GPU devices on the host and the driver branch in use.
A sample resource that sets this field follows this file's diff.
- ``auto``

* - ``labels``
- Specifies a map of key and value pairs to add as custom labels to the driver pod.
- None
@@ -217,8 +224,9 @@ The following table describes some of the fields in the custom resource.
- Specifies the container registry that contains the driver container.
- ``nvcr.io/nvidia``

* - ``useOpenKernelModules``
- Specifies to use the NVIDIA Open GPU Kernel modules.
* - ``useOpenKernelModules`` (deprecated)
- This field is deprecated as of v25.3.0 and will be ignored; use ``kernelModuleType`` instead.
When set to ``true``, the NVIDIA Open GPU Kernel modules are used.
- ``false``

* - ``tolerations``
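To illustrate the ``kernelModuleType`` field described above, a minimal sketch of an ``NVIDIADriver`` custom resource; the metadata name and driver version are illustrative, not from this PR:

.. code-block:: yaml

   apiVersion: nvidia.com/v1alpha1
   kind: NVIDIADriver
   metadata:
     name: demo-open-modules   # illustrative name
   spec:
     image: driver
     repository: nvcr.io/nvidia
     version: "570.124.06"     # illustrative driver branch
     # Pin the open kernel modules instead of relying on auto selection.
     kernelModuleType: open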
13 changes: 6 additions & 7 deletions gpu-operator/gpu-operator-rdma.rst
@@ -30,10 +30,11 @@ To support GPUDirect RDMA, userspace CUDA APIs are required.
The kernel mode support is provided by one of two approaches: DMA-BUF from the Linux kernel or the legacy ``nvidia-peermem`` kernel module.
NVIDIA recommends using DMA-BUF rather than the ``nvidia-peermem`` kernel module from the GPU Driver.

Starting with v23.9.1 of the Operator, the Operator uses GDS driver version 2.17.5 or newer.
The Operator uses GDS driver version 2.17.5 or newer.
This version and later are supported only with the NVIDIA Open GPU Kernel module driver.
The sample commands for installing the Operator include the ``--set useOpenKernelModules=true``
command-line argument for Helm.
In GPU Operator v25.3.0 and later, ``driver.kernelModuleType`` defaults to ``auto`` for the supported driver versions.
This configuration allows the GPU Operator to choose the recommended kernel module type based on the driver branch and the GPU devices available.
Newer driver versions use the open kernel modules by default; however, to make sure that you are using the open kernel modules, include the ``--set driver.kernelModuleType=open`` command-line argument in your Helm install command.
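One quick way to confirm which module type is actually loaded on a node is to check the kernel module license string: the open kernel modules report a dual MIT/GPL license, while the proprietary modules report ``NVIDIA``. A minimal sketch:

.. code-block:: console

   # Run on the GPU node or inside the driver container.
   # "Dual MIT/GPL" indicates the open kernel modules; "NVIDIA" indicates proprietary.
   $ modinfo -F license nvidia
   Dual MIT/GPL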

In conjunction with the Network Operator, the GPU Operator can be used to
set up the networking-related components such as network device kernel drivers and Kubernetes device plugins to enable
@@ -128,7 +129,6 @@ To use DMA-BUF and network device drivers that are installed by the Network Operator:
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set driver.useOpenKernelModules=true

To use DMA-BUF and network device drivers that are installed on the host:

Expand All @@ -138,11 +138,10 @@ To use DMA-BUF and network device drivers that are installed on the host:
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set driver.useOpenKernelModules=true \
--set driver.rdma.useHostMofed=true

To use the legacy ``nvidia-peermem`` kernel module instead of DMA-BUF, add ``--set driver.rdma.enabled=true`` to either of the preceding commands.
The ``driver.useOpenKernelModules=true`` argument is optional for using the legacy kernel driver.
Add ``--set driver.kernelModuleType=open`` if you are using a driver version from a branch earlier than R570.
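As a combined example, an untested sketch for the legacy ``nvidia-peermem`` path on a pre-R570 driver branch; the ``helm install`` boilerplate is assumed to match the preceding commands:

.. code-block:: console

   $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --version=${version} \
        --set driver.rdma.enabled=true \
        --set driver.kernelModuleType=open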

Verifying the Installation of GPUDirect with RDMA
=================================================
@@ -431,11 +430,11 @@ The following sample command applies to clusters that use the Network Operator t
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set driver.useOpenKernelModules=true \
--set gds.enabled=true

Add ``--set driver.rdma.enabled=true`` to the command to use the legacy ``nvidia-peermem`` kernel module.

Add ``--set driver.kernelModuleType=open`` if you are using a driver version from a branch earlier than R570.

Verification
==============
41 changes: 18 additions & 23 deletions gpu-operator/life-cycle-policy.rst
@@ -55,13 +55,13 @@ The product life cycle and versioning are subject to change in the future.
* - GPU Operator Version
- Status

* - 24.9.x
* - 25.3.x
- Generally Available

* - 24.6.x
* - 24.9.x
- Maintenance

* - 24.3.x and lower
* - 24.6.x and lower
- EOL


@@ -89,60 +89,55 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
- ${version}

* - NVIDIA GPU Driver
- | `570.86.15 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-86-15/index.html>`_ (recommended),
| `565.57.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-565-57-01/index.html>`_
| `560.35.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-560-35-03/index.html>`_
| `550.144.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-144-03/index.html>`_ (default),
| `550.127.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-127-08/index.html>`_
| `535.230.02 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-230-02/index.html>`_
| `535.216.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-216-03/index.html>`_
- | `570.124.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-124-06/index.html>`_ (default, recommended),
| `570.86.15 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-86-15/index.html>`_

* - NVIDIA Driver Manager for Kubernetes
- `v0.7.0 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__
- `v0.8.0 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__

* - NVIDIA Container Toolkit
- `1.17.4 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`__
- `1.17.5 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`__

* - NVIDIA Kubernetes Device Plugin
- `0.17.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
- `0.17.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

* - DCGM Exporter
- `3.3.9-3.6.1 <https://github.com/NVIDIA/dcgm-exporter/releases>`__
- `4.1.1-4.0.4 <https://github.com/NVIDIA/dcgm-exporter/releases>`__

* - Node Feature Discovery
- v0.16.6
- `v0.17.2 <https://github.com/kubernetes-sigs/node-feature-discovery/releases/>`__

* - | NVIDIA GPU Feature Discovery
| for Kubernetes
- `0.17.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
- `0.17.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

* - NVIDIA MIG Manager for Kubernetes
- `0.10.0 <https://github.com/NVIDIA/mig-parted/tree/main/deployments/gpu-operator>`__
- `0.12.1 <https://github.com/NVIDIA/mig-parted/tree/main/deployments/gpu-operator>`__

* - DCGM
- `3.3.9-1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__
- `4.1.1-2 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__

* - Validator for NVIDIA GPU Operator
- ${version}

* - NVIDIA KubeVirt GPU Device Plugin
- `v1.2.10 <https://github.com/NVIDIA/kubevirt-gpu-device-plugin>`__
- `v1.3.1 <https://github.com/NVIDIA/kubevirt-gpu-device-plugin>`__

* - NVIDIA vGPU Device Manager
- `v0.2.8 <https://github.com/NVIDIA/vgpu-device-manager>`__
- `v0.3.0 <https://github.com/NVIDIA/vgpu-device-manager>`__

* - NVIDIA GDS Driver |gds|_
- `2.20.5 <https://github.com/NVIDIA/gds-nvidia-fs/releases>`__

* - NVIDIA Kata Manager for Kubernetes
- `v0.2.2 <https://github.com/NVIDIA/k8s-kata-manager>`__
- `v0.2.3 <https://github.com/NVIDIA/k8s-kata-manager>`__

* - | NVIDIA Confidential Computing
| Manager for Kubernetes
- v0.1.1

* - NVIDIA GDRCopy Driver
- `v2.4.1-1 <https://github.com/NVIDIA/gdrcopy/releases>`__
- `v2.4.4 <https://github.com/NVIDIA/gdrcopy/releases>`__

.. _gds-open-kernel:

@@ -156,4 +151,4 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
version downloaded from the `NVIDIA vGPU Software Portal <https://nvid.nvidia.com/dashboard/#/dashboard>`_.
- The GPU Operator is supported on all active NVIDIA data center production drivers.
Refer to `Supported Drivers and CUDA Toolkit Versions <https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers>`_
for more information.
2 changes: 1 addition & 1 deletion gpu-operator/manifests/input/nvd-demo-gold.yaml
@@ -16,6 +16,7 @@ spec:
image: driver
imagePullPolicy: IfNotPresent
imagePullSecrets: []
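# Selects auto, open, or proprietary kernel modules; replaces the deprecated useOpenKernelModules field.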
kernelModuleType: auto
manager: {}
nodeSelector:
driver.config: "gold"
Expand All @@ -30,6 +31,5 @@ spec:
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 60
useOpenKernelModules: false
usePrecompiled: false
version: 535.104.12