11 changes: 5 additions & 6 deletions gpu-operator/getting-started.rst
@@ -135,7 +135,6 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.

* - ``ccManager.enabled``
- When set to ``true``, the Operator deploys NVIDIA Confidential Computing Manager for Kubernetes.
Refer to :doc:`gpu-operator-confidential-containers` for more information.
- ``false``

* - ``cdi.enabled``
@@ -160,9 +159,9 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
* - ``daemonsets.labels``
- Map of custom labels to add to all GPU Operator managed pods.
- ``{}``

* - ``dcgmExporter.enabled``
- By default, the Operator gathers GPU telemetry in Kubernetes via `DCGM Exporter <https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html>`_.
Set this value to ``false`` to disable it.
Available values are ``true`` (default) or ``false``.
- ``true``
@@ -186,10 +185,10 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.

* - ``driver.kernelModuleType``
- Specifies the type of NVIDIA GPU kernel modules to use.
Valid values are ``auto`` (default), ``proprietary``, and ``open``.

``auto`` means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch used.
Note that ``auto`` is only supported with the 570.86.15 and 570.124.06 or later driver containers.
550 and 535 branch drivers do not yet support this mode.
``open`` means the open kernel module is used.
``proprietary`` means the proprietary module is used.
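
The options above map directly to Helm chart values. As a minimal sketch (not an exhaustive configuration), a values file that sets the options described in this excerpt might look like this; the release name and namespace in the install command are assumptions:

.. code-block:: yaml

   # values.yaml -- overrides for the options described above
   ccManager:
     enabled: false          # deploy the Confidential Computing Manager when true
   dcgmExporter:
     enabled: true           # gather GPU telemetry via DCGM Exporter (default)
   driver:
     kernelModuleType: auto  # auto | open | proprietary

.. code-block:: console

   $ helm upgrade --install gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator --create-namespace \
       -f values.yaml
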
26 changes: 13 additions & 13 deletions gpu-operator/life-cycle-policy.rst
@@ -55,13 +55,13 @@ The product life cycle and versioning are subject to change in the future.
* - GPU Operator Version
- Status

* - 25.10.x
- Generally Available

* - 25.3.x
- Maintenance

* - 24.9.x and lower
- EOL


@@ -104,7 +104,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
| `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_
| `535.261.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-261-03/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.247.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-247-01/index.html>`_
- | `580.82.07 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-82-07/index.html>`_ (**D**, **R**)
| `580.65.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html>`_
| `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
@@ -113,31 +113,31 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
| `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_
| `535.261.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-261-03/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.247.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-247-01/index.html>`_
- | `580.65.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html>`_ (**R**)
| `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
| `570.172.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-172-08/index.html>`_ (**D**)
| `570.158.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-158-01/index.html>`_
| `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_
| `535.261.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-261-03/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.247.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-247-01/index.html>`_
- | `580.65.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html>`_ (**R**)
| `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
| `570.172.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-172-08/index.html>`_ (**D**)
| `570.158.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-158-01/index.html>`_
| `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_
| `535.261.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-261-03/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.247.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-247-01/index.html>`_
- | `580.65.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html>`_ (**R**)
| `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
| `570.172.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-172-08/index.html>`_ (**D**)
| `570.158.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-158-01/index.html>`_
| `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.261.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-261-03/index.html>`_
| `535.247.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-247-01/index.html>`_

* - NVIDIA Driver Manager for Kubernetes
- :cspan:`1` `v0.8.1 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__
@@ -213,8 +213,8 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.

:sup:`1`
Known Issue: For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
NVIDIA recommends that you downgrade the driver to version 570.86.15 to work around this issue.
For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
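
As a hedged sketch, one way to apply the recommended downgrade is through the ``driver.version`` Helm value; the release name and namespace are assumptions:

.. code-block:: console

   $ helm upgrade gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator \
       --reuse-values \
       --set driver.version=570.86.15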

@@ -224,7 +224,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
:sup:`2`
This release of the GDS driver requires that you use the NVIDIA Open GPU Kernel module driver for the GPUs.
Refer to :doc:`gpu-operator-rdma` for more information.
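
As a sketch, enabling GDS together with the open kernel modules might look like the following in the Helm values; ``gds.enabled`` and ``driver.kernelModuleType`` are the documented values, and the combination shown is illustrative rather than authoritative:

.. code-block:: yaml

   driver:
     kernelModuleType: open  # GDS requires the NVIDIA Open GPU kernel module driver
   gds:
     enabled: true           # deploy the GDS (nvidia-fs) driver container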

.. note::

- Driver version could be different with NVIDIA vGPU, as it depends on the driver
49 changes: 24 additions & 25 deletions gpu-operator/release-notes.rst
@@ -55,7 +55,7 @@ New Features
- KubeVirt and OpenShift Virtualization: VM with GPU passthrough (Ubuntu 22.04 only)
- KubeVirt and OpenShift Virtualization: VM with time-slice vGPU (Ubuntu 22.04 only)

- RTX Pro 6000D

- KubeVirt and OpenShift Virtualization: VM with GPU passthrough (Ubuntu 22.04 only)

@@ -97,7 +97,7 @@ New Features

- 580.65.06 (recommended)
- 570.172.08 (default)
- 535.261.03

.. _v25.3.2-known-issues:

@@ -106,20 +106,20 @@ Known Issues

* Starting with version **580.65.06**, the driver container has **Coherent Driver Memory Management (CDMM)** enabled by default to support **GB200** on Kubernetes.
For more information about CDMM, refer to the `release notes <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html#hardware-software-support>`__.

.. note::

Currently, CDMM is not compatible with **Multi-Instance GPU (MIG)** sharing.
CDMM is also not compatible with **GPUDirect Storage**.
However, these limitations will remain in place until a future driver update removes them.

CDMM enablement applies only to **Grace-based systems** such as **GH200** and **GB200** and is ignored on other GPU platforms.
NVIDIA strongly recommends keeping CDMM enabled with Kubernetes on supported systems to prevent memory over-reporting and uncontrolled GPU memory access.

* For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
NVIDIA recommends that you upgrade the driver to version 570.172.08 to avoid this issue.
For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
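
A quick, generic way to check whether a cluster is affected is to list pods stuck in the ``Pending`` phase:

.. code-block:: console

   $ kubectl get pods --all-namespaces --field-selector=status.phase=Pending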

@@ -135,7 +135,7 @@ Fixed Issues
------------

* Fixed security vulnerabilities in NVIDIA Container Toolkit and related components.
This release addresses CVE-2025-23266 (Critical) and CVE-2025-23267 (High), which could allow
arbitrary code execution and link-following attacks in container environments.
For complete details, refer to the `NVIDIA Security Bulletin <https://nvidia.custhelp.com/app/answers/detail/a_id/5659>`__.
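
To confirm that a patched NVIDIA Container Toolkit is running after upgrading, you can query its version from the toolkit daemonset; the daemonset name and namespace below are assumptions based on a default GPU Operator installation:

.. code-block:: console

   $ kubectl exec -n gpu-operator ds/nvidia-container-toolkit-daemonset \
       -- nvidia-ctk --version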

@@ -169,13 +169,13 @@ New Features
- 535.247.01

* Added support for Red Hat Enterprise Linux 9.
Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.5, and 9.6 are available for x86-based platforms only.
They are not available for ARM-based systems.

* Added support for Kubernetes v1.33.

* Added support for setting the internalTrafficPolicy for the DCGM Exporter service.
You can configure this in the Helm chart by setting ``dcgmExporter.service.internalTrafficPolicy`` to ``Local`` or ``Cluster`` (default).
Choose ``Local`` to route internal traffic within the node only.
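
For example, a hedged sketch that switches the DCGM Exporter service to node-local routing; the release name and namespace are assumptions:

.. code-block:: console

   $ helm upgrade gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator \
       --reuse-values \
       --set dcgmExporter.service.internalTrafficPolicy=Local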

.. _v25.3.1-known-issues:
@@ -184,8 +184,8 @@ Known Issues
------------

* For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
NVIDIA recommends that you upgrade the driver to version 570.172.08 to avoid this issue.
For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
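
A hedged sketch of the recommended upgrade via the ``driver.version`` Helm value; the release name and namespace are assumptions:

.. code-block:: console

   $ helm upgrade gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator \
       --reuse-values \
       --set driver.version=570.172.08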

@@ -196,7 +196,7 @@ Known Issues
Fixed Issues
------------

* Fixed an issue where the NVIDIADriver controller could enter an endless loop of creating and deleting a DaemonSet.
This could occur when the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, or when none of the DaemonSet pods have been scheduled yet.
Refer to GitHub `pull request #1416 <https://github.com/NVIDIA/gpu-operator/pull/1416>`__ for more details.
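
As a sketch of the scenario this fix addresses, an NVIDIADriver resource whose pods tolerate a taint present on all of the nodes matching its nodeSelector might look like the following; the taint key, value, and node label are assumptions for illustration:

.. code-block:: yaml

   apiVersion: nvidia.com/v1alpha1
   kind: NVIDIADriver
   metadata:
     name: gpu-driver
   spec:
     nodeSelector:
       nvidia.com/gpu.present: "true"
     tolerations:
       - key: dedicated        # hypothetical taint applied to all GPU nodes
         operator: Equal
         value: gpu
         effect: NoSchedule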

@@ -227,32 +227,32 @@ New Features

* Added support for the NVIDIA GPU DRA Driver v25.3.0 component (coming soon), which enables Multi-Node NVLink through Kubernetes Dynamic Resource Allocation (DRA) and IMEX support.

This component can be installed alongside the GPU Operator.
It is supported on Kubernetes v1.32 clusters, running on NVIDIA HGX GB200 NVL, and with CDI enabled on your GPU Operator.

* Transitioned to installing the open kernel modules by default starting with R570 driver containers.

* Added a new parameter, ``kernelModuleType``, to the ClusterPolicy and NVIDIADriver APIs, which specifies how the GPU Operator and driver containers choose which kernel modules to use.

Valid values include:

* ``auto``: Default and recommended option. ``auto`` means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch used.
* ``open``: Use the NVIDIA Open GPU kernel module driver.
* ``proprietary``: Use the NVIDIA Proprietary GPU kernel module driver.

Currently, ``auto`` is only supported with the 570.86.15 and 570.124.06 or later driver containers.
550 and 535 branch drivers do not yet support this mode.

In previous versions, the ``useOpenKernelModules`` field configured the driver containers to install the NVIDIA Open GPU kernel module driver.
This field is now deprecated and will be removed in a future release.
If you were using the ``useOpenKernelModules`` field, NVIDIA recommends that you update your configuration to use the ``kernelModuleType`` field instead.
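
A minimal sketch of the new field on an NVIDIADriver resource follows; the resource name is an assumption, and with Helm the equivalent is ``--set driver.kernelModuleType=auto``:

.. code-block:: yaml

   apiVersion: nvidia.com/v1alpha1
   kind: NVIDIADriver
   metadata:
     name: gpu-driver
   spec:
     kernelModuleType: auto  # auto | open | proprietary; replaces useOpenKernelModules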

* Added support for Ubuntu 24.04 LTS.

* Added support for NVIDIA HGX GB200 NVL and NVIDIA HGX B200.
Note that HGX B200 requires a driver container version of 570.133.20 or later.

* Added support for the NVIDIA Data Center GPU Driver version 570.124.06.

* Added support for KubeVirt and OpenShift Virtualization with vGPU v18 on H200NVL.

@@ -302,7 +302,7 @@ New Features
* ``2g.47gb`` :math:`\times` 1
* ``3g.95gb`` :math:`\times` 1
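
As a sketch, a custom mig-parted configuration requesting this mixed layout might look like the following; the configuration name and exact schema are assumptions, so refer to the MIG documentation for the authoritative format:

.. code-block:: yaml

   version: v1
   mig-configs:
     custom-mixed:           # hypothetical named configuration
       - devices: all
         mig-enabled: true
         mig-devices:
           "2g.47gb": 1
           "3g.95gb": 1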

Improvements
------------

* Improved security by removing unnecessary permissions in the GPU Operator ClusterRole.
@@ -316,7 +316,7 @@ Improvements
Fixed Issues
------------

* Removed the default liveness probe from the ``nvidia-fs-ctr`` and ``nvidia-gdrcopy-ctr`` containers of the GPU driver daemonset.
Long response times of the ``lsmod`` command were causing timeout errors in the probe and unnecessary restarts of the container, resulting in the DaemonSet being in a bad state.

* Fixed an issue where the GPU Operator failed to create a valid DaemonSet name on OpenShift Container Platform when using a 64K kernel page size.
@@ -334,7 +334,7 @@ Fixed Issues
New Features
------------

* Added support for the NVIDIA Data Center GPU Driver version 570.86.15.
* The default driver in this version is now 550.144.03.
Refer to the :ref:`GPU Operator Component Matrix`
on the platform support page for more details on supported drivers.
@@ -1296,7 +1296,6 @@ New Features
* Added support for configuring Confidential Containers for GPU workloads as a technology preview feature.
This feature builds on the work for configuring Kata Containers and
introduces NVIDIA Confidential Computing Manager for Kubernetes as an operand of GPU Operator.
Refer to :doc:`gpu-operator-confidential-containers` for more information.

* Added support for the NVIDIA Data Center GPU Driver version 535.86.10.
Refer to the :ref:`GPU Operator Component Matrix`