Commit 41120ae

Update docs for 25.10.1 (#326)
small updates
add additional components
Apply suggestions from code review
add in known issues and component updates
Signed-off-by: Abigail McCarthy <[email protected]>
1 parent b9978e3 commit 41120ae

3 files changed, 68 additions and 4 deletions

gpu-operator/life-cycle-policy.rst

Lines changed: 17 additions & 3 deletions
@@ -87,9 +87,10 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
  :header-rows: 2

  * - :rspan:`1` Component
- - GPU Operator Version
+ - :cspan:`2` GPU Operator Version

  * - v25.10.0
+ - v25.10.1

  * - NVIDIA GPU Driver |ki|_
  - | `580.95.05 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-95-05/index.html>`_ (**D**, **R**)
@@ -98,32 +99,44 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
  | `570.195.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-195-03/index.html>`_
  | `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
  | `535.274.02 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-274-03/index.html>`_
+ - | `580.105.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-105-08/index.html>`_ (**D**, **R**)
+ | `580.95.05 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-95-05/index.html>`_
+ | `580.82.07 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-82-07/index.html>`_
+ | `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
+ | `570.195.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-195-03/index.html>`_
+ | `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
+ | `535.274.02 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-274-03/index.html>`_


  * - NVIDIA Driver Manager for Kubernetes
  - `v0.9.0 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__
+ - `v0.9.1 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__

  * - NVIDIA Container Toolkit
  - `1.18.0 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`__

  * - NVIDIA Kubernetes Device Plugin
  - `0.18.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
+ - `0.18.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

  * - DCGM Exporter
  - `v4.4.1-4.6.0 <https://github.com/NVIDIA/dcgm-exporter/releases>`__
+ - `v4.4.2-4.7.0 <https://github.com/NVIDIA/dcgm-exporter/releases>`__

  * - Node Feature Discovery
  - `v0.18.2 <https://github.com/kubernetes-sigs/node-feature-discovery/releases/>`__

  * - | NVIDIA GPU Feature Discovery
  | for Kubernetes
- - `0.18.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
+ - `0.18.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

  * - NVIDIA MIG Manager for Kubernetes
  - `0.13.0 <https://github.com/NVIDIA/mig-parted/blob/main/CHANGELOG.md>`__
+ - `0.13.1 <https://github.com/NVIDIA/mig-parted/blob/main/CHANGELOG.md>`__

  * - DCGM
  - `4.4.1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__
+ - `4.4.2-1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__

  * - Validator for NVIDIA GPU Operator
  - v25.10.0
@@ -169,4 +182,5 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
  version downloaded from the `NVIDIA Licensing Portal <https://ui.licensing.nvidia.com>`_.
  - The GPU Operator is supported on all active NVIDIA data center production drivers.
  Refer to `Supported Drivers and CUDA Toolkit Versions <https://docs.nvidia.com/datacenter/tesla/drivers/index.html#supported-drivers-and-cuda-toolkit-versions>`_
- for more information.
+ for more information.
+

gpu-operator/release-notes.rst

Lines changed: 50 additions & 0 deletions
@@ -33,6 +33,56 @@ Refer to the :ref:`GPU Operator Component Matrix` for a list of software compone

  ----

+
+
+ .. _v25.10.1:
+
+ 25.10.1
+ =======
+
+ New Features
+ ------------
+
+ * Updated software component versions:
+
+ - NVIDIA Container Toolkit v1.18.1
+ - NVIDIA DCGM v4.4.2-1
+ - NVIDIA DCGM Exporter v4.4.2-4.7.0
+ - NVIDIA Kubernetes Device Plugin v0.18.1
+ - NVIDIA GPU Feature Discovery v0.18.1
+ - NVIDIA MIG Manager for Kubernetes 0.13.1
+ - NVIDIA Driver Manager for Kubernetes v0.9.1
+
+ * Added support for this NVIDIA Data Center GPU Driver version:
+
+ - 580.105.08 (default)
+
+ * Add HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.
+
+ Configure the HPC job mapping by setting the ``dcgmExporter.hpcJobMapping.enabled`` field to ``true`` in the ClusterPolicy custom resource.
+ Set ``dcgmExporter.hpcJobMapping.directory`` with the directory path where HPC job mapping files are created by the workload manager.
+ The default directory is ``/var/lib/dcgm-exporter/job-mapping``.
+
+ * Improved the cluster policy reconciler to be more resilient to race conditions during node updates.
+
+ Fixed Issues
+ ------------
+
+ * Fixed the following known issue introduced in GPU Operator v25.10.0:
+
+ * When using cri-o as the container runtime, several GPU Operator pods can be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during GPU Operator installation or upgrade, or during GPU driver daemonset upgrade.
+ * NVIDIA Container Toolkit 1.18.0 overwrites the imports field in the top-level containerd configuration file, so any previously imported paths are lost.
+ This was fixed in NVIDIA Container Toolkit v1.18.1.
+
+ * Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not being applied by the driver daemonset.
+ For more information, refer to `PR #1939 <https://github.com/NVIDIA/gpu-operator/pull/1939>`__.
+
+ * Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
+ For more information, refer to `Issue #1622 <https://github.com/NVIDIA/gpu-operator/issues/1622>`__.
+ * Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
+ * Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and ``USE_HOST_MOFED`` is set to true despite the MOFED being pre-installed on the host.
+
+
  .. _v25.10.0:

  25.10.0
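
Reviewer note on the HPC job mapping entry in the 25.10.1 release notes: the two fields it names, ``dcgmExporter.hpcJobMapping.enabled`` and ``dcgmExporter.hpcJobMapping.directory``, could be set with a merge patch along the lines of the sketch below. This is only an illustration, not the documented procedure. The field names and the default directory come from the release note; the helper name ``hpc_job_mapping_patch`` is hypothetical, and nesting the block under ``spec`` is an assumption about the ClusterPolicy layout.

```python
import json

# Default directory as stated in the 25.10.1 release note.
DEFAULT_JOB_MAPPING_DIR = "/var/lib/dcgm-exporter/job-mapping"


def hpc_job_mapping_patch(directory=DEFAULT_JOB_MAPPING_DIR):
    """Build a merge-patch body that enables DCGM Exporter HPC job mapping.

    Assumption: the dcgmExporter block sits under the ClusterPolicy `spec`.
    """
    return {
        "spec": {
            "dcgmExporter": {
                "hpcJobMapping": {
                    "enabled": True,          # dcgmExporter.hpcJobMapping.enabled
                    "directory": directory,   # dcgmExporter.hpcJobMapping.directory
                }
            }
        }
    }


# Serialize the patch so it can be handed to a Kubernetes client or CLI.
patch = hpc_job_mapping_patch()
print(json.dumps(patch, indent=2))
```

The printed JSON could then be passed to ``kubectl patch`` with ``--type merge`` against the ClusterPolicy resource. The resource name depends on the installation and is not given in this commit.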

repo.toml

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ docs_root = "${root}/gpu-operator"
  project = "gpu-operator"
  name = "NVIDIA GPU Operator"
  version = "25.10" # Update repo_docs.projects.openshift.version to match latest patch version maj.min.patch
- source_substitutions = { minor_version = "25.10", version = "v25.10.0", recommended = "580.95.05" }
+ source_substitutions = { minor_version = "25.10", version = "v25.10.1", recommended = "580.105.08" }
  copyright_start = 2020
  sphinx_exclude_patterns = [
  "life-cycle-policy.rst",
