
Commit 11fba2e

Add known issue to GPU Operator for issue 1361 (#201)
* Add footnote to Platform Support for issue 1361
* fix formatting
* add known issue to v25.3.1 release note
* add known issue to MIG
* add known issue to troubleshooting
* fix syntax
* remove unreleased version reference and fix capitalization
* accept recommendations
* rephrase
* change link syntax

Signed-off-by: Andrew Chen <[email protected]>
1 parent 85b6039 commit 11fba2e

File tree: 4 files changed (+64, -5 lines)


gpu-operator/gpu-operator-mig.rst

Lines changed: 8 additions & 0 deletions
@@ -97,6 +97,14 @@ Perform the following steps to install the Operator and configure MIG:
    In some cases, the node may need to be rebooted, such as a CSP, so the node might need to be cordoned
    before changing the MIG mode or the MIG geometry on the GPUs.
 
+   .. note::
+
+      Known Issue: For drivers 570.124.06, 570.133.20, and 570.148.08,
+      GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+      This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
+      It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
+      For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
 
 ************************
 Configuring MIG Profiles
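
The context lines above describe cordoning a node before changing the MIG mode or geometry. A minimal sketch of that workflow, assuming the default MIG Manager setup; the node name and the example MIG profile are placeholders, and the label should be confirmed against your GPU Operator version:

  # Cordon the node so no new GPU workloads land on it while MIG is reconfigured
  kubectl cordon <node-name>

  # Request a new MIG geometry through the node label watched by MIG Manager
  kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite

  # Uncordon after MIG Manager reports that the new configuration is applied
  kubectl uncordon <node-name>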

gpu-operator/life-cycle-policy.rst

Lines changed: 17 additions & 5 deletions
@@ -71,8 +71,10 @@ The product life cycle and versioning are subject to change in the future.
 GPU Operator Component Matrix
 *****************************
 
+.. _ki: #known-issue
+.. |ki| replace:: :sup:`1`
 .. _gds: #gds-open-kernel
-.. |gds| replace:: :sup:`1`
+.. |gds| replace:: :sup:`2`
 
 The following table shows the operands and default operand versions that correspond to a GPU Operator version.
 

@@ -86,9 +88,9 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
     - Version
 
   * - NVIDIA GPU Operator
-    - ${version}
+    - ${version}
 
-  * - NVIDIA GPU Driver
+  * - NVIDIA GPU Driver |ki|_
     - | `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
       | `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_ (default, recommended)
       | `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_

@@ -141,12 +143,22 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
   * - NVIDIA GDRCopy Driver
     - `v2.5.0 <https://github.com/NVIDIA/gdrcopy/releases>`__
 
-.. _gds-open-kernel:
+.. _known-issue:
 
 :sup:`1`
+Known Issue: For drivers 570.124.06, 570.133.20, and 570.148.08,
+GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
+It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
+For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
+
+.. _gds-open-kernel:
+
+:sup:`2`
 This release of the GDS driver requires that you use the NVIDIA Open GPU Kernel module driver for the GPUs.
 Refer to :doc:`gpu-operator-rdma` for more information.
-
+
 .. note::
 
   - Driver version could be different with NVIDIA vGPU, as it depends on the driver
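
The known-issue footnote above recommends downgrading to driver 570.86.15. A minimal sketch of that downgrade, assuming the Operator was installed with Helm and that the release name and namespace below match your installation:

  # Pin the driver to 570.86.15 instead of the 570.148.08 default
  helm upgrade gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --reuse-values \
      --set driver.version=570.86.15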

gpu-operator/release-notes.rst

Lines changed: 12 additions & 0 deletions
@@ -69,6 +69,18 @@ New Features
   You can configure this in the Helm chart value by setting `dcgmexporter.service.internalTrafficPolicy` to `Local` or `Cluster` (default).
   Choose Local if you want to route internal traffic within the node only.
 
+.. _v25.3.1-known-issues:
+
+Known Issues
+------------
+
+* For drivers 570.124.06, 570.133.20, and 570.148.08,
+  GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+  This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
+  It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
+  For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
+
 .. _v25.3.1-fixed-issues:
 
 Fixed Issues
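
For the DCGM Exporter traffic-policy value shown in the New Features context above, a hedged Helm example; the release name and namespace are assumptions, and the value key is used exactly as quoted in the release note:

  # Keep DCGM Exporter service traffic on the originating node only
  helm upgrade gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --reuse-values \
      --set dcgmexporter.service.internalTrafficPolicy=Local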

gpu-operator/troubleshooting.rst

Lines changed: 27 additions & 0 deletions
@@ -20,6 +20,33 @@
 Troubleshooting the NVIDIA GPU Operator
 #######################################
 
+****************************************************************
+Pods stuck in Pending state in mixed MIG + full GPU environments
+****************************************************************
+
+.. rubric:: Issue
+   :class: h4
+
+For drivers 570.124.06, 570.133.20, and 570.148.08,
+GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
+.. rubric:: Observation
+   :class: h4
+
+When a GPU pod is created on a node that has a mix of MIG slices and full GPUs,
+the GPU pod gets stuck indefinitely in the ``Pending`` state.
+
+.. rubric:: Root Cause
+   :class: h4
+
+This is due to a regression in NVML introduced in the R570 drivers starting from 570.124.06.
+
+.. rubric:: Action
+   :class: h4
+
+It's recommended that you downgrade to driver version 570.86.15 to work around this issue.
+
 ****************************************************
 GPU Operator Validator: Failed to Create Pod Sandbox
 ****************************************************
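
A quick way to confirm the symptom and the installed driver version described above before applying the workaround; the namespace and DaemonSet name assume a default GPU Operator install:

  # List pods stuck in Pending across the cluster
  kubectl get pods -A --field-selector=status.phase=Pending

  # Check which driver version the driver DaemonSet is running
  kubectl -n gpu-operator exec ds/nvidia-driver-daemonset -- \
      nvidia-smi --query-gpu=driver_version --format=csv,noheader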
