
Commit 11fba2e

Add known issue to GPU Operator for issue 1361 (#201)
* Add footnote to Platform Support for issue 1361
* fix formatting
* add known issue to v25.3.1 release note
* add known issue to MIG
* add known issue to troubleshooting
* fix syntax
* remove unreleased version reference and fix capitalization
* accept recommendations
* rephrase
* change link syntax

Signed-off-by: Andrew Chen <[email protected]>
1 parent 85b6039 commit 11fba2e

File tree: 4 files changed (+64, -5 lines)


gpu-operator/gpu-operator-mig.rst

Lines changed: 8 additions & 0 deletions
@@ -97,6 +97,14 @@ Perform the following steps to install the Operator and configure MIG:
    In some cases, the node may need to be rebooted, such as a CSP, so the node might need to be cordoned
    before changing the MIG mode or the MIG geometry on the GPUs.
 
+   .. note::
+
+      Known Issue: For drivers 570.124.06, 570.133.20, and 570.148.08,
+      GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+      This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
+      It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
+      For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
 
 ************************
 Configuring MIG Profiles
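
The context lines above describe cordoning a node before changing the MIG mode or geometry. A minimal sketch of that workflow, assuming the default MIG Manager setup; the node name and the example MIG profile are placeholders, and the label should be confirmed against your GPU Operator version:

  # Cordon the node so no new GPU workloads land on it while MIG is reconfigured
  kubectl cordon <node-name>

  # Request a new MIG geometry through the node label watched by MIG Manager
  kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite

  # Uncordon after MIG Manager reports that the new configuration is applied
  kubectl uncordon <node-name>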

gpu-operator/life-cycle-policy.rst

Lines changed: 17 additions & 5 deletions
@@ -71,8 +71,10 @@ The product life cycle and versioning are subject to change in the future.
 GPU Operator Component Matrix
 *****************************
 
+.. _ki: #known-issue
+.. |ki| replace:: :sup:`1`
 .. _gds: #gds-open-kernel
-.. |gds| replace:: :sup:`1`
+.. |gds| replace:: :sup:`2`
 
 The following table shows the operands and default operand versions that correspond to a GPU Operator version.
 

@@ -86,9 +88,9 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
     - Version
 
   * - NVIDIA GPU Operator
-    - ${version}
+    - ${version}
 
-  * - NVIDIA GPU Driver
+  * - NVIDIA GPU Driver |ki|_
     - | `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
       | `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_ (default, recommended)
       | `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_

@@ -141,12 +143,22 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
   * - NVIDIA GDRCopy Driver
     - `v2.5.0 <https://github.com/NVIDIA/gdrcopy/releases>`__
 
-.. _gds-open-kernel:
+.. _known-issue:
 
 :sup:`1`
+Known Issue: For drivers 570.124.06, 570.133.20, and 570.148.08,
+GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
+It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
+For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
+
+.. _gds-open-kernel:
+
+:sup:`2`
 This release of the GDS driver requires that you use the NVIDIA Open GPU Kernel module driver for the GPUs.
 Refer to :doc:`gpu-operator-rdma` for more information.
-
+
 .. note::
 
   - Driver version could be different with NVIDIA vGPU, as it depends on the driver
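
The known-issue footnote above recommends downgrading to driver 570.86.15. A minimal sketch of that downgrade, assuming the Operator was installed with Helm and that the release name and namespace below match your installation:

  # Pin the driver to 570.86.15 instead of the 570.148.08 default
  helm upgrade gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --reuse-values \
      --set driver.version=570.86.15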

gpu-operator/release-notes.rst

Lines changed: 12 additions & 0 deletions
@@ -69,6 +69,18 @@ New Features
   You can configure this in the Helm chart value by setting `dcgmexporter.service.internalTrafficPolicy` to `Local` or `Cluster` (default).
   Choose Local if you want to route internal traffic within the node only.
 
+.. _v25.3.1-known-issues:
+
+Known Issues
+------------
+
+* For drivers 570.124.06, 570.133.20, and 570.148.08,
+  GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+  This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
+  It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
+  For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
+
 .. _v25.3.1-fixed-issues:
 
 Fixed Issues
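
For the DCGM Exporter traffic-policy value shown in the New Features context above, a hedged Helm example; the release name and namespace are assumptions, and the value key is used exactly as quoted in the release note:

  # Keep DCGM Exporter service traffic on the originating node only
  helm upgrade gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --reuse-values \
      --set dcgmexporter.service.internalTrafficPolicy=Local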

gpu-operator/troubleshooting.rst

Lines changed: 27 additions & 0 deletions
@@ -20,6 +20,33 @@
 Troubleshooting the NVIDIA GPU Operator
 #######################################
 
+****************************************************************
+Pods stuck in Pending state in mixed MIG + full GPU environments
+****************************************************************
+
+.. rubric:: Issue
+   :class: h4
+
+For drivers 570.124.06, 570.133.20, and 570.148.08,
+GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
+For more detailed information, see GitHub issue `NVIDIA/gpu-operator#1361 <https://github.com/NVIDIA/gpu-operator/issues/1361>`__.
+
+.. rubric:: Observation
+   :class: h4
+
+When a GPU pod is created on a node that has a mix of MIG slices and full GPUs,
+the GPU pod gets stuck indefinitely in the ``Pending`` state.
+
+.. rubric:: Root Cause
+   :class: h4
+
+This is due to a regression in NVML introduced in the R570 drivers starting from 570.124.06.
+
+.. rubric:: Action
+   :class: h4
+
+It's recommended that you downgrade to driver version 570.86.15 to work around this issue.
+
 ****************************************************
 GPU Operator Validator: Failed to Create Pod Sandbox
 ****************************************************
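
A quick way to confirm the symptom and the installed driver version described above before applying the workaround; the namespace and DaemonSet name assume a default GPU Operator install:

  # List pods stuck in Pending across the cluster
  kubectl get pods -A --field-selector=status.phase=Pending

  # Check which driver version the driver DaemonSet is running
  kubectl -n gpu-operator exec ds/nvidia-driver-daemonset -- \
      nvidia-smi --query-gpu=driver_version --format=csv,noheader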
