This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.
If you encounter a GPU Operator or operand issue that is not documented in this guide, it is recommended that you run the ``must-gather`` utility, prepare a bug report, and then file an issue in the `NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.
.. code-block:: console
2. **Check the dmesg logs**
- ``dmesg`` displays the messages generated by the Linux kernel. It helps detect issues with loading the GPU driver modules, especially when the driver daemonset logs do not provide much information.
- You can retrieve the ``dmesg`` output either with ``kubectl exec`` or by running ``dmesg`` in a terminal on the host.
kubectl exec
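As a rough sketch of the ``kubectl exec`` approach, the following commands read the kernel log from inside the driver daemonset pod. The ``gpu-operator`` namespace and the ``app=nvidia-driver-daemonset`` label are assumptions based on a default installation and may differ in your cluster:

.. code-block:: console

   $ DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -n 1)
   $ kubectl exec -n gpu-operator ${DRIVER_POD} -- dmesg | grep -iE "nvidia|nvrm"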
3. **Review the container runtime configuration TOML**
- CRI-O and containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the ``nvidia`` container runtime handler exists.
- Here are some ways to retrieve the container runtime config:
- If using "containerd", run the ``containerd config`` command to retrieve the active containerd configuration
GPU Node does not have the expected number of GPUs
**************************************************
When inspecting your GPU node, you may not see the expected number of "Allocatable" GPUs advertised on the node.
For example, given a GPU node with eight GPUs, the ``kubectl describe`` output might look like the following snippet:
.. code-block:: console

   ....
   ....

The node above only advertises seven GPU devices as allocatable when we expect it to display eight.
.. rubric:: Action
   :class: h4
DCGM Exporter pods go into CrashLoopBackoff
*******************************************
By default, the GPU Operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. If the cluster networking settings are not applied correctly, you would likely see the following error messages in the ``dcgm-exporter`` logs:
.. code-block:: console
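To look for these messages yourself, you can tail the exporter logs; the namespace and label selector below are assumptions based on a default GPU Operator deployment:

.. code-block:: console

   $ kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter --tail=50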
kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed
2. Check the driver daemonset pod logs on these nodes
3. If the driver daemonset pod logs are not informative, check the node's ``dmesg``
4. Once the issue is resolved, you can re-label the node with the command below:
.. code-block:: console
Pods stuck in Pending state in mixed MIG + full GPU environments
*****************************************************************
For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
For more detailed information, refer to GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
.. rubric:: Observation
   :class: h4
GPU Operator pods in ``Init:RunContainerError`` or ``Init:CreateContainerError``
**********************************************************************************
.. rubric:: Issue
   :class: h4
If you are installing or upgrading the GPU Operator, or upgrading the GPU driver daemonset, to v25.10.0 with CRI-O as the container runtime, you may notice that several of the GPU Operator pods are stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.
.. rubric:: Root Cause
   :class: h4
.. rubric:: Action
   :class: h4
The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.

This issue was fixed in GPU Operator v25.10.1 and later.
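To confirm that the affected pods do recover, you can watch them until they reach the ``Running`` state; the command below assumes the default ``gpu-operator`` namespace:

.. code-block:: console

   $ kubectl get pods -n gpu-operator --watch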
CUDA validator fails with the following error message:

.. code-block:: output

   [Vector addition of 50000 elements] Failed to allocate device vector A (error code initialization error)!

This issue has been observed with multiple types of GPUs:

- L40
- L40S
- H200
- RTX Pro 6000BSE
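To see this error on a running cluster, you can check the logs of the CUDA validator pod; the namespace and label selector below are assumptions based on a default GPU Operator installation and may differ in your environment:

.. code-block:: console

   $ kubectl logs -n gpu-operator -l app=nvidia-cuda-validator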
.. rubric:: Root Cause
   :class: h4

This is a known issue with Kernel Address Space Layout Randomization (KASLR) affecting GPU initialization.
Refer to the `CUDA Toolkit Release Notes <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id3>`_ for details.

.. rubric:: Action
   :class: h4

To ensure stable GPU discovery and driver loading, disable KASLR on all GPU nodes by adding the kernel parameter ``nokaslr`` to the bootloader configuration.

**Disable KASLR on Kubernetes clusters running with Ubuntu or RHEL**

1. Edit ``/etc/default/grub`` and add ``nokaslr`` to ``GRUB_CMDLINE_LINUX_DEFAULT``:

   .. code-block:: bash

      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nokaslr"

2. Update GRUB and reboot:

   .. code-block:: console

      $ sudo update-grub
      $ sudo reboot
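Note that ``update-grub`` is a Debian/Ubuntu helper. On RHEL-based nodes the equivalent step is usually to regenerate the GRUB configuration or to append the argument with ``grubby``; the following is a sketch, and the exact tooling and paths can vary by release and boot mode:

.. code-block:: console

   $ sudo grubby --update-kernel=ALL --args="nokaslr"
   $ sudo reboot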
**Disable KASLR on OpenShift clusters**

#. Create a MachineConfig targeting a specific node group:

   .. code-block:: yaml

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: disable-worker-kaslr
      spec:
        config:
          ignition:
            version: 3.2.0
        kernelArguments:
          - nokaslr

   Refer to the `OpenShift documentation <https://docs.okd.io/latest/machine_configuration/machine-configs-configure.html#nodes-nodes-kernel-arguments_machine-configs-configure>`_ for more details.

#. Apply the MachineConfig:

   .. code-block:: console

      $ oc apply -f nokaslr.yaml
      machineconfig.machineconfiguration.openshift.io/disable-worker-kaslr created
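The Machine Config Operator then rolls the change out node by node. One way to watch the rollout, assuming the change targets the ``worker`` pool, is to monitor the affected MachineConfigPool until it reports ``UPDATED=True``:

.. code-block:: console

   $ oc get mcp worker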
**Verify that KASLR is disabled**

#. Confirm that KASLR is disabled by checking the kernel command line:

   .. code-block:: console

      $ cat /proc/cmdline

#. Ensure the output includes the ``nokaslr`` flag.
   If ``nokaslr`` is present, KASLR has been successfully disabled.
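For illustration only, a representative command line from a node where KASLR has been disabled might look like the following; the kernel image, root device, and other parameters will differ on your systems:

.. code-block:: output

   BOOT_IMAGE=/boot/vmlinuz-5.15.0-130-generic root=/dev/mapper/vg0-root ro quiet splash nokaslr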
.. note::

   On OpenShift clusters, the update may not be applied automatically. If KASLR remains enabled after the MachineConfig change, manually drain the node to force the Machine Config Operator to reboot and apply the update.
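A drain command along the following lines is typically used for this; the node name is a placeholder, and the flags shown are common choices rather than requirements:

.. code-block:: console

   $ oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data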
.. rubric:: Result
   :class: h4

The CUDA validator will run successfully after KASLR is disabled.