
Commit d8d49c1

Update troubleshooting style, add nokaslr guide
Signed-off-by: Abigail McCarthy <[email protected]>
1 parent 41120ae commit d8d49c1

File tree

1 file changed: +114 additions, -10 deletions


gpu-operator/troubleshooting.rst

Lines changed: 114 additions & 10 deletions
@@ -22,7 +22,7 @@ Troubleshooting the NVIDIA GPU Operator

This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.

-If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, its recommended that you run the ``must-gather`` utility, prepare a bug report, then file an issue in the `NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.
+If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, it is recommended that you run the ``must-gather`` utility, prepare a bug report, then file an issue in the `NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.

.. code-block:: console
@@ -98,7 +98,7 @@ Note that the operand pods will only come up when the driver daemonset and toolk

2. **Check the dmesg logs**

-   - ``dmesg`` displays the messages generated by the Linux Kernel. ``dmesg`` helps us detect any issues loading the GPU driver modules especially when the driver daemonset logs don't provide a lot of information
+   - ``dmesg`` displays the messages generated by the Linux Kernel. ``dmesg`` helps us detect any issues loading the GPU driver modules especially when the driver daemonset logs do not provide a lot of information
   - You can retrieve ``dmesg`` using either: kubectl exec or execute ``dmesg`` in your host terminal.

   kubectl exec
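As a minimal sketch of the ``kubectl exec`` route (the ``gpu-operator`` namespace, the ``app=nvidia-driver-daemonset`` label, and the pod name are assumptions; substitute the values from your cluster):

.. code-block:: console

   $ kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset
   $ kubectl exec -n gpu-operator nvidia-driver-daemonset-abc12 -- dmesg | grep -i -e nvidia -e nvrm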
@@ -166,7 +166,7 @@ The runtime handler is added by the nvidia-container-toolkit, so this error mess

3. **Review the container runtime configuration TOML**

-   - CRI-O and Containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the "nvidia" container runtime handler actually exists
+   - CRI-O and Containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the "nvidia" container runtime handler exists
   - Here are some ways to retrieve the container runtime config:

     - If using "containerd", run the ``containerd config`` command to retrieve the active containerd configuration
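For example, on a containerd node, a quick sketch of checking for the ``nvidia`` runtime handler (the exact TOML section path varies across containerd versions, so this is only an illustration):

.. code-block:: console

   $ sudo containerd config dump | grep -A 3 'runtimes.nvidia'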
@@ -279,7 +279,7 @@ GPU Node does not have the expected number of GPUs

When inspecting your GPU node, you may not see the expected number of "Allocatable" GPUs advertised in the node.

-For e.g., Given a GPU node with 8 GPUs, its kubectl describe output may look something like the snippet below:
+For example, given a GPU node with eight GPUs, the kubectl describe output might look like the following snippet:

.. code-block:: console
@@ -309,7 +309,7 @@ For e.g., Given a GPU node with 8 GPUs, its kubectl describe output may look som
   ....
   ....

-The above node only advertises 7 GPU devices as allocatable when we expect it to display 8 instead
+The above node only advertises seven GPU devices as allocatable when we expect it to display eight instead

.. rubric:: Action
   :class: h4
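As a quick cross-check of the advertised count, the node's allocatable GPU resource can be read directly; this is only a sketch and ``node1`` is a placeholder node name:

.. code-block:: console

   $ kubectl get node node1 -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"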
@@ -336,7 +336,7 @@ The above node only advertises 7 GPU devices as allocatable when we expect it to
DCGM Exporter pods go into CrashLoopBackoff
*******************************************

-By default, the GPU Operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. If the cluster networking settings aren't applied correctly, you would likely see error messages as mentioned below in the ``dcgm-exporter`` logs:
+By default, the GPU Operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. If the cluster networking settings are not applied correctly, you would likely see the following error messages in the ``dcgm-exporter`` logs:

.. code-block:: console
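In the standalone-``dcgm`` case, a first sanity check is whether the exporter can reach the dcgm Service at all. A rough sketch (the ``nvidia-dcgm`` Service name, the ``app=nvidia-dcgm-exporter`` label, and the ``gpu-operator`` namespace are assumptions for a default install):

.. code-block:: console

   $ kubectl get svc -n gpu-operator nvidia-dcgm
   $ kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter --tail=50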
@@ -374,7 +374,7 @@ Despite initiating a cluster-wide driver upgrade, not every driver daemonset get
      kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed

2. Check the driver daemonset pod logs in these nodes
-3. If the driver daemonset pod logs aren't informative, check the node's ``dmesg``
+3. If the driver daemonset pod logs are not informative, check the node's ``dmesg``
4. Once the issue is resolved, you can re-label the node with the command below:

.. code-block:: console
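For step 2, a sketch of pulling the driver pod logs for one of the affected nodes (the label, pod name, node name, and ``nvidia-driver-ctr`` container name are placeholders; adjust to your deployment):

.. code-block:: console

   $ kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide | grep node1
   $ kubectl logs -n gpu-operator nvidia-driver-daemonset-abc12 -c nvidia-driver-ctr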
@@ -392,7 +392,7 @@ Pods stuck in Pending state in mixed MIG + full GPU environments

For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
-For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
+For more detailed information, refer to GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.

.. rubric:: Observation
   :class: h4
@@ -665,7 +665,7 @@ GPU Operator pods in ``Init:RunContainerError`` or ``Init:CreateContainerError``
.. rubric:: Issue
   :class: h4

-If you are installing, upgrading, or upgrading the GPU driver daemonset to v25.10 or later with CRI-O as the container runtime, you may notice several of the GPU Operator pods are stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.
+If you are installing or upgrading the GPU driver daemonset to v25.10.0 with CRI-O as the container runtime, you may notice several of the GPU Operator pods are stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.

.. rubric:: Root Cause
   :class: h4
@@ -675,4 +675,108 @@ Refer to this `GitHub issue <https://github.com/cri-o/cri-o/issues/9521>`_ for d
.. rubric:: Action
   :class: h4

-The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.
+The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.
+
+This issue was fixed in GPU Operator v25.10.1 and later.
+
+***************************************************************************
+CUDA validator fails with "Failed to allocate device vector" error (KASLR)
+***************************************************************************
+
+.. rubric:: Issue
+   :class: h4
+
+CUDA validator fails with the following error message:
+
+.. code-block:: output
+
+   [Vector addition of 50000 elements] Failed to allocate device vector A (error code initialization error)!
+
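This message typically surfaces in the cuda-validator pod logs, which can be pulled with something along these lines (the pod name is a placeholder):

.. code-block:: console

   $ kubectl logs -n gpu-operator nvidia-cuda-validator-abc12 --all-containers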
+This issue has been observed with multiple types of GPUs:
+
+- L40
+- L40S
+- H200
+- RTX Pro 6000BSE
+
+.. rubric:: Root Cause
+   :class: h4
+
+This is a known issue with Kernel Address Space Layout Randomization (KASLR) affecting GPU initialization.
+Refer to the `CUDA Toolkit Release Notes <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id3>`_ for details.
+
+.. rubric:: Action
+   :class: h4
+
+To ensure stable GPU discovery and driver loading, disable KASLR on all GPU nodes by adding the kernel parameter ``nokaslr`` to the bootloader configuration.
+
+**Disable KASLR on Kubernetes clusters running with Ubuntu or RHEL**
+
+1. Edit ``/etc/default/grub`` and add ``nokaslr`` to ``GRUB_CMDLINE_LINUX_DEFAULT``:
+
+   .. code-block:: bash
+
+      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nokaslr"
+
+2. Update GRUB and reboot:
+
+   .. code-block:: console
+
+      $ sudo update-grub
+      $ sudo reboot
+
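Note that ``update-grub`` is a Debian/Ubuntu helper. On RHEL-family nodes the equivalent step is typically handled with ``grubby``; a sketch, to be verified against your distribution's documentation:

.. code-block:: console

   $ sudo grubby --update-kernel=ALL --args="nokaslr"
   $ sudo reboot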
+**Disable KASLR on OpenShift clusters**
+
+#. Create a MachineConfig targeting a specific node group:
+
+   .. code-block:: yaml
+
+      apiVersion: machineconfiguration.openshift.io/v1
+      kind: MachineConfig
+      metadata:
+        labels:
+          machineconfiguration.openshift.io/role: worker
+        name: disable-worker-kaslr
+      spec:
+        config:
+          ignition:
+            version: 3.2.0
+        kernelArguments:
+          - nokaslr
+
+   Refer to the `OpenShift documentation <https://docs.okd.io/latest/machine_configuration/machine-configs-configure.html#nodes-nodes-kernel-arguments_machine-configs-configure>`_ for more details.
+
+#. Apply the MachineConfig:
+
+   .. code-block:: console
+
+      $ oc apply -f nokaslr.yaml
+      machineconfig.machineconfiguration.openshift.io/disable-worker-kaslr created
+
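After the MachineConfig is created, the Machine Config Operator rolls it out and reboots the matching nodes one at a time. A sketch of watching that rollout (assuming the default ``worker`` pool):

.. code-block:: console

   $ oc get mcp worker
   $ oc get nodes -o wide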
+**Verify that KASLR is disabled**
+
+#. Confirm that KASLR is disabled by checking the kernel command line:
+
+   .. code-block:: console
+
+      $ cat /proc/cmdline
+
+#. Ensure the output includes the ``nokaslr`` flag. For example:
+
+   .. code-block:: output
+
+      [core@host1 ~]$ cat /proc/cmdline
+      BOOT_IMAGE=(hd5,gpt3)/boot/ostree/rhcos-169d36e079c1193bc7ab64f25220a2cf4335bd8daac20961e66e88d5451686af/vmlinuz-5.14.0-427.68.1.el9_4.x86_64 ignition.platform.id=metal ip=6c-fe-54-90-af-c0:dhcp ostree=/ostree/boot.0/rhcos/169d36e079c1193bc7ab64f25220a2cf4335bd8daac20961e66e88d5451686af/0 root=UUID=1f3af9d2-c485-482c-ad6e-29f2029dfdee rw rootflags=prjquota boot=UUID=078e4760-1950-eab4-ba54-a59dfca0b5c9 nokaslr systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0
+      [core@host1 ~]$
+
+If ``nokaslr`` is present, KASLR has been successfully disabled.
+
+.. note::
+
+   On OpenShift clusters, the update may not be applied automatically. If KASLR remains enabled after the MachineConfig change, manually drain the node to force the MachineConfig Operator to reboot and apply the update.
+
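A sketch of that manual drain (``node1`` is a placeholder; uncordon the node once it is back):

.. code-block:: console

   $ oc adm drain node1 --ignore-daemonsets --delete-emptydir-data
   $ oc adm uncordon node1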
+.. rubric:: Result
+   :class: h4
+
+The CUDA validator will run successfully after KASLR is disabled.
