
Commit d8d49c1

Update troubleshooting style, add nokaslr guide
Signed-off-by: Abigail McCarthy <[email protected]>
1 parent 41120ae commit d8d49c1

File tree

1 file changed: +114 additions, -10 deletions


gpu-operator/troubleshooting.rst

Lines changed: 114 additions & 10 deletions
@@ -22,7 +22,7 @@ Troubleshooting the NVIDIA GPU Operator

This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.

-If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, its recommended that you run the ``must-gather`` utility, prepare a bug report, then file an issue in the `NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.
+If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, it is recommended that you run the ``must-gather`` utility, prepare a bug report, then file an issue in the `NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.

.. code-block:: console
@@ -98,7 +98,7 @@ Note that the operand pods will only come up when the driver daemonset and toolk

2. **Check the dmesg logs**

-   - ``dmesg`` displays the messages generated by the Linux Kernel. ``dmesg`` helps us detect any issues loading the GPU driver modules especially when the driver daemonset logs don't provide a lot of information
+   - ``dmesg`` displays the messages generated by the Linux Kernel. ``dmesg`` helps us detect any issues loading the GPU driver modules especially when the driver daemonset logs do not provide a lot of information
   - You can retrieve ``dmesg`` using either: kubectl exec or execute ``dmesg`` in your host terminal.

   kubectl exec
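As a minimal sketch of the ``kubectl exec`` route (the ``gpu-operator`` namespace, the ``app=nvidia-driver-daemonset`` label, and the pod name are assumptions; substitute the values from your cluster):

.. code-block:: console

   $ kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset
   $ kubectl exec -n gpu-operator nvidia-driver-daemonset-abc12 -- dmesg | grep -i -e nvidia -e nvrm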
@@ -166,7 +166,7 @@ The runtime handler is added by the nvidia-container-toolkit, so this error mess

3. **Review the container runtime configuration TOML**

-   - CRI-O and Containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the "nvidia" container runtime handler actually exists
+   - CRI-O and Containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the "nvidia" container runtime handler exists
   - Here are some ways to retrieve the container runtime config:

     - If using "containerd", run the ``containerd config`` command to retrieve the active containerd configuration
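For example, on a containerd node, a quick sketch of checking for the ``nvidia`` runtime handler (the exact TOML section path varies across containerd versions, so this is only an illustration):

.. code-block:: console

   $ sudo containerd config dump | grep -A 3 'runtimes.nvidia'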
@@ -279,7 +279,7 @@ GPU Node does not have the expected number of GPUs

When inspecting your GPU node, you may not see the expected number of "Allocatable" GPUs advertised in the node.

-For e.g., Given a GPU node with 8 GPUs, its kubectl describe output may look something like the snippet below:
+For example, given a GPU node with eight GPUs, the kubectl describe output might look like the following snippet:

.. code-block:: console
@@ -309,7 +309,7 @@ For e.g., Given a GPU node with 8 GPUs, its kubectl describe output may look som
   ....
   ....

-The above node only advertises 7 GPU devices as allocatable when we expect it to display 8 instead
+The above node only advertises seven GPU devices as allocatable when we expect it to display eight instead

.. rubric:: Action
   :class: h4
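As a quick cross-check of the advertised count, the node's allocatable GPU resource can be read directly; this is only a sketch and ``node1`` is a placeholder node name:

.. code-block:: console

   $ kubectl get node node1 -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"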
@@ -336,7 +336,7 @@ The above node only advertises 7 GPU devices as allocatable when we expect it to
DCGM Exporter pods go into CrashLoopBackoff
*******************************************

-By default, the GPU Operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. If the cluster networking settings aren't applied correctly, you would likely see error messages as mentioned below in the ``dcgm-exporter`` logs:
+By default, the GPU Operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. If the cluster networking settings are not applied correctly, you would likely see the following error messages in the ``dcgm-exporter`` logs:

.. code-block:: console
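In the standalone-``dcgm`` case, a first sanity check is whether the exporter can reach the dcgm Service at all. A rough sketch (the ``nvidia-dcgm`` Service name, the ``app=nvidia-dcgm-exporter`` label, and the ``gpu-operator`` namespace are assumptions for a default install):

.. code-block:: console

   $ kubectl get svc -n gpu-operator nvidia-dcgm
   $ kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter --tail=50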
@@ -374,7 +374,7 @@ Despite initiating a cluster-wide driver upgrade, not every driver daemonset get
      kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed

2. Check the driver daemonset pod logs in these nodes
-3. If the driver daemonset pod logs aren't informative, check the node's ``dmesg``
+3. If the driver daemonset pod logs are not informative, check the node's ``dmesg``
4. Once the issue is resolved, you can re-label the node with the command below:

.. code-block:: console
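For step 2, a sketch of pulling the driver pod logs for one of the affected nodes (the label, pod name, node name, and ``nvidia-driver-ctr`` container name are placeholders; adjust to your deployment):

.. code-block:: console

   $ kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide | grep node1
   $ kubectl logs -n gpu-operator nvidia-driver-daemonset-abc12 -c nvidia-driver-ctr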
@@ -392,7 +392,7 @@ Pods stuck in Pending state in mixed MIG + full GPU environments

For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
-For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
+For more detailed information, refer to GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.

.. rubric:: Observation
   :class: h4
@@ -665,7 +665,7 @@ GPU Operator pods in ``Init:RunContainerError`` or ``Init:CreateContainerError``
.. rubric:: Issue
   :class: h4

-If you are installing, upgrading, or upgrading the GPU driver daemonset to v25.10 or later with CRI-O as the container runtime, you may notice several of the GPU Operator pods are stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.
+If you are installing or upgrading the GPU driver daemonset to v25.10.0 with CRI-O as the container runtime, you may notice several of the GPU Operator pods are stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.

.. rubric:: Root Cause
   :class: h4
@@ -675,4 +675,108 @@ Refer to this `GitHub issue <https://github.com/cri-o/cri-o/issues/9521>`_ for d
.. rubric:: Action
   :class: h4

-The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.
+The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.
+
+This issue was fixed in GPU Operator v25.10.1 and later.
+
+***************************************************************************
+CUDA validator fails with "Failed to allocate device vector" error (KASLR)
+***************************************************************************
+
+.. rubric:: Issue
+   :class: h4
+
+CUDA validator fails with the following error message:
+
+.. code-block:: output
+
+   [Vector addition of 50000 elements] Failed to allocate device vector A (error code initialization error)!
+
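This message typically surfaces in the cuda-validator pod logs, which can be pulled with something along these lines (the pod name is a placeholder):

.. code-block:: console

   $ kubectl logs -n gpu-operator nvidia-cuda-validator-abc12 --all-containers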
+This issue has been observed with multiple types of GPUs:
+
+- L40
+- L40S
+- H200
+- RTX Pro 6000BSE
+
+.. rubric:: Root Cause
+   :class: h4
+
+This is a known issue with Kernel Address Space Layout Randomization (KASLR) affecting GPU initialization.
+Refer to the `CUDA Toolkit Release Notes <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id3>`_ for details.
+
+.. rubric:: Action
+   :class: h4
+
+To ensure stable GPU discovery and driver loading, disable KASLR on all GPU nodes by adding the kernel parameter ``nokaslr`` to the bootloader configuration.
+
+**Disable KASLR on Kubernetes clusters running with Ubuntu or RHEL**
+
+1. Edit ``/etc/default/grub`` and add ``nokaslr`` to ``GRUB_CMDLINE_LINUX_DEFAULT``:
+
+   .. code-block:: bash
+
+      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nokaslr"
+
+2. Update GRUB and reboot:
+
+   .. code-block:: console
+
+      $ sudo update-grub
+      $ sudo reboot
+
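Note that ``update-grub`` is a Debian/Ubuntu helper. On RHEL-family nodes the equivalent step is typically handled with ``grubby``; a sketch, to be verified against your distribution's documentation:

.. code-block:: console

   $ sudo grubby --update-kernel=ALL --args="nokaslr"
   $ sudo reboot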
+**Disable KASLR on OpenShift clusters**
+
+#. Create a MachineConfig targeting a specific node group:
+
+   .. code-block:: yaml
+
+      apiVersion: machineconfiguration.openshift.io/v1
+      kind: MachineConfig
+      metadata:
+        labels:
+          machineconfiguration.openshift.io/role: worker
+        name: disable-worker-kaslr
+      spec:
+        config:
+          ignition:
+            version: 3.2.0
+        kernelArguments:
+          - nokaslr
+
+   Refer to the `OpenShift documentation <https://docs.okd.io/latest/machine_configuration/machine-configs-configure.html#nodes-nodes-kernel-arguments_machine-configs-configure>`_ for more details.
+
+#. Apply the MachineConfig:
+
+   .. code-block:: console
+
+      $ oc apply -f nokaslr.yaml
+      machineconfig.machineconfiguration.openshift.io/disable-worker-kaslr created
+
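After the MachineConfig is created, the Machine Config Operator rolls it out and reboots the matching nodes one at a time. A sketch of watching that rollout (assuming the default ``worker`` pool):

.. code-block:: console

   $ oc get mcp worker
   $ oc get nodes -o wide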
+**Verify that KASLR is disabled**
+
+#. Confirm that KASLR is disabled by checking the kernel command line:
+
+   .. code-block:: console
+
+      $ cat /proc/cmdline
+
+#. Ensure the output includes the ``nokaslr`` flag. For example:
+
+   .. code-block:: output
+
+      [core@host1 ~]$ cat /proc/cmdline
+      BOOT_IMAGE=(hd5,gpt3)/boot/ostree/rhcos-169d36e079c1193bc7ab64f25220a2cf4335bd8daac20961e66e88d5451686af/vmlinuz-5.14.0-427.68.1.el9_4.x86_64 ignition.platform.id=metal ip=6c-fe-54-90-af-c0:dhcp ostree=/ostree/boot.0/rhcos/169d36e079c1193bc7ab64f25220a2cf4335bd8daac20961e66e88d5451686af/0 root=UUID=1f3af9d2-c485-482c-ad6e-29f2029dfdee rw rootflags=prjquota boot=UUID=078e4760-1950-eab4-ba54-a59dfca0b5c9 nokaslr systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0
+      [core@host1 ~]$
+
+If ``nokaslr`` is present, KASLR has been successfully disabled.
+
+.. note::
+
+   On OpenShift clusters, the update may not be applied automatically. If KASLR remains enabled after the MachineConfig change, manually drain the node to force the MachineConfig Operator to reboot and apply the update.
+
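A sketch of that manual drain (``node1`` is a placeholder; uncordon the node once it is back):

.. code-block:: console

   $ oc adm drain node1 --ignore-daemonsets --delete-emptydir-data
   $ oc adm uncordon node1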
+.. rubric:: Result
+   :class: h4
+
+The CUDA validator will run successfully after KASLR is disabled.
