Merge pull request #233690 from mattmcinnes/patch-55

prmerger-automator[bot] · web-flow · commit 5fa9f681f47d · 2023-04-21T20:41:15.000Z
[Doc-a-thon] Updating n-series-driver-setup.md
diff --git a/articles/virtual-machines/linux/n-series-driver-setup.md b/articles/virtual-machines/linux/n-series-driver-setup.md
@@ -8,8 +8,9 @@ ms.subservice: sizes
 ms.collection: linux
 ms.topic: how-to
 ms.workload: infrastructure-services
-ms.date: 12/16/2022
+ms.date: 04/06/2023
 ms.author: vikancha
+ms.reviewer: padmalathas, mattmcinnes
 ---
 
 # Install NVIDIA GPU drivers on N-series VMs running Linux
@@ -35,18 +36,19 @@ To install CUDA drivers, make an SSH connection to each VM. To verify that the s
 ```bash
 lspci | grep -i NVIDIA
 ```
-You will see output similar to the following example (showing an NVIDIA Tesla K80 card):
+Output is similar to the following example (showing an NVIDIA Tesla K80 card):
 
 ![lspci command output](./media/n-series-driver-setup/lspci.png)
 
-lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL (instructions below).
+lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL.
+
 Then run installation commands specific for your distribution.
 
 ### Ubuntu 
 
 1. Download and install the CUDA drivers from the NVIDIA website. 
     > [!NOTE]
-   >  The example below shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use. 
+   >  The example shows the CUDA package path for Ubuntu 20.04. Replace the path with the one specific to the version you plan to use. 
    >  
    >  Visit the [NVIDIA Download Center](https://developer.download.nvidia.com/compute/cuda/repos/) or the [NVIDIA CUDA Resources page](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for the full path specific to each version.
    > 
@@ -78,16 +80,16 @@ sudo reboot
 
 ### CentOS or Red Hat Enterprise Linux
 
-1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
+1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
 
    ```
    sudo yum install kernel kernel-tools kernel-headers kernel-devel
    sudo reboot
    ```
 
-2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
+2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.
 
-   Please note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Please refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details. 
+   LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details. 
    Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.
 
       ```bash
@@ -113,7 +115,7 @@ sudo reboot
    >  Visit [Fedora](https://dl.fedoraproject.org/pub/epel/) and [Nvidia CUDA repo](https://developer.download.nvidia.com/compute/cuda/repos/) to pick the correct package for the CentOS or RHEL version you want to use.
    >  
 
-For example, CentOS 8 and RHEL 8 will need the following steps.
+For example, CentOS 8 and RHEL 8 need the following steps.
 
    ```bash
    sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
@@ -139,13 +141,13 @@ For example, CentOS 8 and RHEL 8 will need the following steps.
 
 To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver. 
 
-If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
+If the driver is installed, Nvidia SMI lists the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
 
 ![NVIDIA device status](./media/n-series-driver-setup/smi.png)
 
 ## RDMA network connectivity
 
-RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:
+RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version.
 
 ### Distributions
 
@@ -191,7 +193,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
    sudo apt-get install build-essential ubuntu-desktop -y
    sudo apt-get install linux-azure -y
    ```
-3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To do this, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
+3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To disable the driver, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
 
    ```
    blacklist nouveau
@@ -228,7 +230,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
    EnableUI=FALSE
    ```
    
-9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
+9. Remove the following from `/etc/nvidia/gridd.conf` if it's present:
  
    ```
    FeatureType=0
@@ -255,7 +257,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
    blacklist lbm-nouveau
    ```
 
-3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required. 
+3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required. 
 
    Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.
 
@@ -287,13 +289,13 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
    sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
    ```
   
-8. Add the following to `/etc/nvidia/gridd.conf`:
+8. Add two lines to `/etc/nvidia/gridd.conf`:
  
    ```
    IgnoreSP=FALSE
    EnableUI=FALSE 
    ```
-9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
+9. Remove one line from `/etc/nvidia/gridd.conf` if it is present:
  
    ```
    FeatureType=0
@@ -306,7 +308,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
 
 To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver. 
 
-If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
+If the driver is installed, Nvidia SMI lists the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
 
 ![Screenshot that shows the output when the GPU device state is queried.](./media/n-series-driver-setup/smi-nv.png)
  
@@ -355,7 +357,7 @@ Then, create an entry for your update script in `/etc/rc.d/rc3.d` so the script
 * You can set persistence mode using `nvidia-smi` so the output of the command is faster when you need to query cards. To set persistence mode, execute `nvidia-smi -pm 1`. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
 * If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, [reinstall the RDMA drivers](#rdma-network-connectivity) to reestablish that connectivity. 
 * During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an error “Unsupported kernel version” is thrown. Please report this error along with the OS and kernel versions.
-* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Note that less invasive methods such as `nvidia-smi -r` do not work with the virtualization solution deployed in Azure. 
+* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Less invasive methods such as `nvidia-smi -r` don't work with the virtualization solution deployed in Azure. 
 
 ## Next steps