Commit 5fa9f68

Merge pull request #233690 from mattmcinnes/patch-55
[Doc-a-thon] Updating n-series-driver-setup.md
2 parents 3d96c87 + 2b61c88 commit 5fa9f68

1 file changed: +19 -17 lines changed

articles/virtual-machines/linux/n-series-driver-setup.md

Lines changed: 19 additions & 17 deletions
@@ -8,8 +8,9 @@ ms.subservice: sizes
 ms.collection: linux
 ms.topic: how-to
 ms.workload: infrastructure-services
-ms.date: 12/16/2022
+ms.date: 04/06/2023
 ms.author: vikancha
+ms.reviewer: padmalathas, mattmcinnes
 ---

 # Install NVIDIA GPU drivers on N-series VMs running Linux
@@ -35,18 +36,19 @@ To install CUDA drivers, make an SSH connection to each VM. To verify that the s
 ```bash
 lspci | grep -i NVIDIA
 ```
-You will see output similar to the following example (showing an NVIDIA Tesla K80 card):
+Output is similar to the following example (showing an NVIDIA Tesla K80 card):

 ![lspci command output](./media/n-series-driver-setup/lspci.png)

-lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL (instructions below).
+lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL.
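
As a rough illustration of the `lspci` check described above, the sketch below counts NVIDIA entries in `lspci`-style output so the result can be used in a script. The function name `count_nvidia_devices` is hypothetical and not part of the article; it only wraps the same `grep -i NVIDIA` filter.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the article): count NVIDIA entries in
# lspci-style text read from standard input.
count_nvidia_devices() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # the count is 0, so "|| true" keeps the function usable under "set -e".
  grep -ci 'nvidia' || true
}

# Example usage on the VM (commented out so the sketch has no side effects):
#   if [ "$(lspci | count_nvidia_devices)" -eq 0 ]; then
#     echo "No NVIDIA device listed; LIS may be needed on CentOS/RHEL." >&2
#   fi
```

A count of zero only indicates that no NVIDIA device was listed; whether LIS is the cause still depends on the distribution and version.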
+
 Then run installation commands specific for your distribution.

 ### Ubuntu

 1. Download and install the CUDA drivers from the NVIDIA website.
 > [!NOTE]
-> The example below shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
+> The example shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
 >
 > Visit the [NVIDIA Download Center](https://developer.download.nvidia.com/compute/cuda/repos/) or the [NVIDIA CUDA Resources page](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for the full path specific to each version.
 >
@@ -78,16 +80,16 @@ sudo reboot

 ### CentOS or Red Hat Enterprise Linux

-1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
+1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel`, and `dkms` are appropriate for your kernel.

 ```
 sudo yum install kernel kernel-tools kernel-headers kernel-devel
 sudo reboot
 ```

-2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
+2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.

-Please note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Please refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
+LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
 Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.

 ```bash
@@ -113,7 +115,7 @@ sudo reboot
 > Visit [Fedora](https://dl.fedoraproject.org/pub/epel/) and [Nvidia CUDA repo](https://developer.download.nvidia.com/compute/cuda/repos/) to pick the correct package for the CentOS or RHEL version you want to use.
 >

-For example, CentOS 8 and RHEL 8 will need the following steps.
+For example, CentOS 8 and RHEL 8 need the following steps.

 ```bash
 sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
@@ -139,13 +141,13 @@ For example, CentOS 8 and RHEL 8 will need the following steps.

 To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.

-If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
+If the driver is installed, Nvidia SMI lists the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

 ![NVIDIA device status](./media/n-series-driver-setup/smi.png)
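
For scripted checks, `nvidia-smi` can also print selected fields as CSV instead of the table in the screenshot. The sketch below is only an assumption about how that output might be consumed; `busy_gpus` is a hypothetical name, and it expects one `index, utilization` pair per line, as produced by `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the article): read "index, utilization"
# CSV lines on standard input and print the index of every GPU whose
# utilization is above 0%.
busy_gpus() {
  # Strip blanks from the utilization field, then compare numerically.
  # Non-numeric values evaluate to 0 and are therefore not reported.
  awk -F',' '{ gsub(/[ \t]/, "", $2); if ($2 + 0 > 0) print $1 }'
}

# Example usage on the VM (commented out so the sketch has no side effects):
#   nvidia-smi --query-gpu=index,utilization.gpu \
#              --format=csv,noheader,nounits | busy_gpus
```

On an idle VM the function prints nothing, which matches the 0% **GPU-Util** described above.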

 ## RDMA network connectivity

-RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:
+RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version:

 ### Distributions

@@ -191,7 +193,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
 sudo apt-get install build-essential ubuntu-desktop -y
 sudo apt-get install linux-azure -y
 ```
-3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To do this, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
+3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To disable the driver, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:

 ```
 blacklist nouveau
@@ -228,7 +230,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
 EnableUI=FALSE
 ```

-9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
+9. Remove the following from `/etc/nvidia/gridd.conf` if it's present:

 ```
 FeatureType=0
@@ -255,7 +257,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
 blacklist lbm-nouveau
 ```

-3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
+3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.

 Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.

@@ -287,13 +289,13 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
 sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
 ```

-8. Add the following to `/etc/nvidia/gridd.conf`:
+8. Add two lines to `/etc/nvidia/gridd.conf`:

 ```
 IgnoreSP=FALSE
 EnableUI=FALSE
 ```
-9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
+9. Remove one line from `/etc/nvidia/gridd.conf` if it is present:

 ```
 FeatureType=0
@@ -306,7 +308,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection

 To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.

-If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
+If the driver is installed, Nvidia SMI will list the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

 ![Screenshot that shows the output when the GPU device state is queried.](./media/n-series-driver-setup/smi-nv.png)

@@ -355,7 +357,7 @@ Then, create an entry for your update script in `/etc/rc.d/rc3.d` so the script
 * You can set persistence mode using `nvidia-smi` so the output of the command is faster when you need to query cards. To set persistence mode, execute `nvidia-smi -pm 1`. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
 * If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, [reinstall the RDMA drivers](#rdma-network-connectivity) to reestablish that connectivity.
 * During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an error “Unsupported kernel version” is thrown. Please report this error along with the OS and kernel versions.
-* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Note that less invasive methods such as `nvidia-smi -r` do not work with the virtualization solution deployed in Azure.
+* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Less invasive methods such as `nvidia-smi -r` don't work with the virtualization solution deployed in Azure.
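
The first troubleshooting item notes that persistence mode is lost on restart and can be scripted to run at startup. A minimal sketch of such a script follows; the function name `enable_persistence_mode` and the `NVIDIA_SMI` override variable are hypothetical, and how the script is registered to run at boot depends on the distribution and is not shown.

```bash
#!/usr/bin/env bash
# Hypothetical startup helper (names are assumptions, not from the article).
# NVIDIA_SMI can be overridden, for example when nvidia-smi is not on PATH.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

enable_persistence_mode() {
  # Fail early with a readable message if the driver tools are missing.
  if ! command -v "$NVIDIA_SMI" >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
    return 1
  fi
  # Same command as in the troubleshooting note; it typically needs root.
  "$NVIDIA_SMI" -pm 1
}

# Call the function from whatever startup mechanism the VM uses:
#   enable_persistence_mode
```

Keeping the check and the command in one function lets the startup hook log a clear error on images where the driver has not been installed yet.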

 ## Next steps
