Commit f2c8e22

Update n-series-driver-setup.md
Lots of Acrolinx changes
1 parent a76b42a commit f2c8e22

1 file changed: +17 -16 lines changed

articles/virtual-machines/linux/n-series-driver-setup.md

Lines changed: 17 additions & 16 deletions
@@ -36,18 +36,19 @@ To install CUDA drivers, make an SSH connection to each VM. To verify that the s
```bash
lspci | grep -i NVIDIA
```
-You will see output similar to the following example (showing an NVIDIA Tesla K80 card):
+Output is similar to the following example (showing an NVIDIA Tesla K80 card):

![lspci command output](./media/n-series-driver-setup/lspci.png)

-lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL (instructions below).
+lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL.
+
Then run installation commands specific for your distribution.
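
The `lspci` check in this hunk can be scripted before choosing a distribution-specific install path. The following is a minimal sketch and not part of the commit: the function name `count_nvidia_devices` is hypothetical, and it only counts the lines of `lspci` output that mention NVIDIA, which isn't necessarily the number of GPUs.

```shell
#!/bin/sh
# Hypothetical helper (not from the article): count the lines of
# lspci-style output on stdin that mention NVIDIA, case-insensitively.
count_nvidia_devices() {
    # grep -c prints the match count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status clean.
    grep -ci 'nvidia' || true
}

# Only query the real PCIe bus when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    found=$(lspci | count_nvidia_devices)
    if [ "$found" -gt 0 ]; then
        echo "Found $found NVIDIA PCIe device line(s)."
    else
        echo "No NVIDIA devices listed; on CentOS/RHEL you may need to install LIS."
    fi
fi
```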

### Ubuntu

1. Download and install the CUDA drivers from the NVIDIA website.
> [!NOTE]
-> The example below shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
+> The example shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
>
> Visit the [NVIDIA Download Center](https://developer.download.nvidia.com/compute/cuda/repos/) or the [NVIDIA CUDA Resources page](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for the full path specific to each version.
>
@@ -79,16 +80,16 @@ sudo reboot

### CentOS or Red Hat Enterprise Linux

-1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
+1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel`, and `dkms` are appropriate for your kernel.

```
sudo yum install kernel kernel-tools kernel-headers kernel-devel
sudo reboot
```

-2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
+2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.

-Please note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Please refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
+Note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.

```bash
@@ -114,7 +115,7 @@ sudo reboot
> Visit [Fedora](https://dl.fedoraproject.org/pub/epel/) and [Nvidia CUDA repo](https://developer.download.nvidia.com/compute/cuda/repos/) to pick the correct package for the CentOS or RHEL version you want to use.
>

-For example, CentOS 8 and RHEL 8 will need the following steps.
+For example, CentOS 8 and RHEL 8 need the following steps.

```bash
sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
@@ -140,13 +141,13 @@ For example, CentOS 8 and RHEL 8 will need the following steps.

To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.

-If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
+If the driver is installed, Nvidia SMI will list the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

![NVIDIA device status](./media/n-series-driver-setup/smi.png)

## RDMA network connectivity

-RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:
+RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version:

### Distributions

@@ -192,7 +193,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
sudo apt-get install build-essential ubuntu-desktop -y
sudo apt-get install linux-azure -y
```
-3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To do this, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
+3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To disable the driver, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:

```
blacklist nouveau
@@ -229,7 +230,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
EnableUI=FALSE
```

-9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
+9. Remove the following from `/etc/nvidia/gridd.conf` if its present:

```
FeatureType=0
@@ -256,7 +257,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
blacklist lbm-nouveau
```

-3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
+3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.

Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.

@@ -288,13 +289,13 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
```

-8. Add the following to `/etc/nvidia/gridd.conf`:
+8. Add two lines to `/etc/nvidia/gridd.conf`:

```
IgnoreSP=FALSE
EnableUI=FALSE
```
-9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
+9. Remove one line from `/etc/nvidia/gridd.conf` if it is present:

```
FeatureType=0
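
Steps 8 and 9 in this hunk can be applied together in a script. This is a sketch under stated assumptions, not the article's procedure: `update_gridd_conf` is a hypothetical function, it relies on GNU `sed -i`, and it only appends a setting when that exact line is missing. It doesn't rewrite a line that sets the same key to a different value.

```shell
#!/bin/sh
# Hypothetical helper: apply the gridd.conf edits from steps 8 and 9
# to the file given as the first argument.
update_gridd_conf() {
    conf="$1"
    # Step 9: remove the FeatureType=0 line if it is present (GNU sed -i).
    sed -i '/^FeatureType=0$/d' "$conf"
    # Step 8: append each setting unless an identical line already exists.
    for setting in 'IgnoreSP=FALSE' 'EnableUI=FALSE'; do
        grep -qxF "$setting" "$conf" || printf '%s\n' "$setting" >> "$conf"
    done
}
```

To apply it to the file named in the steps, call `update_gridd_conf /etc/nvidia/gridd.conf` from a root shell.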
@@ -307,7 +308,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection

To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.

-If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
+If the driver is installed, Nvidia SMI will list the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

![Screenshot that shows the output when the GPU device state is queried.](./media/n-series-driver-setup/smi-nv.png)
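
As a text alternative to reading the table in the screenshot, `nvidia-smi` can emit selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below is an illustration only: `summarize_gpu_util` is a hypothetical helper name, and the idle/busy labels are an assumption layered on the statement that **GPU-Util** stays at 0% until a workload runs.

```shell
#!/bin/sh
# Hypothetical helper: read "index, utilization, driver" CSV rows on stdin
# and print one summary line per GPU.
summarize_gpu_util() {
    awk -F', *' '{
        # A utilization of 0 is labeled idle; anything else is labeled busy.
        state = ($2 + 0 == 0) ? "idle" : "busy"
        printf "GPU %s: %s%% utilization (%s), driver %s\n", $1, $2, state, $3
    }'
}

# Only query the driver when nvidia-smi is installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,driver_version \
               --format=csv,noheader,nounits | summarize_gpu_util
fi
```

A non-numeric utilization value is also labeled idle by this sketch, so treat the label as a convenience and not as a health check.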

@@ -356,7 +357,7 @@ Then, create an entry for your update script in `/etc/rc.d/rc3.d` so the script
* You can set persistence mode using `nvidia-smi` so the output of the command is faster when you need to query cards. To set persistence mode, execute `nvidia-smi -pm 1`. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
* If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, [reinstall the RDMA drivers](#rdma-network-connectivity) to reestablish that connectivity.
* During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an error “Unsupported kernel version” is thrown. Please report this error along with the OS and kernel versions.
-* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Note that less invasive methods such as `nvidia-smi -r` do not work with the virtualization solution deployed in Azure.
+* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Less invasive methods such as `nvidia-smi -r` don't work with the virtualization solution deployed in Azure.
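
The first bullet in this list suggests scripting the persistence-mode setting so it's reapplied after a restart. A minimal sketch of such a script follows. The function name and the optional command argument (which allows a dry run without a GPU) are assumptions, and how the script is registered to run at startup is left to your distribution.

```shell
#!/bin/sh
# Hypothetical startup helper: turn on persistence mode with "nvidia-smi -pm 1".
# The optional first argument names the command to run; it defaults to nvidia-smi.
enable_persistence_mode() {
    smi="${1:-nvidia-smi}"
    # Skip quietly on machines where the driver tools are not installed.
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "$smi not found; skipping persistence mode" >&2
        return 1
    fi
    "$smi" -pm 1
}
```

Calling `enable_persistence_mode` with no arguments from a boot-time script that runs as root reapplies the setting after each restart.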

## Next steps
