lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL (instructions below).
43
+
lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL.
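
A quick way to read the `lspci` result is to count the entries that mention NVIDIA. The sketch below is illustrative only: `count_nvidia_devices` is a hypothetical helper name, and it assumes the GPU lines in the `lspci` output contain the vendor string "NVIDIA". It reads text on standard input so it can be tried against saved output.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any driver package).
# Counts lines of lspci-style text on stdin that mention NVIDIA.
# Assumption: GPU entries include the vendor string "NVIDIA".
count_nvidia_devices() {
  # grep -c prints the number of matching lines; "|| true" keeps the
  # exit status at 0 when the count is zero.
  grep -ci 'nvidia' || true
}
```

On a VM, this could be used as `lspci | count_nvidia_devices`; a result of 0 suggests the GPUs are not visible yet.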
44
+
44
45
Then run installation commands specific for your distribution.
45
46
46
47
### Ubuntu
47
48
48
49
1. Download and install the CUDA drivers from the NVIDIA website.
49
50
> [!NOTE]
50
-
> The example below shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
51
+
> The example shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
51
52
>
52
53
> Visit the [NVIDIA Download Center](https://developer.download.nvidia.com/compute/cuda/repos/) or the [NVIDIA CUDA Resources page](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for the full path specific to each version.
53
54
>
@@ -79,16 +80,16 @@ sudo reboot
79
80
80
81
### CentOS or Red Hat Enterprise Linux
81
82
82
-
1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
83
+
1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
90
+
2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.
90
91
91
-
Please note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Please refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
92
+
Note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
92
93
Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.
93
94
94
95
```bash
@@ -114,7 +115,7 @@ sudo reboot
114
115
> Visit [Fedora](https://dl.fedoraproject.org/pub/epel/) and [Nvidia CUDA repo](https://developer.download.nvidia.com/compute/cuda/repos/) to pick the correct package for the CentOS or RHEL version you want to use.
115
116
>
116
117
117
-
For example, CentOS 8 and RHEL 8 will need the following steps.
118
+
For example, CentOS 8 and RHEL 8 need the following steps.
@@ -140,13 +141,13 @@ For example, CentOS 8 and RHEL 8 will need the following steps.
140
141
141
142
To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.
142
143
143
-
If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
144
+
If the driver is installed, `nvidia-smi` lists **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:
150
+
RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version:
150
151
151
152
### Distributions
152
153
@@ -192,7 +193,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To do this, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
196
+
3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To disable the driver, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
196
197
197
198
```
198
199
blacklist nouveau
@@ -229,7 +230,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
229
230
EnableUI=FALSE
230
231
```
231
232
232
-
9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
233
+
9. Remove the following from `/etc/nvidia/gridd.conf` if it's present:
233
234
234
235
```
235
236
FeatureType=0
@@ -256,7 +257,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
256
257
blacklist lbm-nouveau
257
258
```
258
259
259
-
3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
260
+
3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.
260
261
261
262
Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.
262
263
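
The version rule in the step above (LIS is no longer required on CentOS/RHEL 7.8 or higher) can be expressed as a small check. This is a minimal sketch under stated assumptions: `lis_may_be_needed` is a hypothetical function name, the release is passed in as a dotted string such as `7.7`, and any release below 7.8 is treated as one where LIS may still be needed. The `lspci` output remains the deciding factor.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) when the given
# CentOS/RHEL release is older than 7.8, where LIS may still be needed.
# Assumption: the release is passed as "major.minor[.patch]".
lis_may_be_needed() {
  local major minor
  # Split "7.7" or "7.9.2009" on dots; extra fields are ignored.
  IFS=. read -r major minor _ <<< "$1"
  minor=${minor:-0}
  # 10# forces base-10 so values with a leading zero are not read as octal.
  (( 10#$major < 7 || (10#$major == 7 && 10#$minor < 8) ))
}
```

For example, `lis_may_be_needed 7.7 && echo "check lspci, then consider LIS"` prints the message, while the same call with `7.8` prints nothing.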
@@ -288,13 +289,13 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
298
+
9. Remove one line from `/etc/nvidia/gridd.conf` if it is present:
298
299
299
300
```
300
301
FeatureType=0
@@ -307,7 +308,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
307
308
308
309
To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.
309
310
310
-
If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
311
+
If the driver is installed, `nvidia-smi` lists **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
311
312
312
313

313
314
@@ -356,7 +357,7 @@ Then, create an entry for your update script in `/etc/rc.d/rc3.d` so the script
356
357
* You can set persistence mode using `nvidia-smi` so the output of the command is faster when you need to query cards. To set persistence mode, execute `nvidia-smi -pm 1`. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
357
358
* If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, [reinstall the RDMA drivers](#rdma-network-connectivity) to reestablish that connectivity.
358
359
* During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an error “Unsupported kernel version” is thrown. Please report this error along with the OS and kernel versions.
359
-
* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Note that less invasive methods such as `nvidia-smi -r` do not work with the virtualization solution deployed in Azure.
360
+
* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Less invasive methods such as `nvidia-smi -r` don't work with the virtualization solution deployed in Azure.
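
Because the persistence mode setting is lost when the VM restarts, the `nvidia-smi -pm 1` call mentioned in the list above could be wrapped in a small function and invoked from a startup script. The following is a sketch, not a definitive implementation: `enable_persistence_mode` and the `NVIDIA_SMI` override variable are hypothetical names introduced here for illustration, and how the function is hooked into startup is left to your distribution's mechanism.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn on persistence mode if nvidia-smi is available.
# NVIDIA_SMI is an illustrative override (not an NVIDIA-defined variable)
# that lets the logic be exercised on a machine without a GPU.
enable_persistence_mode() {
  local smi="${NVIDIA_SMI:-nvidia-smi}"
  # Fail early with a clear message if the utility is not on PATH.
  if ! command -v "$smi" >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the driver installed?" >&2
    return 1
  fi
  # -pm 1 enables persistence mode, as described in the list above.
  "$smi" -pm 1
}
```

Calling `enable_persistence_mode` from a script that runs at boot (with root privileges) would reapply the setting after each restart.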