lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL (instructions below).
43
+
lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL.
44
+
43
45
Then run installation commands specific for your distribution.
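As a rough illustration of the `lspci` check described above, the following sketch counts the PCIe devices whose description mentions NVIDIA. The function name `count_nvidia_devices` is hypothetical and not part of the original instructions, and the sketch assumes that the device descriptions contain the vendor name.

```bash
# Hypothetical helper: count PCIe devices whose lspci description mentions NVIDIA.
# Reads lspci-style output on stdin and prints the number of matching lines.
count_nvidia_devices() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -ci 'nvidia' || true
}

# Only query the hardware when lspci is actually installed.
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA devices found: $(lspci | count_nvidia_devices)"
fi
```

A count of zero on a GPU VM size suggests that the devices aren't visible to the guest yet, which on older CentOS/RHEL versions is the case where LIS may be needed.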
44
46
45
47
### Ubuntu
46
48
47
49
1. Download and install the CUDA drivers from the NVIDIA website.
48
50
> [!NOTE]
49
-
> The example below shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
51
+
> The example shows the CUDA package path for Ubuntu 20.04. Replace the path specific to the version you plan to use.
50
52
>
51
53
> Visit the [NVIDIA Download Center](https://developer.download.nvidia.com/compute/cuda/repos/) or the [NVIDIA CUDA Resources page](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for the full path specific to each version.
52
54
>
@@ -78,16 +80,16 @@ sudo reboot
78
80
79
81
### CentOS or Red Hat Enterprise Linux
80
82
81
-
1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
83
+
1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of `kernel-devel` and `dkms` are appropriate for your kernel.
2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
90
+
2. Install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.
89
91
90
-
Please note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Please refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
92
+
LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the [Linux Integration Services documentation](https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.
91
93
Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.
92
94
93
95
```bash
@@ -113,7 +115,7 @@ sudo reboot
113
115
> Visit [Fedora](https://dl.fedoraproject.org/pub/epel/) and [Nvidia CUDA repo](https://developer.download.nvidia.com/compute/cuda/repos/) to pick the correct package for the CentOS or RHEL version you want to use.
114
116
>
115
117
116
-
For example, CentOS 8 and RHEL 8 will need the following steps.
118
+
For example, CentOS 8 and RHEL 8 need the following steps.
@@ -139,13 +141,13 @@ For example, CentOS 8 and RHEL 8 will need the following steps.
139
141
140
142
To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.
141
143
142
-
If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
144
+
If the driver is installed, Nvidia SMI lists the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:
150
+
RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version:
149
151
150
152
### Distributions
151
153
@@ -191,7 +193,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To do this, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
196
+
3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv2 VMs.) To disable the driver, create a file in `/etc/modprobe.d` named `nouveau.conf` with the following contents:
195
197
196
198
```
197
199
blacklist nouveau
@@ -228,7 +230,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
228
230
EnableUI=FALSE
229
231
```
230
232
231
-
9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
233
+
9. Remove the following from `/etc/nvidia/gridd.conf` if it's present:
232
234
233
235
```
234
236
FeatureType=0
@@ -255,7 +257,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
255
257
blacklist lbm-nouveau
256
258
```
257
259
258
-
3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.
260
+
3. Reboot the VM, reconnect, and install the latest [Linux Integration Services for Hyper-V and Azure](https://www.microsoft.com/download/details.aspx?id=55106). Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.
259
261
260
262
Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.
261
263
@@ -287,13 +289,13 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
9. Remove the following from `/etc/nvidia/gridd.conf` if it is present:
298
+
9. Remove one line from `/etc/nvidia/gridd.conf` if it is present:
297
299
298
300
```
299
301
FeatureType=0
@@ -306,7 +308,7 @@ To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection
306
308
307
309
To query the GPU device state, SSH to the VM and run the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) command-line utility installed with the driver.
308
310
309
-
If the driver is installed, you will see output similar to the following. Note that **GPU-Util** shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
311
+
If the driver is installed, Nvidia SMI will list the **GPU-Util** as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.
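The query described above could also be scripted. The following is a minimal sketch that uses the `--query-gpu` and `--format` options of `nvidia-smi` to print the GPU name, driver version, and utilization as CSV. The function names `query_gpu_state` and `gpu_util_from_csv` are hypothetical, and the sketch assumes the `, ` field separator that `nvidia-smi` uses in CSV mode.

```bash
# Hypothetical helper: print name, driver version, and utilization per GPU as CSV.
query_gpu_state() {
    nvidia-smi --query-gpu=name,driver_version,utilization.gpu --format=csv,noheader
}

# Hypothetical helper: extract the utilization (third CSV field) from each line on stdin.
gpu_util_from_csv() {
    awk -F', ' '{ print $3 }'
}

# Only run the query when the driver utility is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    query_gpu_state | gpu_util_from_csv
fi
```

On an idle VM, each line of output should report zero utilization, in line with the **GPU-Util** behavior described above.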
310
312
311
313

312
314
@@ -355,7 +357,7 @@ Then, create an entry for your update script in `/etc/rc.d/rc3.d` so the script
355
357
* You can set persistence mode using `nvidia-smi` so the output of the command is faster when you need to query cards. To set persistence mode, execute `nvidia-smi -pm 1`. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
356
358
* If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, [reinstall the RDMA drivers](#rdma-network-connectivity) to reestablish that connectivity.
357
359
* During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an error “Unsupported kernel version” is thrown. Please report this error along with the OS and kernel versions.
358
-
* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Note that less invasive methods such as `nvidia-smi -r` do not work with the virtualization solution deployed in Azure.
360
+
* If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's [RMA criteria for ECC errors](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre). If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described [here](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#bl_reset_reboot). Less invasive methods such as `nvidia-smi -r` don't work with the virtualization solution deployed in Azure.
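The persistence mode tip in the list above mentions scripting the setting so that it runs at startup. One possible shape for such a script is sketched below. The function name `enable_persistence_mode` and its optional argument are assumptions made for illustration; only `nvidia-smi -pm 1` comes from the text.

```bash
# Hypothetical startup helper: re-enable persistence mode after each boot.
# The optional first argument names the utility to call (default: nvidia-smi).
enable_persistence_mode() {
    smi="${1:-nvidia-smi}"
    # Skip quietly on machines where the driver utility isn't installed.
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "skip: $smi not found" >&2
        return 1
    fi
    # -pm 1 turns persistence mode on; the setting is lost when the VM restarts.
    "$smi" -pm 1
}
```

Calling `enable_persistence_mode` from whatever startup mechanism the VM already uses would reapply the setting after each restart. Changing the mode typically requires root privileges.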