Commit 214d814

Merge pull request #1 from mimckitt/patch-46
Update hb-hc-known-issues.md
2 parents c44c143 + 90b421e commit 214d814

File tree

1 file changed: +8 -8 lines changed

articles/virtual-machines/workloads/hpc/hb-hc-known-issues.md

Lines changed: 8 additions & 8 deletions
@@ -17,23 +17,23 @@ author: mamccrea
This article attempts to list recent common issues and their solutions when using the [H-series](../../sizes-hpc.md) and [N-series](../../sizes-gpu.md) HPC and GPU VMs.

## InfiniBand Errors on HBv3
- As of the week of August 12, we have identified a bug in the firmware of the ConnectX-6 InfiniBand NIC adapters in HBv3-series VMs that can cause MPI jobs to fail on a transient basis. This issue applies to all VM sizes within the HBv3-series. This issue does not apply to other H-series VMs (HB-series, HBv2-series, or HC-series). A firmware update will be issued in the coming days to remediate this issue.
+ As of the week of August 12, we've identified a bug in the firmware of the ConnectX-6 InfiniBand NIC adapters in HBv3-series VMs that can cause MPI jobs to fail on a transient basis. This issue applies to all VM sizes within the HBv3-series. This issue doesn't apply to other H-series VMs (HB-series, HBv2-series, or HC-series). A firmware update will be issued in the coming days to remediate this issue.
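
As a hedged aside, one way to confirm which ConnectX-6 firmware a given HBv3 VM is currently running is to query the adapter from inside the guest. This is only a sketch and assumes the standard InfiniBand user-space tools (for example from MLNX_OFED or the `infiniband-diags` package) are already installed, as on the CentOS-HPC marketplace images:

```bash
# Query the firmware version reported by the InfiniBand adapter; requires the
# InfiniBand user-space tools to be installed (an assumption, not guaranteed
# on every image).
ibstat | grep -i "firmware version"

# Alternative using the verbs utilities:
ibv_devinfo | grep fw_ver
```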

## Memory Capacity on Standard_HB120rs_v2
- As of the week of December 6, 2021 we are temporarily reducing the amount of memory (RAM) exposed to the Standard_HB120rs_v2 VM size, otherwise known as [HBv2](../../hbv2-series.md). We are reducing the memory footprint to 432 GB from its current value of 456 GB (a 5.2% reduction). This reduction is temporary and the full memory capacity should be restored in early 2022. We are making this change to ensure to address an issue that can result in long VM deployment times or VM deployments for which not all devices function correctly. Note that the reduction in memory capacity does not affect VM performance.
+ As of the week of December 6, 2021, we've temporarily reduced the amount of memory (RAM) exposed to the Standard_HB120rs_v2 VM size, otherwise known as [HBv2](../../hbv2-series.md). The memory footprint is reduced to 432 GB from its previous value of 456 GB (a 5.2% reduction). This reduction is temporary and the full memory capacity should be restored in early 2022. We've made this change to address an issue that can result in long VM deployment times or VM deployments for which not all devices function correctly. The reduction in memory capacity doesn't affect VM performance.
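
As a quick, hedged sanity check of how much memory a given deployment actually exposes (nothing Azure-specific is assumed beyond a Linux guest):

```bash
# Report total memory visible to the VM; on an affected Standard_HB120rs_v2
# deployment this shows the temporarily reduced total rather than the full
# 456 GB.
free -g | awk '/^Mem:/ {print $2 " GiB total"}'
```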

## Cache topology on Standard_HB120rs_v3
- `lstopo` displays incorrect cache topology on the Standard_HB120rs_v3 VM size. It may display that there’s only 32 MB L3 per NUMA. However in practice there is indeed 120 MB L3 per NUMA as expected since the same 480 MB of L3 to the entire VM is available as with the other constrained-core HBv3 VM sizes. This is a cosmetic error in displaying the correct value, which should not impact workloads.
+ `lstopo` displays incorrect cache topology on the Standard_HB120rs_v3 VM size. It may display that there’s only 32 MB of L3 per NUMA domain. However, in practice there is 120 MB of L3 per NUMA domain as expected, since the same 480 MB of L3 is available to the entire VM as with the other constrained-core HBv3 VM sizes. This is a cosmetic error in the displayed value and should not impact workloads.
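
For reference, a minimal way to reproduce the display issue, assuming the `hwloc` package (which provides `lstopo`) is installed on the VM:

```bash
# Print the CPU/cache topology without I/O devices; on Standard_HB120rs_v3 the
# L3 size shown per NUMA domain may read 32 MB even though 120 MB is available.
lstopo-no-graphics --no-io
```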

## qp0 Access Restriction
To prevent low-level hardware access that can result in security vulnerabilities, Queue Pair 0 is not accessible to guest VMs. This should only affect actions typically associated with administration of the ConnectX InfiniBand NIC, and running some InfiniBand diagnostics like ibdiagnet, but not end-user applications.
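
As a rough illustration of the scope of this restriction (assuming the usual InfiniBand diagnostics tools are installed), QP0-based management tools may fail inside the guest while ordinary verbs-level queries continue to work:

```bash
# Verbs-level queries that don't need QP0 access still work in the guest:
ibv_devinfo
ibstat

# Tools that rely on QP0/SMP management access, such as ibdiagnet, may fail
# or report restricted access when run inside the VM.
ibdiagnet
```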

## MOFED installation on Ubuntu
On Ubuntu-18.04 based marketplace VM images with kernel versions `5.4.0-1039-azure #42` and newer, some older Mellanox OFED versions are incompatible, causing an increase in VM boot time of up to 30 minutes in some cases. This has been reported for both Mellanox OFED versions 5.2-1.0.4.0 and 5.2-2.2.0.0. The issue is resolved with Mellanox OFED 5.3-1.0.0.1.
- If it is necessary to use the incompatible OFED, a solution is to use the **Canonical:UbuntuServer:18_04-lts-gen2:18.04.202101290** marketplace VM image or older and not to update the kernel.
+ If it is necessary to use the incompatible OFED, a solution is to use the **Canonical:UbuntuServer:18_04-lts-gen2:18.04.202101290** marketplace VM image or older, and not to update the kernel.
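
If it helps, a hedged sketch of pinning a deployment to that image version with the Azure CLI and then holding the kernel; the resource group, VM name, and size below are placeholders, not part of the original guidance:

```bash
# Deploy from the specific marketplace image version (resource group, name,
# and size are placeholders).
az vm create \
  --resource-group myResourceGroup \
  --name myUbuntuHpcVm \
  --size Standard_HB120rs_v2 \
  --image Canonical:UbuntuServer:18_04-lts-gen2:18.04.202101290 \
  --generate-ssh-keys

# Inside the VM, optionally hold the kernel packages so a routine update doesn't
# bring in a kernel that's incompatible with the older Mellanox OFED.
sudo apt-mark hold linux-azure linux-image-azure linux-headers-azure
```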

## MPI QP creation errors
- If in the midst of running any MPI workloads, InfiniBand QP creation errors such as shown below, are thrown, we suggest rebooting the VM and re-trying the workload. This issue will be fixed in the future.
+ If InfiniBand QP creation errors such as the ones shown below are thrown while running any MPI workloads, we suggest rebooting the VM and retrying the workload. This issue will be fixed in the future.

```bash
ib_mlx5_dv.c:150 UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Invalid argument
@@ -47,7 +47,7 @@ max_qp: 4096

## Accelerated Networking on HB, HC, HBv2, HBv3 and NDv2

- [Azure Accelerated Networking](https://azure.microsoft.com/blog/maximize-your-vm-s-performance-with-accelerated-networking-now-generally-available-for-both-windows-and-linux/) is now available on the RDMA and InfiniBand capable and SR-IOV enabled VM sizes [HB](../../hb-series.md), [HC](../../hc-series.md), [HBv2](../../hbv2-series.md), [HBv3](../../hbv3-series.md) and [NDv2](../../ndv2-series.md). This capability now allows enhanced throughout (up to 30 Gbps) and latencies over the Azure Ethernet network. Though this is separate from the RDMA capabilities over the InfiniBand network, some platform changes for this capability may impact behavior of certain MPI implementations when running jobs over InfiniBand. Specifically the InfiniBand interface on some VMs may have a slightly different name (mlx5_1 as opposed to earlier mlx5_0) and this may require tweaking of the MPI command lines especially when using the UCX interface (commonly with OpenMPI and HPC-X). The simplest solution currently may be to use the latest HPC-X on the CentOS-HPC VM images or disable Accelerated Networking if not required.
+ [Azure Accelerated Networking](https://azure.microsoft.com/blog/maximize-your-vm-s-performance-with-accelerated-networking-now-generally-available-for-both-windows-and-linux/) is now available on the RDMA and InfiniBand capable, SR-IOV enabled VM sizes [HB](../../hb-series.md), [HC](../../hc-series.md), [HBv2](../../hbv2-series.md), [HBv3](../../hbv3-series.md) and [NDv2](../../ndv2-series.md). This capability now allows enhanced throughput (up to 30 Gbps) and improved latencies over the Azure Ethernet network. Though this is separate from the RDMA capabilities over the InfiniBand network, some platform changes for this capability may impact behavior of certain MPI implementations when running jobs over InfiniBand. Specifically, the InfiniBand interface on some VMs may have a slightly different name (mlx5_1 as opposed to the earlier mlx5_0). This may require tweaking of the MPI command lines, especially when using the UCX interface (commonly with OpenMPI and HPC-X). The simplest solution currently may be to use the latest HPC-X on the CentOS-HPC VM images, or to disable Accelerated Networking if it's not required.
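
As a hedged example of the kind of command-line tweak this refers to when using UCX with HPC-X or OpenMPI (the rank count and application name are placeholders):

```bash
# List the InfiniBand device names present on the VM first:
ibstat -l

# Then point UCX explicitly at the device/port that is actually present
# (mlx5_1:1 here, when the interface shows up as mlx5_1 instead of mlx5_0).
mpirun -np 120 -x UCX_NET_DEVICES=mlx5_1:1 ./my_mpi_app
```
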
More details are available in this [TechCommunity article](https://techcommunity.microsoft.com/t5/azure-compute/accelerated-networking-on-hb-hc-and-hbv2/ba-p/2067965) with instructions on how to address any observed issues.

## InfiniBand driver installation on non-SR-IOV VMs
@@ -57,12 +57,12 @@ InfiniBand can be configured on the SR-IOV enabled VM sizes with the OFED driver

## Duplicate MAC with cloud-init with Ubuntu on H-series and N-series VMs

- There is a known issue with cloud-init on Ubuntu VM images as it tries to bring up the IB interface. This can happen either on VM reboot or when trying to create a VM image after generalization. The VM boot logs may show an error like so:
+ There's a known issue with cloud-init on Ubuntu VM images when it tries to bring up the IB interface. This can happen either on VM reboot or when trying to create a VM image after generalization. The VM boot logs may show an error like the following:
```console
“Starting Network Service...RuntimeError: duplicate mac found! both 'eth1' and 'ib0' have mac”.
```

- This 'duplicate MAC with cloud-init on Ubuntu" is a known issue. This will be resolved in newer kernels. IF the issue is encountered, the workaround is:
+ This 'duplicate MAC with cloud-init on Ubuntu' is a known issue that will be resolved in newer kernels. If this issue is encountered, the workaround is:
1) Deploy the (Ubuntu 18.04) marketplace VM image
2) Install the necessary software packages to enable IB ([instruction here](https://techcommunity.microsoft.com/t5/azure-compute/configuring-infiniband-for-ubuntu-hpc-and-gpu-vms/ba-p/1221351))
3) Edit waagent.conf to change EnableRDMA=y (a rough sketch of steps 2 and 3 follows this list)
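
A rough sketch of steps 2 and 3 on an Ubuntu 18.04 VM follows. The package names are illustrative only (follow the linked instructions for the actual InfiniBand driver installation), and the exact key name in waagent.conf may vary by agent version:

```bash
# Step 2 (illustrative packages only; the linked TechCommunity article has the
# authoritative InfiniBand driver installation steps).
sudo apt-get update
sudo apt-get install -y rdma-core infiniband-diags ibverbs-utils

# Step 3: enable RDMA in the Azure Linux agent configuration and restart the
# agent. The key may appear as OS.EnableRDMA=y depending on the agent version.
sudo sed -i 's/^#\{0,1\} *OS.EnableRDMA=y/OS.EnableRDMA=y/' /etc/waagent.conf
grep -i EnableRDMA /etc/waagent.conf   # verify the setting took effect
sudo systemctl restart walinuxagent
```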
