|
| 1 | +--- |
| 2 | +title: Troubleshooting storage control plane connectivity issues |
| 3 | +description: Troubleshooting Azure Resource Health alerts about control plane connectivity issues. |
| 4 | +author: jensheasby |
| 5 | +ms.author: jensheasby |
| 6 | +ms.date: 07/21/2025 |
| 7 | +ms.topic: troubleshooting |
| 8 | +ms.service: azure-operator-nexus |
| 9 | +--- |
| 10 | + |
| 11 | +# Troubleshooting control plane connectivity issues - Azure Resource Health |
| 12 | + |
| 13 | +This article provides troubleshooting advice and escalation methods for Operator Nexus clusters that |
| 14 | +report issues with storage control plane connectivity in Azure Resource Health. |
| 15 | + |
| 16 | +## Symptoms |
| 17 | + |
| 18 | +This alert indicates that there are issues connecting to the storage control plane from the cluster. The two |
| 19 | +categories of alert have different symptoms: |
| 20 | + |
| 21 | +- If the cluster is marked as degraded, redundancy to the storage control plane has been lost because |
| 22 | + one of the controllers is experiencing connectivity issues. The cluster will continue to function, |
| 23 | + but this issue should be fixed quickly to restore redundancy to the system. |
| 24 | +- If the cluster is marked as unhealthy, the storage control plane is completely unreachable from |
| 25 | + the cluster. New workloads that depend on `nexus-volume` volumes will not come up, and existing workloads |
| 26 | + that rely on `nexus-volume` volumes cannot be migrated to a new node. Additionally, new cloud |
| 27 | + services networks cannot be created. |
| 28 | + |
| 29 | +## Troubleshooting |
| 30 | + |
| 31 | +The cluster may be marked as degraded during a storage appliance upgrade, since these upgrades take controllers |
| 32 | +offline one by one. The cluster should return to healthy status after the upgrade is complete. |
| 33 | + |
| 34 | +If an upgrade is not the root cause, you should check if there are any issues with the management switches in |
| 35 | +the aggregator rack. Follow these steps to check for issues: |
| 36 | + |
| 37 | +1. Start on the cluster (Operator Nexus) resource overview page. Click the link to the network fabric resource. |
| 38 | + :::image type="content" source="media/navigate-network-fabric-portal.png" alt-text="Screenshot of a cluster resource, with the network fabric link highlighted." lightbox="media/navigate-network-fabric-portal.png"::: |
| 39 | +2. Go to `Infrastructure->Devices`, and search for the aggregator rack management switches. Ensure they are successfully |
| 40 | + provisioned and enabled. |
| 41 | + :::image type="content" source="media/navigate-mgmt-switch-portal.png" alt-text="Screenshot of the Infrastructure tab of a network fabric resource." lightbox="media/navigate-mgmt-switch-portal.png"::: |
| 42 | +3. Click on a management switch, and go to the `Monitoring->Metrics` tab. Select `Interface Out Pkts`, then apply splitting |
| 43 | + on the `Interface Name` dimension. |
| 44 | + :::image type="content" source="media/interface-out-pkts.png" alt-text="Screenshot of a metric showing the outward packets of a management switch." lightbox="media/interface-out-pkts.png"::: |
| 45 | +4. Check for any interfaces where the packet count has suddenly dropped to zero. If you find any, reseat the affected |
| 46 | + cables. |
| 47 | +5. Repeat the check for the second management switch. |
| 48 | + |
| 49 | +If neither an upgrade nor a management switch problem is the root cause, raise a ticket with Microsoft, quoting |
| 50 | +the text of this troubleshooting guide. |
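
The check in step 4 (looking for interfaces whose outbound packet count suddenly drops to zero) can also be scripted once the metric data has been exported. The following is a minimal Python sketch and is not part of the Operator Nexus tooling. It assumes the `Interface Out Pkts` samples have already been exported into a mapping from interface name to a time-ordered list of packet counts. The function name `find_dropped_interfaces`, the `zero_run` threshold, and the interface names are hypothetical and used only for illustration.

```python
from typing import Dict, List, Sequence


def find_dropped_interfaces(
    samples: Dict[str, Sequence[float]],
    zero_run: int = 3,
) -> List[str]:
    """Return interfaces that carried traffic and then fell to zero.

    An interface is flagged when its last `zero_run` samples are all zero
    and at least one earlier sample is non-zero. `zero_run` is an assumed
    threshold; tune it to the metric's time granularity.
    """
    flagged = []
    for name, series in samples.items():
        if len(series) <= zero_run:
            continue  # not enough history to judge this interface
        tail = series[-zero_run:]   # most recent samples
        head = series[:-zero_run]   # everything before them
        if all(v == 0 for v in tail) and any(v > 0 for v in head):
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical exported samples, oldest first.
    example = {
        "Ethernet1": [120, 118, 125, 0, 0, 0],  # traffic stopped
        "Ethernet2": [90, 95, 92, 91, 94, 93],  # healthy
        "Ethernet3": [0, 0, 0, 0, 0, 0],        # never carried traffic
    }
    print(find_dropped_interfaces(example))
```

Interfaces that never carried traffic are deliberately not flagged, since step 4 is about a sudden drop rather than unused ports. Any interface the sketch reports is only a candidate for the cable reseat described in step 4 and should still be confirmed in the portal.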