MicrosoftDocs
diff --git a/articles/operator-nexus/media/interface-out-pkts.png b/articles/operator-nexus/media/interface-out-pkts.png
92.3 KB
diff --git a/articles/operator-nexus/media/navigate-mgmt-switch-portal.png b/articles/operator-nexus/media/navigate-mgmt-switch-portal.png
55.1 KB
diff --git a/articles/operator-nexus/media/navigate-network-fabric-portal.png b/articles/operator-nexus/media/navigate-network-fabric-portal.png
82.9 KB
diff --git a/articles/operator-nexus/troubleshoot-failed-volume-attachments.md b/articles/operator-nexus/troubleshoot-failed-volume-attachments.md
Lines changed: 35 additions & 0 deletions
diff --git a/articles/operator-nexus/troubleshoot-nfs-unhealthy.md b/articles/operator-nexus/troubleshoot-nfs-unhealthy.md
Lines changed: 26 additions & 0 deletions
diff --git a/articles/operator-nexus/troubleshoot-resource-health-alerts.md b/articles/operator-nexus/troubleshoot-resource-health-alerts.md
Lines changed: 10 additions & 2 deletions
diff --git a/articles/operator-nexus/troubleshoot-storage-control-plane-disconnected.md b/articles/operator-nexus/troubleshoot-storage-control-plane-disconnected.md
Lines changed: 50 additions & 0 deletions
diff --git a/articles/operator-nexus/troubleshoot-unhealthy-csi.md b/articles/operator-nexus/troubleshoot-unhealthy-csi.md
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,35 @@
+---
+title: Troubleshooting failed volume attachments
+description: Troubleshooting Azure Resource Health alerts about failed volume attachments
+author: jensheasby
+ms.author: jensheasby
+ms.date: 07/21/2025
+ms.topic: troubleshooting
+ms.service: azure-operator-nexus
+---
+
+# Troubleshooting failed volume attachments - Azure Resource Health
+
+This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
+reporting failed volume attachments in Azure Resource Health.
+
+## Symptoms
+
+This alert indicates that volumes are failing to attach in the undercloud. This can delay bringing up
+workloads in the tenant layer or migrating existing workloads to a new node. If the cluster is marked as
+degraded, at least one volume is failing to attach. In this case, the problem may be limited to that
+specific volume, and the impact radius is small. If the cluster is marked as unhealthy, a high percentage
+of volumes on at least one node are failing to attach, which indicates a more serious
+incident.
+
+## Troubleshooting
+
+This alert may be seen at the same time as the `ControlPlaneStorageConnectivityUnhealthyVIP` alert. In this
+case, the lost connectivity on the storage control plane is likely the cause of the failed attachments. You
+should follow the [troubleshooting guide for that issue]. If this alert persists after you resolve that incident,
+return to this guide.
+
+If control plane connectivity issues are not the root cause, you should raise a ticket with Microsoft, quoting the
+text of this troubleshooting guide, and the Azure resource ID of the affected cluster.
+
+[troubleshooting guide for that issue]: ./troubleshoot-storage-control-plane-disconnected.md
@@ -0,0 +1,26 @@
+---
+title: Troubleshooting unhealthy NFS pods
+description: Troubleshooting Azure Resource Health alerts about NFS
+author: jensheasby
+ms.author: jensheasby
+ms.date: 07/21/2025
+ms.topic: troubleshooting
+ms.service: azure-operator-nexus
+---
+
+# Troubleshooting unhealthy NFS pods - Azure Resource Health
+
+This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
+reporting unhealthy NFS pods in Azure Resource Health.
+
+## Symptoms
+
+This alert indicates problems with NFS in the cluster. NFS is responsible for control plane and data plane
+operations for `nexus-shared` volumes in the tenant layer. Therefore, if a cluster has unhealthy NFS pods,
+existing `nexus-shared` volumes may experience data plane disruption, and new `nexus-shared` volumes may
+fail to be created.
+
+## Troubleshooting
+
+You should raise a ticket with Microsoft, quoting the text of this troubleshooting guide, and the Azure
+resource ID of the affected cluster.
@@ -21,12 +21,20 @@ These alerts are generated based on the status of the resource and its dependenc
 ## Cluster
 
 | Resource Health Event Name                                                                             | Troubleshooting Guide                                                 |
-|--------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
-| `1PExtensionsFailedInstall`                                                                            | [Requires to contact support](#please-contact-support) |
+| ------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------- |
+| `1PExtensionsFailedInstall`                                                                            | [Requires to contact support](#please-contact-support)                |
 | `ClusterHeartbeatConnectionStatusDisconnectedClusterManagerOperationsAreAffectedPossibleNetworkIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected] |
 | `ClusterHeartbeatConnectionStatusTimedoutPossiblePerformanceIssues`                                    | [Troubleshoot Cluster heartbeat connection status shows disconnected] |
+| `AttachmentFailuresDegraded` and `AttachmentFailuresUnhealthy`                                         | [Troubleshoot failed volume attachments]                              |
+| `NFSPodDegraded` and `NFSPodUnhealthy`                                                                 | [Troubleshoot NFS unhealthy]                                          |
+| `CSIControllerUnhealthy`, `CSINodeDegraded` and `CSINodeUnhealthy`                                     | [Troubleshoot unhealthy CSI (storage)]                                |
+| `ControlPlaneStorageConnectivityDegraded` and `ControlPlaneStorageConnectivityUnhealthyVIP`            | [Troubleshoot storage control plane disconnected]                     |
 
 [Troubleshoot Cluster heartbeat connection status shows disconnected]: ./troubleshoot-cluster-heartbeat-connection-status-disconnected.md
+[Troubleshoot failed volume attachments]: ./troubleshoot-failed-volume-attachments.md
+[Troubleshoot NFS unhealthy]: ./troubleshoot-nfs-unhealthy.md
+[Troubleshoot unhealthy CSI (storage)]: ./troubleshoot-unhealthy-csi.md
+[Troubleshoot storage control plane disconnected]: ./troubleshoot-storage-control-plane-disconnected.md
 
 ## Please contact support
 
 
@@ -0,0 +1,50 @@
+---
+title: Troubleshooting storage control plane connectivity issues
+description: Troubleshooting Azure Resource Health alerts about control plane connectivity issues.
+author: jensheasby
+ms.author: jensheasby
+ms.date: 07/21/2025
+ms.topic: troubleshooting
+ms.service: azure-operator-nexus
+---
+
+# Troubleshooting control plane connectivity issues - Azure Resource Health
+
+This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
+reporting issues with control plane connectivity in Azure Resource Health.
+
+## Symptoms
+
+This alert indicates that there are issues connecting to the storage control plane from the cluster. The two
+categories of alert have different symptoms:
+
+- If the cluster is marked as degraded, redundancy to the storage control plane has been lost because one of
+  the controllers is experiencing connectivity issues. The cluster continues to function, but you should fix
+  this issue quickly to restore redundancy.
+- If the cluster is marked as unhealthy, the storage control plane is completely unreachable from the cluster.
+  New workloads that depend on `nexus-volume` volumes will not come up, and existing workloads that rely on
+  `nexus-volume` volumes cannot be migrated to a new node. Additionally, new cloud services networks cannot
+  be created.
+
+## Troubleshooting
+
+The cluster may be marked as degraded during a storage appliance upgrade, since these upgrades take controllers
+offline one by one. The cluster should return to healthy status after the upgrade is complete.
+
+If an upgrade is not the root cause, you should check if there are any issues with the management switches in
+the aggregator rack. Follow these steps to check for issues:
+
+1. Start on the cluster (Operator Nexus) resource overview page. Click the link to the network fabric resource.
+   :::image type="content" source="media/navigate-network-fabric-portal.png" alt-text="Screenshot of a cluster resource, with the network fabric link highlighted." lightbox="media/navigate-network-fabric-portal.png":::
+2. Go to `Infrastructure->Devices`, and search for the aggregator rack management switches. Ensure they are successfully
+   provisioned and enabled.
+   :::image type="content" source="media/navigate-mgmt-switch-portal.png" alt-text="Screenshot of the Infrastructure tab of a network fabric resource." lightbox="media/navigate-mgmt-switch-portal.png":::
+3. Click on a management switch, and go to the `Monitoring->Metrics` tab. Select `Interface Out Pkts`, then apply splitting
+   on the `Interface Name` dimension.
+   :::image type="content" source="media/interface-out-pkts.png" alt-text="Screenshot of a metric showing the outward packets of a management switch." lightbox="media/interface-out-pkts.png":::
+4. Check for any interfaces where the packet count has suddenly dropped to zero. If you find any, reseat the affected
+   cables.
+5. Repeat the check for the second management switch.
+
+If upgrade or management switch problems are not the root cause, you should raise a ticket with Microsoft, quoting
+the text of this troubleshooting guide.
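The check in step 4 of the management switch procedure above (finding interfaces whose outbound packet count suddenly drops to zero) can also be sketched offline. The following is a minimal Python illustration, not part of the guide or of any Azure tooling: it assumes the `Interface Out Pkts` samples have already been exported into a dictionary keyed by interface name, and the function name, the `min_active` threshold, and the sample interface names are all hypothetical.

```python
from typing import Dict, List


def find_dropped_interfaces(
    samples: Dict[str, List[float]], min_active: int = 3
) -> List[str]:
    """Return interfaces whose packet count was non-zero and then fell to zero.

    `samples` maps a (hypothetical) interface name to its outbound packet
    counts in time order. An interface is flagged when at least `min_active`
    consecutive non-zero samples are immediately followed by zeros that
    continue through the final sample.
    """
    dropped = []
    for name, counts in samples.items():
        # Locate the start of the trailing run of zero samples.
        end = len(counts)
        while end > 0 and counts[end - 1] == 0:
            end -= 1
        if end == len(counts):
            continue  # the latest sample is non-zero: still sending traffic
        # Count the non-zero samples immediately before the zeros.
        start = end
        while start > 0 and counts[start - 1] > 0:
            start -= 1
        if end - start >= min_active:
            dropped.append(name)
    return sorted(dropped)


# Illustrative data only; real interface names and values will differ.
example = {
    "Ethernet1": [120, 130, 125, 0, 0],   # active, then silent
    "Ethernet2": [90, 95, 100, 98, 97],   # active throughout
    "Ethernet3": [0, 0, 0, 0, 0],         # never active (unused port)
}
suspect = find_dropped_interfaces(example)
```

Interfaces that were never active are deliberately ignored, since an unused port reporting zero packets does not indicate a cabling fault. Any interface returned here would be a candidate for the cable reseat described in step 4.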
@@ -0,0 +1,26 @@
+---
+title: Troubleshooting unhealthy CSI (storage)
+description: Troubleshooting Azure Resource Health alerts about unhealthy CSI pods (storage)
+author: jensheasby
+ms.author: jensheasby
+ms.date: 07/21/2025
+ms.topic: troubleshooting
+ms.service: azure-operator-nexus
+---
+
+# Troubleshooting unhealthy CSI pods (storage) - Azure Resource Health
+
+This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
+reporting unhealthy Container Storage Interface (CSI) pods in Azure Resource Health.
+
+## Symptoms
+
+This alert indicates that there are problems with the CSI pods in the undercloud cluster. These pods are
+responsible for control plane operations for volumes in the undercloud. If these pods are unhealthy, workloads
+relying on `nexus-volume` storage may fail to come up, or existing workloads may not be able to be migrated.
+You may also experience issues provisioning new cloud services networks.
+
+## Troubleshooting
+
+You should raise a support ticket with Microsoft, quoting the text of this troubleshooting guide, and the Azure
+resource ID of the affected cluster.