Commit bfb1501

First draft of storage cluster TSGs
1 parent 6b26227 commit bfb1501

8 files changed: 147 additions and 2 deletions
3 binary files (92.3 KB, 55.1 KB, 82.9 KB), not rendered
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
---
2+
title: Troubleshooting failed volume attachments
3+
description: Troubleshooting Azure Resource Health alerts about failed volume attachments
4+
author: jensheasby
5+
ms.author: jensheasby
6+
ms.date: 07/21/2025
7+
ms.topic: troubleshooting
8+
ms.service: azure-operator-nexus
9+
---
10+
11+
# Troubleshooting failed volume attachments - Azure Resource Health
12+
13+
This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
14+
reporting failed volume attachments in Azure Resource Health.
15+
16+
## Symptoms
17+
18+
This alert indicates that volumes are failing to attach in the undercloud. This can lead to delays in
19+
bringing up workloads in the tenant layer, or migrating existing workloads to a new node. If the cluster
20+
has been marked as degraded, this implies at least one volume is failing to attach. In this case the problem
21+
may be limited to this specific volume, and the impact radius is small. If the cluster has been marked as
22+
unhealthy, a high percentage of volumes on at least one node are failing to attach, indicating a more serious
23+
incident.
24+
25+
## Troubleshooting
26+
27+
This alert may be seen at the same time as the `ControlPlaneStorageConnectivityUnhealthyVIP` alert. In this
28+
case, the lost connectivity on the storage control plane is likely the cause of the failed attachments. You
29+
should follow the [troubleshooting guide for that issue]. If this alert persists after you resolve that incident,
30+
return to this guide.
31+
32+
If control plane connectivity issues are not the root cause, you should raise a ticket with Microsoft, quoting the
33+
text of this troubleshooting guide, and the Azure resource ID of the affected cluster.
34+
35+
[troubleshooting guide for that issue]: ./troubleshoot-storage-control-plane-disconnected.md
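
The triage order described in this guide can be sketched as a small decision function. This is only an illustration: the alert names come from the resource health table in this commit, while the function name and the returned action strings are hypothetical and not part of any Operator Nexus API.

```python
from typing import Iterable

# Alert name as written in the guide above.
CONNECTIVITY_ALERT = "ControlPlaneStorageConnectivityUnhealthyVIP"


def next_step_for_attachment_failure(active_alerts: Iterable[str]) -> str:
    """Pick the next action for a failed-volume-attachment alert.

    The returned strings are illustrative labels, not real commands.
    """
    if CONNECTIVITY_ALERT in set(active_alerts):
        # Lost storage control plane connectivity is the likely cause,
        # so that guide should be worked through first.
        return "follow-storage-control-plane-guide"
    # Otherwise the guide's only remaining step is escalation.
    return "raise-microsoft-ticket"
```

If the attachment alert is still active after the connectivity incident is resolved, calling the function again with the updated alert list falls through to the escalation step.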
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
---
2+
title: Troubleshooting unhealthy NFS pods
3+
description: Troubleshooting Azure Resource Health alerts about NFS
4+
author: jensheasby
5+
ms.author: jensheasby
6+
ms.date: 07/21/2025
7+
ms.topic: troubleshooting
8+
ms.service: azure-operator-nexus
9+
---
10+
11+
# Troubleshooting unhealthy NFS pods - Azure Resource Health
12+
13+
This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
14+
reporting unhealthy NFS pods in Azure Resource Health.
15+
16+
## Symptoms
17+
18+
This alert indicates problems with NFS in the cluster. NFS is responsible for control plane and data plane
19+
operations for `nexus-shared` volumes in the tenant layer. Therefore, if a cluster has unhealthy NFS pods,
20+
existing `nexus-shared` volumes may experience data plane disruption, and new `nexus-shared` volumes may
21+
fail to be created.
22+
23+
## Troubleshooting
24+
25+
You should raise a ticket with Microsoft, quoting the text of this troubleshooting guide, and the Azure
26+
resource ID of the affected cluster.

articles/operator-nexus/troubleshoot-resource-health-alerts.md

Lines changed: 10 additions & 2 deletions
@@ -21,12 +21,20 @@ These alerts are generated based on the status of the resource and its dependenc
2121
## Cluster
2222

2323
| Resource Health Event Name | Troubleshooting Guide |
24-
|--------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
25-
| `1PExtensionsFailedInstall` | [Requires to contact support](#please-contact-support) |
24+
| ------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------- |
25+
| `1PExtensionsFailedInstall` | [Requires to contact support](#please-contact-support) |
2626
| `ClusterHeartbeatConnectionStatusDisconnectedClusterManagerOperationsAreAffectedPossibleNetworkIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected] |
2727
| `ClusterHeartbeatConnectionStatusTimedoutPossiblePerformanceIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected] |
28+
| `AttachmentFailuresDegraded` and `AttachmentFailuresUnhealthy` | [Troubleshoot failed volume attachments] |
29+
| `NFSPodDegraded` and `NFSPodUnhealthy` | [Troubleshoot NFS unhealthy] |
30+
| `CSIControllerUnhealthy`, `CSINodeDegraded` and `CSINodeUnhealthy` | [Troubleshoot unhealthy CSI (storage)] |
31+
| `ControlPlaneStorageConnectivityDegraded` and `ControlPlaneStorageConnectivityUnhealthyVIP` | [Troubleshoot storage control plane disconnected] |
2832

2933
[Troubleshoot Cluster heartbeat connection status shows disconnected]: ./troubleshoot-cluster-heartbeat-connection-status-disconnected.md
34+
[Troubleshoot failed volume attachments]: ./troubleshoot-failed-volume-attachments.md
35+
[Troubleshoot NFS unhealthy]: ./troubleshoot-nfs-unhealthy.md
36+
[Troubleshoot unhealthy CSI (storage)]: ./troubleshoot-unhealthy-csi.md
37+
[Troubleshoot storage control plane disconnected]: ./troubleshoot-storage-control-plane-disconnected.md
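
For tooling that needs to route an alert to its guide, the rows added to the table above could be mirrored in a lookup like the following. This is a hedged sketch: the event names and relative paths are copied from this diff, but the dictionary and the `guide_for` helper are hypothetical.

```python
from typing import Optional

# Event name -> guide path, copied from the table rows added in this commit.
STORAGE_GUIDES = {
    "AttachmentFailuresDegraded": "./troubleshoot-failed-volume-attachments.md",
    "AttachmentFailuresUnhealthy": "./troubleshoot-failed-volume-attachments.md",
    "NFSPodDegraded": "./troubleshoot-nfs-unhealthy.md",
    "NFSPodUnhealthy": "./troubleshoot-nfs-unhealthy.md",
    "CSIControllerUnhealthy": "./troubleshoot-unhealthy-csi.md",
    "CSINodeDegraded": "./troubleshoot-unhealthy-csi.md",
    "CSINodeUnhealthy": "./troubleshoot-unhealthy-csi.md",
    "ControlPlaneStorageConnectivityDegraded": "./troubleshoot-storage-control-plane-disconnected.md",
    "ControlPlaneStorageConnectivityUnhealthyVIP": "./troubleshoot-storage-control-plane-disconnected.md",
}


def guide_for(event_name: str) -> Optional[str]:
    """Return the guide path for a storage event, or None if it is not listed."""
    return STORAGE_GUIDES.get(event_name)
```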
3038

3139
## Please contact support
3240

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
---
2+
title: Troubleshooting storage control plane connectivity issues
3+
description: Troubleshooting Azure Resource Health alerts about control plane connectivity issues.
4+
author: jensheasby
5+
ms.author: jensheasby
6+
ms.date: 07/21/2025
7+
ms.topic: troubleshooting
8+
ms.service: azure-operator-nexus
9+
---
10+
11+
# Troubleshooting control plane connectivity issues - Azure Resource Health
12+
13+
This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
14+
reporting issues with control plane connectivity in Azure Resource Health.
15+
16+
## Symptoms
17+
18+
This alert indicates that there are issues connecting to the storage control plane from the cluster. The two
19+
categories of alert have different symptoms:
20+
21+
- If the cluster is marked as degraded, this means there has been a loss of redundancy to the storage control
22+
plane. This means that one of the controllers is experiencing connectivity issues. The cluster will continue
23+
to function, but this issue should be quickly fixed to restore redundancy to the system.
24+
- If the cluster is marked as unhealthy, this means the storage control plane is completely unreachable from
25+
the cluster. New workloads which depend on `nexus-volume` volumes will not come up, and existing workloads
26+
which rely on `nexus-volume` volumes will not be able to be migrated to a new node. Additionally, new cloud
27+
services networks cannot be created.
28+
29+
## Troubleshooting
30+
31+
The cluster may be marked as degraded during a storage appliance upgrade, since these upgrades take controllers
32+
offline one by one. The cluster should return to healthy status after the upgrade is complete.
33+
34+
If an upgrade is not the root cause, you should check if there are any issues with the management switches in
35+
the aggregator rack. Follow these steps to check for issues:
36+
37+
1. Start on the cluster (Operator Nexus) resource overview page. Click the link to the network fabric resource.
38+
:::image type="content" source="media/navigate-network-fabric-portal.png" alt-text="Screenshot of a cluster resource, with the network fabric link highlighted." lightbox="media/navigate-network-fabric-portal.png":::
39+
2. Go to `Infrastructure->Devices`, and search for the aggregator rack management switches. Ensure they are successfully
40+
provisioned and enabled.
41+
:::image type="content" source="media/navigate-mgmt-switch-portal.png" alt-text="Screenshot of the Infrastructure tab of a network fabric resource." lightbox="media/navigate-mgmt-switch-portal.png":::
42+
3. Click on a management switch, and go to the `Monitoring->Metrics` tab. Select `Interface Out Pkts`, then apply splitting
43+
on the `Interface Name` dimension.
44+
:::image type="content" source="media/interface-out-pkts.png" alt-text="Screenshot of a metric showing the outward packets of a management switch." lightbox="media/interface-out-pkts.png":::
45+
4. Check for any interfaces where the packet count has suddenly dropped to zero. If you find any, you should reseat any affected
46+
cables.
47+
5. Repeat the check for the second management switch.
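
If the `Interface Out Pkts` samples have been exported from the portal, the check in step 4 could be approximated in code. The following is a minimal sketch under stated assumptions: the input is a plain mapping of interface name to time-ordered packet counts, the function name, the `min_active_samples` parameter, and the interface names in the usage example are all hypothetical, and "suddenly dropped" is interpreted as a zero sample that follows several non-zero samples.

```python
from typing import Dict, List, Sequence


def interfaces_dropped_to_zero(
    series: Dict[str, Sequence[float]],
    min_active_samples: int = 3,
) -> List[str]:
    """Return interfaces whose latest sample is zero after recent traffic.

    `series` maps an interface name to packet counts in time order.
    An interface is flagged only if the `min_active_samples` readings
    immediately before the latest one were all above zero, so links
    that were idle throughout are not reported.
    """
    flagged = []
    for name, samples in series.items():
        if len(samples) < min_active_samples + 1:
            continue  # not enough history to call the drop "sudden"
        *history, latest = samples
        recent = history[-min_active_samples:]
        if latest == 0 and all(value > 0 for value in recent):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical interface names and counts, for illustration only.
example = {
    "Ethernet1": [120, 130, 125, 0],  # active, then zero
    "Ethernet2": [0, 0, 0, 0],        # idle throughout
    "Ethernet3": [50, 60, 55, 58],    # still active
}
flagged_interfaces = interfaces_dropped_to_zero(example)
```

Any interface returned by a check like this is a candidate for the cable reseat described in step 4. The metrics view in the portal remains the source of truth.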
48+
49+
If upgrade or management switch problems are not the root cause, you should raise a ticket with Microsoft, quoting
50+
the text of this troubleshooting guide.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
---
2+
title: Troubleshooting unhealthy CSI (storage)
3+
description: Troubleshooting Azure Resource Health alerts about unhealthy CSI pods (storage)
4+
author: jensheasby
5+
ms.author: jensheasby
6+
ms.date: 07/21/2025
7+
ms.topic: troubleshooting
8+
ms.service: azure-operator-nexus
9+
---
10+
11+
# Troubleshooting unhealthy CSI pods (storage) - Azure Resource Health
12+
13+
This article provides troubleshooting advice and escalation methods for Operator Nexus clusters which are
14+
reporting unhealthy Container Storage Interface (CSI) pods in Azure Resource Health.
15+
16+
## Symptoms
17+
18+
This alert indicates that there are problems with the CSI pods in the undercloud cluster. These pods are
19+
responsible for control plane operations for volumes in the undercloud. If these pods are unhealthy, workloads
20+
relying on `nexus-volume` storage may fail to come up, or existing workloads may not be able to be migrated.
21+
You may also experience issues provisioning new cloud services networks.
22+
23+
## Troubleshooting
24+
25+
You should raise a support ticket with Microsoft, quoting the text of this troubleshooting guide, and the Azure
26+
resource ID of the affected cluster.
