|
| 1 | +--- |
| 2 | +title: Troubleshooting an Unhealthy or Degraded Storage Appliance |
| 3 | +description: Troubleshooting a Storage Appliance that has Azure Resource Health alerts |
| 4 | +author: jensheasby |
| 5 | +ms.author: jensheasby |
| 6 | +ms.date: 06/10/2025 |
| 7 | +ms.topic: troubleshooting |
| 8 | +ms.service: azure-operator-nexus |
| 9 | +--- |
| 10 | + |
| 11 | +# Troubleshooting an unhealthy or degraded storage appliance |
| 12 | + |
| 13 | +This article provides troubleshooting advice and escalation methods for Storage Appliances that are |
| 14 | +unhealthy or degraded. |
| 15 | + |
| 16 | +## Capacity threshold reached |
| 17 | + |
| 18 | +The health events in this list indicate that the appliance is nearing capacity: |
| 19 | + |
| 20 | +- `SACapacityThresholdDegraded`, which means the Storage Appliance is at 80% capacity or greater. |
| 21 | +- `SACapacityThresholdUnhealthy`, which means the Storage Appliance is at 90% capacity or greater. |
| 22 | + |
| 23 | +You can see the current usage of the appliance by navigating to the Storage Appliance in the portal, |
| 24 | +navigating to the `Monitoring > Metrics` tab, and selecting `Nexus Storage Array Space Utilization` from |
| 25 | +the `Metric` dropdown. |
| 26 | + |
| 27 | +:::image type="content" source="media/storage-metrics-utilization.png" alt-text="Screenshot of a metric showing the percentage utilization of a Storage Appliance." lightbox="media/storage-metrics-utilization.png"::: |
| 28 | + |
| 29 | +To resolve these health events, free up space on the Storage Appliance or add capacity. You can |
| 30 | +do so by: |
| 31 | + |
| 32 | +- Moving some workloads to another cluster, if one is available and your workload supports this |
| 33 | + operation: |
| 34 | + - Re-create the workload on a different cluster (Operator Nexus). |
| 35 | + - Perform the steps required to migrate traffic to the new cluster (the specific steps depend |
| 36 | + on your workload). |
| 37 | + - Delete the workload from the current cluster. |
| 38 | +- Adding array expansions, if you have empty array expansion spaces in your aggregator rack. Speak to |
| 39 | + your storage vendor for instructions. |
| 40 | + |
| 41 | +You can confirm that utilization is reduced by checking the metric again. |
| 42 | + |
| 43 | +Deleted volumes can take up to 24 hours to be eradicated from the appliance, so utilization might not |
| 44 | +drop immediately. Carry out deletions gradually to avoid worsening the problem. |
| 45 | + |
| 46 | +## Active alerts |
| 47 | + |
| 48 | +The health events in this list indicate that the appliance has active alerts: |
| 49 | + |
| 50 | +- `StorageApplianceActiveAlertsWarning`, which means there are one or more open warning alerts on the |
| 51 | + Storage Appliance. Warning alerts indicate that there is an issue that requires attention, but the Storage |
| 52 | + Appliance should continue to function. |
| 53 | +- `StorageApplianceActiveAlertsCritical`, which means there are one or more open critical alerts on the |
| 54 | + Storage Appliance. Critical alerts indicate a severe problem with the Storage Appliance that may impact |
| 55 | + functionality. |
| 56 | + |
| 57 | +You can find more details about the specific alerts by using one of the following methods: |
| 58 | + |
| 59 | +- If your Storage Appliance is set up to send logs to a |
| 60 | + [Log Analytics workspace](/azure/azure-monitor/logs/log-analytics-workspace-overview) (LAW), you can gather |
| 61 | + more details by running the following query in your LAW, replacing `<start time>` with a suitable value. |
| 62 | + ```kusto |
| 63 | + StorageApplianceAlerts |
| 64 | + | where TIMESTAMP > <start time> |
| 65 | + ``` |
| 66 | + This log will give you more details of the alert, and may also provide a link to a specific troubleshooting |
| 67 | + article from your storage vendor. |
| 68 | +- If you don't have log streaming to a LAW set up, you can still get the details. Navigate to the |
| 69 | + Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select |
| 70 | + `Nexus Storage Alerts Open` from the `Metric` dropdown. Then click `Apply splitting` and select all |
| 71 | + of the boxes. You then see a summary of the alert and the vendor alert code. Use this information |
| 72 | + to search your vendor documentation for further details of the alert. |
| 73 | + |
| 74 | +:::image type="content" source="media/storage-metrics-alerts.png" alt-text="Screenshot of a metric showing an active alert on a Storage Appliance." lightbox="media/storage-metrics-alerts.png"::: |
| 75 | + |
| 76 | +Once you have this information, use it to determine the appropriate next action. Do one of the following: |
| 77 | + |
| 78 | +- Take an action yourself (such as reseating a cable). |
| 79 | +- Raise a ticket with your storage vendor. |
| 80 | +- Raise a ticket with Microsoft. Include the Storage Appliance resource ID and the details of the |
| 81 | + health event to speed up issue triage. |
| 82 | + |
| 83 | +## Latency |
| 84 | + |
| 85 | +The health events in this list indicate that the appliance has high latency: |
| 86 | + |
| 87 | +- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance |
| 88 | + exceeds 3 ms. |
| 89 | +- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance |
| 90 | + exceeds 100 ms. |
| 91 | + |
| 92 | +The expected latency for Pure X-series is 1 ms or less. |
| 93 | + |
| 94 | +The root cause of high latency could be an issue with the appliance, or high load. First, check if high load |
| 95 | +is the cause: |
| 96 | + |
| 97 | +- Navigate to the Storage Appliance in the portal. |
| 98 | +- Navigate to the `Monitoring > Metrics` tab. |
| 99 | +- Select the `Nexus Storage Array Latency` metric. Click `Apply splitting`, and select `Dimension` as |
| 100 | + the dimension to split on. |
| 101 | +- Click `+ New Chart`, and select the `Nexus Storage Array Performance Throughput Iops (Avg)` metric. |
| 102 | + Click `Apply splitting`, and select `Dimension` as the dimension to split on. |
| 103 | + |
| 104 | +:::image type="content" source="media/storage-metrics-latency-throughput.png" alt-text="Screenshot of a metric showing the latency and throughput on a Storage Appliance." lightbox="media/storage-metrics-latency-throughput.png"::: |
| 105 | + |
| 106 | +Compare the resulting graphs. If latency rises and falls in step with throughput, high load is the likely |
| 107 | +cause, and reducing the load should resolve the health event. |
| 108 | + |
| 109 | +If the issue is _not_ high load, you should raise a ticket with your Storage Appliance vendor. |
| 110 | + |
| 111 | +## Network interface errors |
| 112 | + |
| 113 | +The health event in this list indicates that the appliance has network interface errors: |
| 114 | + |
| 115 | +- `StorageApplianceNetworkErrorsDegraded`, which means the average rate of network interface errors |
| 116 | + on one or more interfaces has exceeded 3%. |
| 117 | + |
| 118 | +To identify the unhealthy network interfaces and the distribution of the errors, navigate |
| 119 | +to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select |
| 120 | +`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then click |
| 121 | +`Apply splitting`, and select the `Dimension` and `Name` boxes. Select a time range |
| 122 | +that starts shortly before the start time of the resource health alert. After identifying the |
| 123 | +unhealthy network interfaces and error types, raise a ticket with your Storage Appliance |
| 124 | +vendor. |
| 125 | + |
| 126 | +:::image type="content" source="media/storage-metrics-network-errors.png" alt-text="Screenshot of a metric showing network interface errors on a Storage Appliance." lightbox="media/storage-metrics-network-errors.png"::: |
| 127 | + |
| 128 | +## Network latency |
| 129 | + |
| 130 | +The health events in this list indicate that the appliance has high networking latency: |
| 131 | + |
| 132 | +- `StorageApplianceNetworkLatencyDegraded`, which means the latency between the initiator and the Storage |
| 133 | + Appliance exceeds 25 ms. |
| 134 | +- `StorageApplianceNetworkLatencyUnavailable`, which means the latency between the initiator and the Storage |
| 135 | + Appliance exceeds 100 ms. |
| 136 | + |
| 137 | +This increased latency implies an underlying problem with the networking between the Bare Metal Machines |
| 138 | +(BMMs) and the Storage Appliance. Latency can be introduced on any of the hops between the BMMs and the |
| 139 | +Storage Appliance. Raise a ticket with Microsoft, referencing this troubleshooting article. |
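
The thresholds listed in this article can be summarized in a short sketch. The following Python helper is illustrative only: the `THRESHOLDS` table, the signal keys, and the `expected_health_event` function are hypothetical names, and the code is not how Azure Resource Health evaluates these events. It assumes, following the wording above, that the capacity events fire at the stated percentage "or greater", while the latency and error events fire only when the value exceeds the stated limit.

```python
from typing import Optional

# Hypothetical lookup table built from the thresholds quoted in this article.
# Each entry is (inclusive, levels), with levels ordered from most to least severe.
# inclusive=True  -> the event applies at the limit or above ("or greater").
# inclusive=False -> the event applies only above the limit ("exceeds").
THRESHOLDS = {
    "capacity_percent": (True, [
        (90.0, "SACapacityThresholdUnhealthy"),
        (80.0, "SACapacityThresholdDegraded"),
    ]),
    "array_latency_ms": (False, [
        (100.0, "StorageApplianceLatencyUnavailable"),
        (3.0, "StorageApplianceLatencyDegraded"),
    ]),
    "network_error_percent": (False, [
        (3.0, "StorageApplianceNetworkErrorsDegraded"),
    ]),
    "network_latency_ms": (False, [
        (100.0, "StorageApplianceNetworkLatencyUnavailable"),
        (25.0, "StorageApplianceNetworkLatencyDegraded"),
    ]),
}


def expected_health_event(signal: str, value: float) -> Optional[str]:
    """Return the most severe health event whose threshold the value crosses, or None."""
    inclusive, levels = THRESHOLDS[signal]
    for limit, event in levels:
        crossed = (value >= limit) if inclusive else (value > limit)
        if crossed:
            return event
    return None


if __name__ == "__main__":
    # Example readings taken from the metrics charts described in this article.
    print(expected_health_event("capacity_percent", 85.0))
    print(expected_health_event("network_latency_ms", 12.0))
```

A helper like this can be useful when reading metric values from the portal, to check which health event a given reading is expected to correspond to before deciding on the next action.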