|
| 1 | +--- |
| 2 | +title: Troubleshooting an Unhealthy or Degraded Storage Appliance |
| 3 | +description: Troubleshooting a Storage Appliance which has Azure Resource Health alerts |
| 4 | +author: jensheasby |
| 5 | +ms.author: jensheasby |
| 6 | +ms.date: 06/10/2025 |
| 7 | +ms.topic: troubleshooting |
| 8 | +ms.service: azure-operator-nexus |
| 9 | +--- |
| 10 | + |
| 11 | +# Troubleshooting an Unhealthy or Degraded Storage Appliance |
| 12 | + |
| 13 | +This article provides troubleshooting advice and escalation methods for Storage Appliances which are |
| 14 | +unhealthy or degraded. |
| 15 | + |
| 16 | +## Capacity Threshold Reached |
| 17 | + |
| 18 | +This will have an "Availability Impacting Reason" of: |
| 19 | + |
| 20 | +- `SACapacityThresholdDegraded`, which means the Storage Appliance is at 80% capacity or above. |
| 21 | +- `SACapacityThresholdUnhealthy`, which means the Storage Appliance is at 90% capacity or above. |
| 22 | + |
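The two thresholds above can be read as a simple classification of the utilization metric. The following is a minimal Python sketch for illustration only, not part of the product: the function name `classify_capacity` and the `None` return value for a healthy appliance are assumptions. Only the reason strings and the 80% and 90% thresholds come from this article.

```python
from typing import Optional

# Thresholds taken from the list above; "or above" is treated as an inclusive comparison.
DEGRADED_PERCENT = 80.0
UNHEALTHY_PERCENT = 90.0


def classify_capacity(utilization_percent: float) -> Optional[str]:
    """Map a space-utilization percentage to the reason it would trigger.

    Hypothetical helper for illustration; returns None below both thresholds.
    """
    if utilization_percent >= UNHEALTHY_PERCENT:
        return "SACapacityThresholdUnhealthy"
    if utilization_percent >= DEGRADED_PERCENT:
        return "SACapacityThresholdDegraded"
    return None
```

For example, `classify_capacity(85.0)` falls in the degraded band, so utilization would need to drop below 80% before the health event clears.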
| 23 | +You can see the current usage of the appliance by navigating to the Storage Appliance in the portal, |
| 24 | +navigating to the `Monitoring > Metrics` tab and selecting `Nexus Storage Array Space Utilization` from |
| 25 | +the `Metric` dropdown. |
| 26 | + |
| 27 | +These issues can be addressed by reducing the load on the Storage Appliance. This can be achieved by: |
| 28 | + |
| 29 | +- Moving some workloads to another cluster if one is available. |
| 30 | +- Activating array expansions, if those are available and unused. |
| 31 | + |
| 32 | +You can check back on the value of the utilization metric to confirm that it has returned below 80%. |
| 33 | + |
| 34 | +Note that any volume deletions may take up to 24 hours to eradicate from the appliance, and that |
| 35 | +any deletions should be carried out slowly to avoid worsening the problem. |
| 36 | + |
| 37 | +## Active Alerts |
| 38 | + |
| 39 | +This will have an "Availability Impacting Reason" of: |
| 40 | + |
| 41 | +- `StorageApplianceActiveAlertsWarning`, which means there are 1 or more open warning alerts on the |
| 42 | + Storage Appliance. This means there is an issue which needs resolving, but the Storage Appliance |
| 43 | + should continue to function. |
| 44 | +- `StorageApplianceActiveAlertsCritical`, which means there are 1 or more open critical alerts on the |
| 45 | + Storage Appliance. This implies a severe problem with the Storage Appliance. |
| 46 | + |
| 47 | +You should find more details of the specific alert(s), and from that determine whether you need to take |
| 48 | +an action yourself (such as re-seating a cable), raise a ticket with your storage vendor, or raise a |
| 49 | +ticket with Microsoft. |
| 50 | + |
| 51 | +- If you have your Storage Appliance set up to send logs to a Log Analytics Workspace (LAW), you can gather |
| 52 | + more details by running the below query in your LAW. |
| 53 | + ```kusto |
| 54 | + StorageApplianceAlerts |
| 55 | + | where TIMESTAMP > <start time> |
| 56 | + ``` |
| 57 | + This will give you more details of the alert, and may also provide a link to a specific troubleshooting |
| 58 | + article from your storage vendor. |
| 59 | +- If you do not have log streaming to a LAW set up, you can still get the details by navigating to the |
| 60 | + Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab and selecting |
| 61 | + `Nexus Storage Alerts Open` from the `Metric` dropdown. Then, you should click `Apply splitting` and |
| 62 | + select all of the boxes. You will then see a summary of the alert, as well as the vendor alert code. You |
| 63 | + can use this information to search your vendor documentation for further details of the alert. |
| 64 | + |
| 65 | +Once you have this information, you should be able to tell whether you can fix the issue yourself or |
| 66 | +need to raise a ticket with your Storage Appliance vendor or with Microsoft. If you raise a |
| 67 | +ticket with Microsoft, include the Storage Appliance name and the "Availability Impacting Reason" for |
| 68 | +quicker issue triage. |
| 69 | + |
| 70 | +## Latency |
| 71 | + |
| 72 | +This will have an "Availability Impacting Reason" of: |
| 73 | + |
| 74 | +- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance |
| 75 | + has exceeded 1.2ms. |
| 76 | +- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance |
| 77 | + has exceeded 100ms. |
| 78 | + |
| 79 | +<!-- TODO: needs an update after the new threshold is set (and the new threshold may need to depend on type) --> |
| 80 | + |
| 81 | +The expected latency is 1ms or less. |
| 82 | + |
| 83 | +Latency issues can be caused by a fault on the appliance or by high load. First, check for high |
| 84 | +load: navigate to the Storage Appliance in the portal, open the `Monitoring > Metrics` tab, and |
| 85 | +plot the `Nexus Storage Array Performance Throughput Iops (Avg)` metric and the |
| 86 | +`Nexus Storage Array Latency` metric on the same chart, starting from shortly before the health event |
| 87 | +appeared. If latency rises and falls with IOPS, high load is the likely cause, and reducing the |
| 88 | +load should resolve the health event. |
| 89 | + |
| 90 | +If you have ruled out high load, you should raise a ticket with your Storage Appliance vendor. |
| 91 | + |
| 92 | +## Network Interface Errors |
| 93 | + |
| 94 | +This will have an "Availability Impacting Reason" of: |
| 95 | + |
| 96 | +- `StorageApplianceNetworkErrorsDegraded`, which means the average rate of network interface errors |
| 97 | + on one or more interfaces has exceeded 3%. This implies an issue with the network interface(s). |
| 98 | + |
| 99 | +To determine the unhealthy network interface(s), as well as the distribution of the errors, navigate |
| 100 | +to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select |
| 101 | +`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then, you should click |
| 102 | +`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a time range |
| 103 | +which starts shortly before the start time of the resource health alert. Once you have identified the |
| 104 | +unhealthy network interface(s), and error types, you should raise a ticket with your Storage Appliance |
| 105 | +vendor. |
| 106 | + |
| 107 | +## Network Latency |
| 108 | + |
| 109 | +This will have an "Availability Impacting Reason" of: |
| 110 | + |
| 111 | +- `StorageApplianceNetworkLatencyDegraded`, which means the latency between the initiator and the Storage |
| 112 | + Appliance has exceeded 25ms. |
| 113 | +- `StorageApplianceNetworkLatencyUnavailable`, which means the latency between the initiator and the Storage |
| 114 | + Appliance has exceeded 100ms. |
| 115 | + |
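The latency thresholds listed here and in the Latency section follow the same degraded/unavailable pattern, which can be sketched as a small lookup. This is an illustrative Python sketch under stated assumptions: the function `classify_latency`, the `LATENCY_THRESHOLDS_MS` table, and the reading of "exceeded" as a strict comparison are not taken from the product. The threshold values and reason names are the ones quoted in this article and may change.

```python
from typing import Optional

# (degraded_ms, unavailable_ms) per check, using the values quoted in this article.
LATENCY_THRESHOLDS_MS = {
    "StorageApplianceLatency": (1.2, 100.0),
    "StorageApplianceNetworkLatency": (25.0, 100.0),
}


def classify_latency(check: str, latency_ms: float) -> Optional[str]:
    """Return the reason a latency reading would trigger, or None if healthy.

    Hypothetical helper for illustration only.
    """
    degraded_ms, unavailable_ms = LATENCY_THRESHOLDS_MS[check]
    # Assumption: "exceeded" means strictly greater than the threshold.
    if latency_ms > unavailable_ms:
        return f"{check}Unavailable"
    if latency_ms > degraded_ms:
        return f"{check}Degraded"
    return None
```

Under these assumptions, a 30 ms reading between the initiator and the appliance maps to `StorageApplianceNetworkLatencyDegraded`, while the same reading as self-reported array latency maps to `StorageApplianceLatencyDegraded`.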
| 116 | +This increased latency implies an underlying problem with the networking between the Bare Metal Machines |
| 117 | +(BMMs) and the Storage Appliance. Because the problem can originate at any hop between the BMMs and the |
| 118 | +Storage Appliance, you should raise a ticket with Microsoft, quoting the "Availability Impacting Reason" |
| 119 | +and the text of this troubleshooting guide. |