Commit eab1696

Add troubleshooting for ARH Storage Appliance events
1 parent 952c7f5 commit eab1696

2 files changed

+125
-1
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 6 additions & 1 deletion
@@ -244,7 +244,7 @@
244244
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
245245
- name: How to configure BGP prefix limit on Customer Edge (CE) devices for Azure Operator Nexus
246246
href: howto-configure-bgp-prefix-limit-on-customer-edge-devices.md
247-
- name: BMP log streaming in Azure Operator Nexus Network Fabric
247+
- name: BMP log streaming in Azure Operator Nexus Network Fabric
248248
href: concepts-bmp-log-streaming.md
249249
- name: How to enable / disable BMP log streaming Azure Operator Nexus
250250
href: howto-enable-log-streaming.md
@@ -362,6 +362,11 @@
362362
- name: Troubleshooting
363363
expanded: true
364364
items:
365+
- name: Resource Health
366+
expanded: false
367+
items:
368+
- name: Troubleshoot Unhealthy or Degraded Storage Appliance
369+
href: troubleshoot-unhealthy-degraded-storage-appliance.md
365370
- name: Network Fabric
366371
expanded: false
367372
items:
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
---
2+
title: Troubleshooting an Unhealthy or Degraded Storage Appliance
3+
description: Troubleshooting a Storage Appliance which has Azure Resource Health alerts
4+
author: jensheasby
5+
ms.author: jensheasby
6+
ms.date: 06/10/2025
7+
ms.topic: troubleshooting
8+
ms.service: azure-operator-nexus
9+
---
10+
11+
# Troubleshooting an Unhealthy or Degraded Storage Appliance
12+
13+
This article provides troubleshooting advice and escalation methods for Storage Appliances which are
14+
unhealthy or degraded.
15+
16+
## Capacity Threshold Reached
17+
18+
Health events in this category have one of the following "Availability Impacting Reason" values:
19+
20+
- `SACapacityThresholdDegraded`, which means the Storage Appliance is at 80% capacity or above.
21+
- `SACapacityThresholdUnhealthy`, which means the Storage Appliance is at 90% capacity or above.
22+
23+
You can see the current usage of the appliance by navigating to the Storage Appliance in the portal,
24+
navigating to the `Monitoring > Metrics` tab and selecting `Nexus Storage Array Space Utilization` from
25+
the `Metric` dropdown.
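
As a quick sanity check, a utilization value read from that metric can be mapped to the reason you would expect to see. The following is a minimal sketch, not part of the service: the function name `capacity_reason` and the constant names are hypothetical, and only the 80% and 90% thresholds and the reason names come from the list above.

```python
from typing import Optional

# Thresholds from the "Availability Impacting Reason" list above (percent of capacity used).
DEGRADED_PCT = 80.0
UNHEALTHY_PCT = 90.0


def capacity_reason(utilization_pct: float) -> Optional[str]:
    """Return the expected reason for a utilization value, or None below both thresholds.

    Hypothetical helper for illustration only.
    """
    if utilization_pct >= UNHEALTHY_PCT:
        return "SACapacityThresholdUnhealthy"
    if utilization_pct >= DEGRADED_PCT:
        return "SACapacityThresholdDegraded"
    return None


if __name__ == "__main__":
    # Example readings taken by hand from the metrics chart.
    for pct in (72.5, 80.0, 93.1):
        print(pct, capacity_reason(pct))
```

A reading for which this returns `None` suggests the appliance is below both thresholds and the health event should clear.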
26+
27+
These issues can be addressed by reducing the load on the Storage Appliance. This can be achieved by:
28+
29+
- Moving some workloads to another cluster if one is available.
30+
- Activating array expansions, if those are available and unused.
31+
32+
You can check back on the value of the utilization metric to confirm that it has returned below 80%.
33+
34+
Note that any volume deletions may take up to 24 hours to eradicate from the appliance, and that
35+
any deletions should be carried out slowly to avoid worsening the problem.
36+
37+
## Active Alerts
38+
39+
Health events in this category have one of the following "Availability Impacting Reason" values:
40+
41+
- `StorageApplianceActiveAlertsWarning`, which means there are 1 or more open warning alerts on the
42+
Storage Appliance. This means there is an issue which needs resolving, but the Storage Appliance
43+
should continue to function.
44+
- `StorageApplianceActiveAlertsCritical`, which means there are 1 or more open critical alerts on the
45+
Storage Appliance. This implies a severe problem with the Storage Appliance.
46+
47+
You should find more details of the specific alert(s), and from that determine whether you need to take
48+
an action yourself (such as re-seating a cable), raise a ticket with your storage vendor, or raise a
49+
ticket with Microsoft.
50+
51+
- If you have your Storage Appliance set up to send logs to a Log Analytics Workspace (LAW), you can gather
52+
more details by running the below query in your LAW.
53+
```kusto
54+
StorageApplianceAlerts
55+
| where TIMESTAMP > <start time>
56+
```
57+
This will give you more details of the alert, and may also provide a link to a specific troubleshooting
58+
article from your storage vendor.
59+
- If you do not have log streaming to a LAW set up, you can still get the details by navigating to the
60+
Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab and selecting
61+
`Nexus Storage Alerts Open` from the `Metric` dropdown. Then, you should click `Apply splitting` and
62+
select all of the boxes. You will then see a summary of the alert, as well as the vendor alert code. You
63+
can use this information to search your vendor documentation for further details of the alert.
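
If you script the Log Analytics query shown earlier in this section, the `<start time>` placeholder can be filled with a KQL `datetime(...)` literal. The sketch below only builds the query text. The function name `build_alert_query` is hypothetical, and the table and column names are copied from the query above.

```python
from datetime import datetime, timezone


def build_alert_query(start: datetime) -> str:
    """Build the StorageApplianceAlerts query with `start` as a KQL datetime literal.

    Hypothetical helper. `start` should be timezone-aware; it is converted to UTC.
    """
    stamp = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"StorageApplianceAlerts\n| where TIMESTAMP > datetime({stamp})"


if __name__ == "__main__":
    # Start shortly before the health event appeared.
    print(build_alert_query(datetime(2025, 6, 10, 8, 30, tzinfo=timezone.utc)))
```

Choose a start time shortly before the health event first appeared so that the alert which triggered it is included.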
64+
65+
Once you have this information, you should be able to tell if you can fix the issue yourself, or if
66+
you need to raise a ticket with your Storage Appliance vendor or with Microsoft. If you need to raise a
67+
ticket with Microsoft, include the Storage Appliance name and "Availability Impacting Reason" for
68+
quicker issue triage.
69+
70+
## Latency
71+
72+
Health events in this category have one of the following "Availability Impacting Reason" values:
73+
74+
- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance
75+
has exceeded 1.2ms.
76+
- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance
77+
has exceeded 100ms.
78+
79+
<!-- TODO: needs an update after the new threshold is set (and the new threshold may need to depend on type) -->
80+
81+
The expected latency is 1ms or less.
82+
83+
Latency issues could be caused by an issue with the appliance, or high load. First, check for high
84+
load by navigating to the Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab
85+
and viewing the `Nexus Storage Array Performance Throughput Iops (Avg)` metric, and the
86+
`Nexus Storage Array Latency` metric on the same chart, starting from shortly before the health event
87+
appeared. You should be able to see from this chart whether high load is the cause. If so, reducing the
88+
load will resolve the health event.
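
If you export the two metric series rather than reading the chart by eye, the same comparison can be roughed out in code. This is a heuristic sketch under stated assumptions: the function name `load_looks_like_cause` and the `ratio` cut-off of 1.5 are hypothetical choices, not values defined by the service. Only the 1.2 ms threshold comes from the reasons listed above.

```python
from statistics import mean
from typing import Sequence

LATENCY_DEGRADED_MS = 1.2  # degraded threshold from the reasons listed above


def load_looks_like_cause(
    iops: Sequence[float],
    latency_ms: Sequence[float],
    ratio: float = 1.5,  # assumed cut-off; tune for your environment
) -> bool:
    """Return True when IOPS is clearly higher during high-latency samples.

    Compares mean IOPS for samples above the latency threshold with mean IOPS
    for the remaining samples. Both series must cover the same time points.
    """
    if len(iops) != len(latency_ms):
        raise ValueError("series must be the same length")
    high = [i for i, lat in zip(iops, latency_ms) if lat > LATENCY_DEGRADED_MS]
    normal = [i for i, lat in zip(iops, latency_ms) if lat <= LATENCY_DEGRADED_MS]
    if not high or not normal:
        return False  # nothing to compare against
    return mean(high) >= ratio * mean(normal)
```

A `False` result does not prove the appliance is at fault. It only indicates that the exported samples show no obvious link between load and latency.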
89+
90+
If you have ruled out high load, you should raise a ticket with your Storage Appliance vendor.
91+
92+
## Network Interface Errors
93+
94+
Health events in this category have the following "Availability Impacting Reason":
95+
96+
- `StorageApplianceNetworkErrorsDegraded`, which means the average rate of network interface errors
97+
on one or more interfaces has exceeded 3%. This implies an issue with the network interface(s).
98+
99+
To determine the unhealthy network interface(s), as well as the distribution of the errors, navigate
100+
to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select
101+
`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then, you should click
102+
`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a time range
103+
which starts shortly before the start time of the resource health alert. Once you have identified the
104+
unhealthy network interface(s), and error types, you should raise a ticket with your Storage Appliance
105+
vendor.
106+
107+
## Network Latency
108+
109+
Health events in this category have one of the following "Availability Impacting Reason" values:
110+
111+
- `StorageApplianceNetworkLatencyDegraded`, which means the latency between the initiator and the Storage
112+
Appliance has exceeded 25ms.
113+
- `StorageApplianceNetworkLatencyUnavailable`, which means the latency between the initiator and the Storage
114+
Appliance has exceeded 100ms.
115+
116+
This increased latency implies an underlying problem with the networking between the Bare Metal Machines
117+
(BMMs) and the Storage Appliance. As this can result from any of the hops between BMMs and Storage Appliance,
118+
you should raise a ticket with Microsoft, quoting the availability impacting reason and the text of this
119+
troubleshooting guide (TSG).
