
Commit f5afad5

Merge pull request #301564 from jenSheasby/ARHStorageApplianceTSG
Add troubleshooting for ARH Storage Appliance events
2 parents 6e19d99 + 235a1c8 commit f5afad5

6 files changed: +169 -29 lines changed
articles/operator-nexus/TOC.yml

Lines changed: 30 additions & 29 deletions
@@ -366,11 +366,30 @@
366 366   - name: Troubleshooting
367 367   expanded: true
368 368   items:
369     - - name: Resource Health
    369 + - name: Bare Metal Machine
370 370   expanded: false
371 371   items:
372     - - name: Troubleshoot Resource Health alerts
373     - href: troubleshoot-resource-health-alerts.md
    372 + - name: Troubleshoot Bare Metal Server Problems
    373 + href: troubleshoot-reboot-reimage-replace.md
    374 + - name: Troubleshoot Bare Metal Machine Provisioning
    375 + href: troubleshoot-bare-metal-machine-provisioning.md
    376 + - name: Troubleshoot Hardware Validation Failure
    377 + href: troubleshoot-hardware-validation-failure.md
    378 + - name: Troubleshoot Degraded status
    379 + href: troubleshoot-bare-metal-machine-degraded.md
    380 + - name: Troubleshoot Warning status
    381 + href: troubleshoot-bare-metal-machine-warning.md
    382 + - name: Troubleshoot Out of Memory Pods
    383 + href: troubleshoot-memory-limits.md
    384 + - name: Cluster
    385 + expanded: false
    386 + items:
    387 + - name: Troubleshoot Accepted Cluster Resource
    388 + href: troubleshoot-accepted-cluster-hydration.md
    389 + - name: Troubleshoot Control Plane Quorum
    390 + href: troubleshoot-control-plane-quorum.md
    391 + - name: Troubleshoot Cluster heartbeat connection status disconnected
    392 + href: troubleshoot-cluster-heartbeat-connection-status-disconnected.md
374 393   - name: Network Fabric
375 394   expanded: false
376 395   items:
@@ -382,30 +401,18 @@
382 401   href: troubleshoot-dns-issues.md
383 402   - name: Troubleshoot TWAMP (UDP) not working
384 403   href: troubleshoot-twamp-udp-not-working.md
385     - - name: Cluster
    404 + - name: Resource Health
386 405   expanded: false
387 406   items:
388     - - name: Troubleshoot Accepted Cluster Resource
389     - href: troubleshoot-accepted-cluster-hydration.md
390     - - name: Troubleshoot Control Plane Quorum
391     - href: troubleshoot-control-plane-quorum.md
392     - - name: Troubleshoot Cluster heartbeat connection status disconnected
393     - href: troubleshoot-cluster-heartbeat-connection-status-disconnected.md
394     - - name: Bare Metal Machine
    407 + - name: Troubleshoot Resource Health alerts
    408 + href: troubleshoot-resource-health-alerts.md
    409 + - name: Troubleshoot Unhealthy or Degraded Storage Appliance
    410 + href: troubleshoot-unhealthy-degraded-storage-appliance.md
    411 + - name: Storage Appliance
395 412   expanded: false
396 413   items:
397     - - name: Troubleshoot Bare Metal Server Problems
398     - href: troubleshoot-reboot-reimage-replace.md
399     - - name: Troubleshoot Bare Metal Machine Provisioning
400     - href: troubleshoot-bare-metal-machine-provisioning.md
401     - - name: Troubleshoot Hardware Validation Failure
402     - href: troubleshoot-hardware-validation-failure.md
403     - - name: Troubleshoot Degraded status
404     - href: troubleshoot-bare-metal-machine-degraded.md
405     - - name: Troubleshoot Warning status
406     - href: troubleshoot-bare-metal-machine-warning.md
407     - - name: Troubleshoot Out of Memory Pods
408     - href: troubleshoot-memory-limits.md
    414 + - name: Troubleshoot Multiple Storage appliances
    415 + href: troubleshoot-multiple-storage-appliances.md
409 416   - name: Tenant Workload
410 417   expanded: false
411 418   items:
@@ -439,12 +446,6 @@
439 446   href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
440 447   - name: Troubleshoot a Kubernetes Cluster Node in NotReady,Scheduling Disabled after Runtime Upgrade
441 448   href: troubleshoot-kubernetes-cluster-node-cordoned.md
442     - - name: Storage Appliance
443     - expanded: false
444     - items:
445     - - name: Troubleshoot Multiple Storage appliances
446     - href: troubleshoot-multiple-storage-appliances.md
447     -
448 449   - name: FAQ
449 450   href: azure-operator-nexus-faq.md
450 451   - name: Reference
Four image files changed (60.4 KB, 215 KB, 118 KB, 68.3 KB); previews not shown.
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
---
title: Troubleshooting an Unhealthy or Degraded Storage Appliance
description: Troubleshooting a Storage Appliance that has Azure Resource Health alerts
author: jensheasby
ms.author: jensheasby
ms.date: 06/10/2025
ms.topic: troubleshooting
ms.service: azure-operator-nexus
---

# Troubleshooting an unhealthy or degraded storage appliance

This article provides troubleshooting advice and escalation methods for Storage Appliances that are
unhealthy or degraded.

## Capacity threshold reached

The health events in this list indicate that the appliance is nearing capacity:

- `SACapacityThresholdDegraded`, which means the Storage Appliance is at 80% capacity or greater.
- `SACapacityThresholdUnhealthy`, which means the Storage Appliance is at 90% capacity or greater.

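The two capacity thresholds can be read as a simple classification rule. The following Python sketch is illustrative only: the function name and the `"Healthy"` label are hypothetical, and it assumes you already have the utilization percentage (for example, read from the metric in the portal).

```python
def classify_capacity(utilization_percent: float) -> str:
    """Map a space-utilization percentage to the capacity health event it implies.

    Thresholds (80% and 90%) come from the article; the "Healthy" label
    is a hypothetical placeholder for "no capacity event".
    """
    if not 0 <= utilization_percent <= 100:
        raise ValueError("utilization_percent must be between 0 and 100")
    if utilization_percent >= 90:
        return "SACapacityThresholdUnhealthy"
    if utilization_percent >= 80:
        return "SACapacityThresholdDegraded"
    return "Healthy"
```

For example, `classify_capacity(85.0)` returns `"SACapacityThresholdDegraded"`.
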
To see the current usage of the appliance, go to the Storage Appliance in the portal, open the
`Monitoring > Metrics` tab, and select `Nexus Storage Array Space Utilization` from the `Metric`
dropdown.

:::image type="content" source="media/storage-metrics-utilization.png" alt-text="Screenshot of a metric showing the percentage utilization of a Storage Appliance." lightbox="media/storage-metrics-utilization.png":::

You can resolve these events by reducing the load on the Storage Appliance, in either of these ways:

- Moving some workloads to another cluster, if one is available and this is a supported operation for
  your workload:
  - Re-create the workload on a different cluster (Operator Nexus).
  - Perform the steps required to migrate traffic to the new cluster (the specific steps depend on
    your workload).
  - Delete the workload from the current cluster.
- Adding array expansions, if you have empty array expansion spaces in your aggregator rack. Speak to
  your storage vendor for instructions.

You can confirm that utilization is reduced by checking the metric again.

Volume deletions can take up to 24 hours to be eradicated from the appliance. Carry out any
deletions slowly to avoid worsening the problem.

## Active alerts

The health events in this list indicate that the appliance has active alerts:

- `StorageApplianceActiveAlertsWarning`, which means there are one or more open warning alerts on the
  Storage Appliance. Warning alerts indicate an issue that requires attention, but the Storage
  Appliance should continue to function.
- `StorageApplianceActiveAlertsCritical`, which means there are one or more open critical alerts on the
  Storage Appliance. Critical alerts indicate a severe problem with the Storage Appliance that may impact
  functionality.

To find more details of the specific alerts, use one of the following approaches:

- If you have your Storage Appliance set up to send logs to a
  [Log Analytics workspace](/azure/azure-monitor/logs/log-analytics-workspace-overview) (LAW), you can gather
  more details by running the following query in your LAW.
  ```
  StorageApplianceAlerts
  | where TIMESTAMP > <start time>
  ```
  This log gives you more details of the alert, and may also provide a link to a specific troubleshooting
  article from your storage vendor.
- If you don't have log streaming to a LAW set up, you can still get the details by navigating to the
  Storage Appliance in the portal, navigating to the `Monitoring > Metrics` tab, and selecting
  `Nexus Storage Alerts Open` from the `Metric` dropdown. Then, click `Apply splitting` and
  select all of the boxes. You see a summary of the alert and the vendor alert code. Use this information
  to search your vendor documentation for further details of the alert.

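The `<start time>` placeholder in the query above needs a concrete value before the query will run. As a minimal sketch, the following Python helper fills it with a KQL `datetime(...)` literal. The helper name is hypothetical; the table and column names are taken from the query above and aren't otherwise verified here.

```python
from datetime import datetime, timezone


def build_alert_query(start_time: datetime) -> str:
    """Return the article's alert query with <start time> filled in.

    start_time must be timezone-aware; it is converted to UTC and
    rendered as a KQL datetime literal.
    """
    if start_time.tzinfo is None:
        raise ValueError("start_time must be timezone-aware")
    literal = start_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"StorageApplianceAlerts\n| where TIMESTAMP > datetime({literal})"
```

Choose a start time shortly before the health event began, so that the alert that triggered the event falls inside the query window.
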
:::image type="content" source="media/storage-metrics-alerts.png" alt-text="Screenshot of a metric showing an active alert on a Storage Appliance." lightbox="media/storage-metrics-alerts.png":::

Once you have this information, use it to determine the appropriate next action. You should do one of
the following:

- Take an action yourself (such as reseating a cable).
- Raise a ticket with your storage vendor.
- Raise a ticket with Microsoft. Include the Storage Appliance resource ID and the details of the
  health event for quicker issue triage.

## Latency

The health events in this list indicate that the appliance has high latency:

- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance
  exceeds 3 ms.
- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance
  exceeds 100 ms.

The expected latency for Pure X-series is 1 ms or less.

The root cause of high latency could be an issue with the appliance, or high load. First, check if high load
is the cause:

- Navigate to the Storage Appliance in the portal.
- Navigate to the `Monitoring > Metrics` tab.
- Select the `Nexus Storage Array Latency` metric. Click `Apply splitting`, and select `Dimension` as
  the dimension to split on.
- Click `+ New Chart`, and select the `Nexus Storage Array Performance Throughput Iops (Avg)` metric.
  Click `Apply splitting`, and select `Dimension` as the dimension to split on.

:::image type="content" source="media/storage-metrics-latency-throughput.png" alt-text="Screenshot of a metric showing the latency and throughput on a Storage Appliance." lightbox="media/storage-metrics-latency-throughput.png":::

Compare the resulting graphs to determine whether high load is the cause. If latency rises and falls
with throughput, high load is the likely cause. If so, reduce the load to resolve the health event.

If the issue is _not_ high load, you should raise a ticket with your Storage Appliance vendor.

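The article compares the latency and throughput graphs by eye. If you export both series for the same time points, one possible way to make that comparison less subjective is a correlation coefficient. The following Python sketch uses Pearson correlation; this is a substituted heuristic, not a method described by the article, and the function names and the `0.7` cutoff are assumptions for illustration.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def load_is_likely_cause(latency_ms: list[float], iops: list[float],
                         threshold: float = 0.7) -> bool:
    """True when latency tracks IOPS closely (threshold is an assumed cutoff)."""
    return pearson(latency_ms, iops) >= threshold
```

A strong positive correlation only suggests that load is driving latency. Treat the result as a prompt for which next step to take, not as a diagnosis.
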
## Network interface errors

The health event in this list indicates that the appliance has network interface errors:

- `StorageApplianceNetworkErrorsDegraded`, which means the average rate of network interface errors
  on one or more interfaces has exceeded 3%.

To determine the unhealthy network interfaces, as well as the distribution of the errors, navigate
to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select
`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then, click
`Apply splitting`, and select the `Dimension` and `Name` boxes. Make sure that the time range
starts shortly before the start time of the resource health alert. After identifying the
unhealthy network interfaces and the error types, raise a ticket with your Storage Appliance
vendor.

:::image type="content" source="media/storage-metrics-network-errors.png" alt-text="Screenshot of a metric showing network interface errors on a Storage Appliance." lightbox="media/storage-metrics-network-errors.png":::

## Network latency

The health events in this list indicate that the appliance has high networking latency:

- `StorageApplianceNetworkLatencyDegraded`, which means the latency between the initiator and the Storage
  Appliance exceeds 25 ms.
- `StorageApplianceNetworkLatencyUnavailable`, which means the latency between the initiator and the Storage
  Appliance exceeds 100 ms.

This increased latency implies an underlying problem with the networking between the Bare Metal Machines
(BMMs) and the Storage Appliance. Latency can be introduced on any of the hops between the BMMs and the
Storage Appliance. You should raise a ticket with Microsoft, quoting the text of this troubleshooting article.
