Commit 73ee0f8

Make changes for acrolinx
1 parent 7c60d72 commit 73ee0f8

1 file changed

+50
-48
lines changed

@@ -1,6 +1,6 @@
11
---
22
title: Troubleshooting an Unhealthy or Degraded Storage Appliance
3-
description: Troubleshooting a Storage Appliance which has Azure Resource Health alerts
3+
description: Troubleshooting a Storage Appliance that has Azure Resource Health alerts
44
author: jensheasby
55
ms.author: jensheasby
66
ms.date: 06/10/2025
@@ -10,126 +10,128 @@ ms.service: azure-operator-nexus
1010

1111
# Troubleshooting an Unhealthy or Degraded Storage Appliance
1212

13-
This article provides troubleshooting advice and escalation methods for Storage Appliances which are
13+
This article provides troubleshooting advice and escalation methods for Storage Appliances that are
1414
unhealthy or degraded.
1515

1616
## Capacity Threshold Reached
1717

18-
This will have an "Availability Impacting Reason" of:
18+
The health events listed in the text below indicate that the appliance is nearing capacity:
1919

2020
- `SACapacityThresholdDegraded`, which means the Storage Appliance is at 80% capacity or above.
2121
- `SACapacityThresholdUnhealthy`, which means the Storage Appliance is at 90% capacity or above.
2222

2323
You can see the current usage of the appliance by navigating to the Storage Appliance in the portal,
24-
navigating to the `Monitoring > Metrics` tab and selecting `Nexus Storage Array Space Utilization` from
24+
navigating to the `Monitoring > Metrics` tab, and selecting `Nexus Storage Array Space Utilization` from
2525
the `Metric` dropdown.
2626

2727
:::image type="content" source="media/storage-metrics-utilization.png" alt-text="Metric showing the percentage utilization of a Storage Appliance":::
2828

29-
These issues can be addressed by reducing the load on the Storage Appliance. This can be achieved by:
29+
These issues can be addressed by reducing the load on the Storage Appliance. This outcome can be
30+
achieved by:
3031

31-
- Moving some workloads to another cluster if one is available, and your workload supports this.
32+
- Moving some workloads to another cluster if one is available, and this is a supported operation for
33+
your workload:
3234
- Re-create the workload on a different cluster (Operator Nexus).
33-
- Perform steps required to migrate traffic to the new cluster (this will depend on your workload).
35+
- Perform steps required to migrate traffic to the new cluster (the specific steps required will
36+
depend on your workload).
3437
- Delete the workload from the current cluster.
3538
- Adding array expansions, if you have empty array expansion spaces in your aggregator rack. Speak to
36-
your storage vendor for information on how to do this.
39+
your storage vendor for instructions.
3740

38-
You can check back on the value of the utilization metric to confirm that it has returned below 80%.
41+
You can confirm that utilization is reduced by checking the metric again.
3942

40-
Note than any volume deletions may take up to 24 hours to eradicate from the appliance, and that
43+
Note that any volume deletions may take up to 24 hours to eradicate from the appliance, and that
4144
any deletions should be carried out slowly to avoid worsening the problem.
4245

4346
## Active Alerts
4447

45-
This will have an "Availability Impacting Reason" of:
48+
The health events listed in the text below indicate that the appliance has active alerts:
4649

47-
- `StorageApplianceActiveAlertsWarning`, which means there are 1 or more open warning alerts on the
48-
Storage Appliance. This means there is an issue which needs resolving, but the Storage Appliance
49-
should continue to function.
50-
- `StorageApplianceActiveAlertsCritical`, which means there are 1 or more open critical alerts on the
51-
Storage Appliance. This implies a severe problem with the Storage Appliance.
50+
- `StorageApplianceActiveAlertsWarning`, which means there are one or more open warning alerts on the
51+
Storage Appliance. Warning alerts indicate that there is an issue that requires attention, but the Storage
52+
Appliance should continue to function.
53+
- `StorageApplianceActiveAlertsCritical`, which means there are one or more open critical alerts on the
54+
Storage Appliance. Critical alerts indicate a severe problem with the Storage Appliance.
5255

53-
You should find more details of the specific alert(s), and from that determine whether you need to take
54-
an action yourself (such as re-seating a cable), raise a ticket with your storage vendor, or raise a
55-
ticket with Microsoft.
56+
You should find more details of the specific alert(s), using the following instructions:
5657

5758
- If you have your Storage Appliance set up to send logs to a Log Analytics Workspace (LAW), you can gather
58-
more details by running the below query in your LAW.
59+
more details by running the query from the text block below in your LAW.
5960
```
6061
StorageApplianceAlerts
6162
| where TIMESTAMP > <start time>
6263
```
63-
This will give you more details of the alert, and may also provide a link to a specific troubleshooting
64+
This log will give you more details of the alert, and may also provide a link to a specific troubleshooting
6465
article from your storage vendor.
6566
- If you do not have log streaming to a LAW set up, you can still get the details by navigating to the
66-
Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab and selecting
67+
Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab, and selecting
6768
`Nexus Storage Alerts Open` from the `Metric` dropdown. Then, you should click `Apply splitting` and
68-
select all of the boxes. You will then see a summary of the alert, as well as the vendor alert code. You
69-
can use this information to search your vendor documentation for further details of the alert.
69+
select all of the boxes. You will see a summary of the alert, and the vendor alert code. Use this information
70+
to search your vendor documentation for further details of the alert.
7071

7172
:::image type="content" source="media/storage-metrics-alerts.png" alt-text="Metric showing an active alert on a Storage Appliance":::
7273

73-
Once you have this information, you should be able to tell if you can fix the issue yourself, or if
74-
you need to raise a ticket with your Storage Appliance vendor or with us. If you need to raise a
75-
ticket with us, please include the Storage Appliance name and "Availability Impacting Reason" for
76-
quicker issue triage.
74+
Once you have this information, use it to determine the appropriate next action. You should either:
75+
76+
- Take an action yourself (such as reseating a cable).
77+
- Raise a ticket with your storage vendor.
78+
- Raise a ticket with Microsoft. If you need to raise a ticket with us, please include the Storage Appliance
79+
name and the details of the health event for quicker issue triage.
7780

7881
## Latency
7982

80-
This will have an "Availability Impacting Reason" of:
83+
The health events listed in the text below indicate that the appliance has high latency:
8184

8285
- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance
83-
has exceeded 3ms.
86+
exceeds 3 ms.
8487
- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance
85-
has exceeded 100ms.
88+
exceeds 100 ms.
8689

87-
The expected latency for Pure X-series is 1ms or less.
90+
The expected latency for Pure X-series is 1 ms or less.
8891

89-
Latency issues could be caused by an issue with the appliance, or high load. First, check if high load
92+
The root cause of high latency could be an issue with the appliance, or high load. First, check if high load
9093
is the cause:
9194

9295
- Navigate to the Storage Appliance on the portal.
9396
- Navigate to the `Monitoring > Metrics` tab.
94-
- Select the `Nexus Storage Array Latency` metric, and click `Apply splitting`, selecting `Dimension` as
97+
- Select the `Nexus Storage Array Latency` metric. Click `Apply splitting`, and select `Dimension` as
9598
the dimension to split on.
9699
- Click `+ New Chart`, and select the `Nexus Storage Array Performance Throughput Iops (Avg)` metric.
97100
Click `Apply Splitting`, and select `Dimension` as the dimension to split on
98101

99102
:::image type="content" source="media/storage-metrics-latency-throughput.png" alt-text="Metric showing the latency and throughput on a Storage Appliance":::
100103

101-
By comparing the resulting graphs, you can determine whether high load is the cause. If so, reducing the
102-
load will resolve the health event.
104+
By comparing the resulting graphs, you can determine whether high load is the cause. If so, reduce the
105+
load to resolve the health event.
103106

104-
If you have ruled out high load, you should raise a ticket with your Storage Appliance vendor.
107+
If the issue is _not_ high load, you should raise a ticket with your Storage Appliance vendor.
105108

106109
## Network Interface Errors
107110

108-
This will have an "Availability Impacting Reason" of:
111+
The health event listed in the text below indicates that the appliance has network interface errors:
109112

110113
- `StorageApplianceNetworkErrorsDegraded`, which means the average rate of network interface errors
111-
on one or more interfaces has exceeded 3%. This implies an issue with the network interface(s).
114+
on one or more interfaces has exceeded 3%.
112115

113-
To determine the unhealthy network interface(s), as well as the distribution of the errors, navigate
116+
To determine the unhealthy network interfaces, as well as the distribution of the errors, navigate
114117
to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab select
115118
`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then, you should click
116119
`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a time range
117-
which starts shortly before the start time of the resource health alert. Once you have identified the
120+
that starts shortly before the start time of the resource health alert. After identifying the
118121
unhealthy network interface(s), and error types, you should raise a ticket with your Storage Appliance
119122
vendor.
120123

121-
:::image type="content" source="media/storage-metrics-network-error.png" alt-text="Metric showing network interface errors on a Storage Appliance":::
124+
:::image type="content" source="media/storage-metrics-network-errors.png" alt-text="Metric showing network interface errors on a Storage Appliance":::
122125

123126
## Network Latency
124127

125-
This will have an "Availability Impacting Reason" of:
128+
The health events listed in the text below indicate that the appliance has high networking latency:
126129

127130
- `StorageApplianceNetworkLatencyDegraded`, which means the latency between the initiator and the Storage
128-
Appliance has exceeded 25ms.
131+
Appliance exceeds 25 ms.
129132
- `StorageApplianceNetworkLatencyUnavailable`, which means the latency between the initiator and the Storage
130-
Appliance has exceeded 100ms.
133+
Appliance exceeds 100 ms.
131134

132135
This increased latency implies an underlying problem with the networking between the Bare Metal Machines
133-
(BMMs) and the Storage Appliance. As this can result from any of the hops between BMMs and Storage Appliance,
134-
you should raise a ticket with Microsoft, quoting the availability impacting reason and the text of this
135-
troubleshooting guide (TSG).
136+
(BMMs) and the Storage Appliance. Latency can be introduced on any of the hops between BMMs and Storage Appliance.
137+
You should raise a ticket with Microsoft, quoting the text of this troubleshooting guide (TSG).
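
As a reading aid for the thresholds quoted in the article text above, the following is a minimal Python sketch that maps a metric reading to the health event the article associates with it. It is not how Azure Resource Health evaluates these signals (the actual evaluation window and averaging are not described here). The metric keys (`capacity_percent`, `array_latency_ms`, `network_error_percent`, `network_latency_ms`), the `Rule` type, and the `expected_health_event` function are hypothetical names introduced only for illustration; the thresholds and event names come from the article.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    threshold: float   # boundary value quoted in the article
    inclusive: bool    # True for "at X or above", False for "exceeds X"
    event: str         # health event name quoted in the article


# Hypothetical metric keys; these are NOT Azure metric names.
# Rules are ordered most severe first so the first match wins.
RULES = {
    "capacity_percent": [
        Rule(90.0, True, "SACapacityThresholdUnhealthy"),
        Rule(80.0, True, "SACapacityThresholdDegraded"),
    ],
    "array_latency_ms": [
        Rule(100.0, False, "StorageApplianceLatencyUnavailable"),
        Rule(3.0, False, "StorageApplianceLatencyDegraded"),
    ],
    "network_error_percent": [
        Rule(3.0, False, "StorageApplianceNetworkErrorsDegraded"),
    ],
    "network_latency_ms": [
        Rule(100.0, False, "StorageApplianceNetworkLatencyUnavailable"),
        Rule(25.0, False, "StorageApplianceNetworkLatencyDegraded"),
    ],
}


def expected_health_event(metric: str, value: float) -> Optional[str]:
    """Return the most severe event whose threshold the value crosses, or None."""
    for rule in RULES[metric]:
        if rule.inclusive:
            crossed = value >= rule.threshold
        else:
            crossed = value > rule.threshold
        if crossed:
            return rule.event
    return None


if __name__ == "__main__":
    # Example readings, chosen only to exercise each section of the article.
    for metric, value in [
        ("capacity_percent", 85.0),
        ("array_latency_ms", 1.0),
        ("network_latency_ms", 120.0),
    ]:
        print(metric, value, expected_health_event(metric, value))
```

The capacity rules are inclusive because the article says "at 80% capacity or above", while the latency and error rules are strict because the article says "exceeds". A reading of exactly 3 ms therefore maps to no event in this sketch.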
