# Troubleshooting an Unhealthy or Degraded Storage Appliance
12
12
13
-
This article provides troubleshooting advice and escalation methods for Storage Appliances which are
13
+
This article provides troubleshooting advice and escalation methods for Storage Appliances that are
14
14
unhealthy or degraded.
15
15
16
16
## Capacity Threshold Reached
17
17
18
-
This will have an "Availability Impacting Reason" of:
18
+
The health events listed in the text below indicate that the appliance is nearing capacity:
19
19
20
20
- `SACapacityThresholdDegraded`, which means the Storage Appliance is at 80% capacity or above.
21
21
- `SACapacityThresholdUnhealthy`, which means the Storage Appliance is at 90% capacity or above.
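The two thresholds above can be sketched as a small lookup. This is an illustrative sketch only, not the platform's implementation: the function name `capacity_health_event` and the constant names are hypothetical, and only the 80% and 90% thresholds and the event names come from the text.

```python
from typing import Optional

# Thresholds taken from the text above; the constant names are hypothetical.
DEGRADED_THRESHOLD_PERCENT = 80.0
UNHEALTHY_THRESHOLD_PERCENT = 90.0


def capacity_health_event(utilization_percent: float) -> Optional[str]:
    """Return the health event name expected for a utilization value, or None."""
    if not 0.0 <= utilization_percent <= 100.0:
        raise ValueError("utilization_percent must be between 0 and 100")
    # Check the higher threshold first so that 90% and above maps to Unhealthy.
    if utilization_percent >= UNHEALTHY_THRESHOLD_PERCENT:
        return "SACapacityThresholdUnhealthy"
    if utilization_percent >= DEGRADED_THRESHOLD_PERCENT:
        return "SACapacityThresholdDegraded"
    return None
```

A value read from the `Nexus Storage Array Space Utilization` metric could be passed to this function to see which event, if any, it corresponds to.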
22
22
23
23
You can see the current usage of the appliance by navigating to the Storage Appliance in the portal,
24
-
navigating to the `Monitoring > Metrics` tab and selecting `Nexus Storage Array Space Utilization` from
24
+
navigating to the `Monitoring > Metrics` tab, and selecting `Nexus Storage Array Space Utilization` from
25
25
the `Metric` dropdown.
26
26
27
27
:::image type="content" source="media/storage-metrics-utilization.png" alt-text="Metric showing the percentage utilization of a Storage Appliance":::
28
28
29
-
These issues can be addressed by reducing the load on the Storage Appliance. This can be achieved by:
29
+
These issues can be addressed by reducing the load on the Storage Appliance. This outcome can be
30
+
achieved by:
30
31
31
-
- Moving some workloads to another cluster if one is available, and your workload supports this.
32
+
- Moving some workloads to another cluster if one is available, and this is a supported operation for
33
+
your workload:
32
34
- Re-create the workload on a different cluster (Operator Nexus).
33
-
- Perform steps required to migrate traffic to the new cluster (this will depend on your workload).
35
+
- Perform steps required to migrate traffic to the new cluster (the specific steps required will
36
+
depend on your workload).
34
37
- Delete the workload from the current cluster.
35
38
- Adding array expansions, if you have empty array expansion spaces in your aggregator rack. Speak to
36
-
your storage vendor for information on how to do this.
39
+
your storage vendor for instructions.
37
40
38
-
You can check back on the value of the utilization metric to confirm that it has returned below 80%.
41
+
You can confirm that utilization is reduced by checking the metric again.
39
42
40
-
Note than any volume deletions may take up to 24 hours to eradicate from the appliance, and that
43
+
Note that any volume deletions may take up to 24 hours to be eradicated from the appliance, and that
41
44
any deletions should be carried out slowly to avoid worsening the problem.
42
45
43
46
## Active Alerts
44
47
45
-
This will have an "Availability Impacting Reason" of:
48
+
The health events listed in the text below indicate that the appliance has active alerts:
46
49
47
-
- `StorageApplianceActiveAlertsWarning`, which means there are 1 or more open warning alerts on the
48
-
Storage Appliance. This means there is an issue which needs resolving, but the Storage Appliance
49
-
should continue to function.
50
-
- `StorageApplianceActiveAlertsCritical`, which means there are 1 or more open critical alerts on the
51
-
Storage Appliance. This implies a severe problem with the Storage Appliance.
50
+
- `StorageApplianceActiveAlertsWarning`, which means there are one or more open warning alerts on the
51
+
Storage Appliance. Warning alerts indicate that there is an issue that requires attention, but the Storage
52
+
Appliance should continue to function.
53
+
- `StorageApplianceActiveAlertsCritical`, which means there are one or more open critical alerts on the
54
+
Storage Appliance. Critical alerts indicate a severe problem with the Storage Appliance.
52
55
53
-
You should find more details of the specific alert(s), and from that determine whether you need to take
54
-
an action yourself (such as re-seating a cable), raise a ticket with your storage vendor, or raise a
55
-
ticket with Microsoft.
56
+
Find more details of the specific alerts by using the following instructions:
56
57
57
58
- If you have your Storage Appliance set up to send logs to a Log Analytics Workspace (LAW), you can gather
58
-
more details by running the below query in your LAW.
59
+
more details by running the query from the text block below in your LAW.
59
60
```
60
61
StorageApplianceAlerts
61
62
| where TIMESTAMP > <start time>
62
63
```
63
-
This will give you more details of the alert, and may also provide a link to a specific troubleshooting
64
+
This log will give you more details of the alert, and may also provide a link to a specific troubleshooting
64
65
article from your storage vendor.
65
66
- If you do not have log streaming to a LAW set up, you can still get the details by navigating to the
66
-
Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab and selecting
67
+
Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab, and selecting
67
68
`Nexus Storage Alerts Open` from the `Metric` dropdown. Then, you should click `Apply splitting` and
68
-
select all of the boxes. You will then see a summary of the alert, as well as the vendor alert code. You
69
-
can use this information to search your vendor documentation for further details of the alert.
69
+
select all of the boxes. You will see a summary of the alert, and the vendor alert code. Use this information
70
+
to search your vendor documentation for further details of the alert.
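For the Log Analytics option, the `<start time>` placeholder in the query needs a concrete value. The following is a minimal sketch of one way to fill it in, assuming a KQL `datetime(...)` literal in ISO 8601 form. The helper name `build_alerts_query` is hypothetical; only the table name `StorageApplianceAlerts` and the `TIMESTAMP` column come from the query above.

```python
from datetime import datetime, timezone


def build_alerts_query(start_time: datetime) -> str:
    """Build the alerts query with a concrete UTC start time (hypothetical helper)."""
    if start_time.tzinfo is None:
        raise ValueError("start_time must be timezone-aware")
    # Normalize to UTC and format as an ISO 8601 timestamp for the KQL datetime literal.
    literal = start_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"StorageApplianceAlerts\n| where TIMESTAMP > datetime({literal})"
```

Choosing a start time shortly before the health event began keeps the result set small while still capturing the alert that triggered it.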
70
71
71
72
:::image type="content" source="media/storage-metrics-alerts.png" alt-text="Metric showing an active alert on a Storage Appliance":::
72
73
73
-
Once you have this information, you should be able to tell if you can fix the issue yourself, or if
74
-
you need to raise a ticket with your Storage Appliance vendor or with us. If you need to raise a
75
-
ticket with us, please include the Storage Appliance name and "Availability Impacting Reason" for
76
-
quicker issue triage.
74
+
Once you have this information, use it to determine the appropriate next action. You should either:
75
+
76
+
- Take an action yourself (such as reseating a cable).
77
+
- Raise a ticket with your storage vendor.
78
+
- Raise a ticket with Microsoft. If you need to raise a ticket with us, please include the Storage Appliance
79
+
name and the details of the health event for quicker issue triage.
77
80
78
81
## Latency
79
82
80
-
This will have an "Availability Impacting Reason" of:
83
+
The health events listed in the text below indicate that the appliance has high latency:
81
84
82
85
- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance
83
-
has exceeded 3ms.
86
+
exceeds 3 ms.
84
87
- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance
85
-
has exceeded 100ms.
88
+
exceeds 100 ms.
86
89
87
-
The expected latency for Pure X-series is 1ms or less.
90
+
The expected latency for Pure X-series is 1 ms or less.
88
91
89
-
Latency issues could be caused by an issue with the appliance, or high load. First, check if high load
92
+
The root cause of high latency could be an issue with the appliance, or high load. First, check if high load
90
93
is the cause:
91
94
92
95
- Navigate to the Storage Appliance on the portal.
93
96
- Navigate to the `Monitoring > Metrics` tab.
94
-
- Select the `Nexus Storage Array Latency` metric, and click `Apply splitting`, selecting `Dimension` as
97
+
- Select the `Nexus Storage Array Latency` metric. Click `Apply splitting`, and select `Dimension` as
95
98
the dimension to split on.
96
99
- Click `+ New Chart`, and select the `Nexus Storage Array Performance Throughput Iops (Avg)` metric.
97
100
Click `Apply splitting`, and select `Dimension` as the dimension to split on.
98
101
99
102
:::image type="content" source="media/storage-metrics-latency-throughput.png" alt-text="Metric showing the latency and throughput on a Storage Appliance":::
100
103
101
-
By comparing the resulting graphs, you can determine whether high load is the cause. If so, reducing the
102
-
load will resolve the health event.
104
+
By comparing the resulting graphs, you can determine whether high load is the cause. If so, reduce the
105
+
load to resolve the health event.
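If the two metrics are exported as time-aligned series, the visual comparison can be complemented with a simple correlation check. This is a rough sketch under stated assumptions, not an official diagnostic: the function names and the `0.7` cut-off are hypothetical, and a strong positive correlation only suggests, rather than proves, that load is driving latency.

```python
from math import sqrt
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def load_is_plausible_cause(
    latency_ms: Sequence[float],
    iops: Sequence[float],
    min_correlation: float = 0.7,  # hypothetical cut-off, not from the source
) -> bool:
    """True when latency rises and falls together with throughput."""
    return pearson_correlation(latency_ms, iops) >= min_correlation
```

Latency that stays high while IOPS drops gives a low or negative correlation, which points away from load and towards an issue with the appliance itself.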
103
106
104
-
If you have ruled out high load, you should raise a ticket with your Storage Appliance vendor.
107
+
If the issue is _not_ high load, you should raise a ticket with your Storage Appliance vendor.
105
108
106
109
## Network Interface Errors
107
110
108
-
This will have an "Availability Impacting Reason" of:
111
+
The health event listed in the text below indicates that the appliance has network interface errors:
109
112
110
113
- `StorageApplianceNetworkErrorsDegraded`, which means the average rate of network interface errors
111
-
on one or more interfaces has exceeded 3%. This implies an issue with the network interface(s).
114
+
on one or more interfaces has exceeded 3%.
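The condition above can be sketched as a filter over per-interface error rates. This is illustrative only: the function name, the input shape, and the interface names in the usage note are hypothetical. Only the 3% threshold comes from the text.

```python
from typing import List, Mapping

# Threshold taken from the text above; the constant name is hypothetical.
ERROR_RATE_THRESHOLD_PERCENT = 3.0


def interfaces_over_threshold(
    avg_error_rate_percent: Mapping[str, float],
) -> List[str]:
    """Return the names of interfaces whose average error rate exceeds 3%."""
    return sorted(
        name
        for name, rate in avg_error_rate_percent.items()
        if rate > ERROR_RATE_THRESHOLD_PERCENT  # "exceeded" is read as strictly greater
    )
```

For example, `interfaces_over_threshold({"ct0.eth0": 0.5, "ct1.eth2": 4.2})` returns only the second interface, which is the one to include in a vendor ticket.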
112
115
113
-
To determine the unhealthy network interface(s), as well as the distribution of the errors, navigate
116
+
To determine the unhealthy network interfaces, as well as the distribution of the errors, navigate
114
117
to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select
115
118
`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then, you should click
116
119
`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a time range
117
-
which starts shortly before the start time of the resource health alert. Once you have identified the
120
+
that starts shortly before the start time of the resource health alert. After identifying the
118
121
unhealthy network interfaces and error types, you should raise a ticket with your Storage Appliance
119
122
vendor.
120
123
121
-
:::image type="content" source="media/storage-metrics-network-error.png" alt-text="Metric showing network interface errors on a Storage Appliance":::
124
+
:::image type="content" source="media/storage-metrics-network-errors.png" alt-text="Metric showing network interface errors on a Storage Appliance":::
122
125
123
126
## Network Latency
124
127
125
-
This will have an "Availability Impacting Reason" of:
128
+
The health events listed in the text below indicate that the appliance has high networking latency:
126
129
127
130
- `StorageApplianceNetworkLatencyDegraded`, which means the latency between the initiator and the Storage
128
-
Appliance has exceeded 25ms.
131
+
Appliance exceeds 25 ms.
129
132
- `StorageApplianceNetworkLatencyUnavailable`, which means the latency between the initiator and the Storage
130
-
Appliance has exceeded 100ms.
133
+
Appliance exceeds 100 ms.
131
134
132
135
This increased latency implies an underlying problem with the networking between the Bare Metal Machines
133
-
(BMMs) and the Storage Appliance. As this can result from any of the hops between BMMs and Storage Appliance,
134
-
you should raise a ticket with Microsoft, quoting the availability impacting reason and the text of this
135
-
troubleshooting guide (TSG).
136
+
(BMMs) and the Storage Appliance. Latency can be introduced on any of the hops between the BMMs and the Storage Appliance.
137
+
You should raise a ticket with Microsoft, quoting the text of this troubleshooting guide (TSG).