articles/operator-nexus/troubleshoot-unhealthy-degraded-storage-appliance.md
28 additions & 12 deletions
@@ -24,10 +24,16 @@ You can see the current usage of the appliance by navigating to the Storage Appl
navigating to the `Monitoring > Metrics` tab and selecting `Nexus Storage Array Space Utilization` from
the `Metric` dropdown.

:::image type="content" source="media/storage-metrics-utilization.png" alt-text="Metric showing the percentage utilization of a Storage Appliance":::

These issues can be addressed by reducing the load on the Storage Appliance. This can be achieved by:

- Moving some workloads to another cluster, if one is available and your workload supports this:
  - Re-create the workload on a different cluster (Operator Nexus).
  - Perform the steps required to migrate traffic to the new cluster (these steps depend on your workload).
  - Delete the workload from the current cluster.
- Adding array expansions, if you have empty array expansion spaces in your aggregator rack. Speak to
  your storage vendor for information on how to do this.

You can check back on the value of the utilization metric to confirm that it has returned below 80%.
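If you want to track the utilization value outside the portal, the following Azure CLI sketch shows one possible way to pull the same metric. It isn't part of the documented procedure: the resource ID shape and the programmatic metric name `NexusStorageArraySpaceUtilization` are assumptions, so confirm the real name with `az monitor metrics list-definitions` before relying on it.

```azurecli
# Hypothetical Storage Appliance resource ID; substitute your own values.
RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.NetworkCloud/storageAppliances/<appliance-name>"

# List the metric definitions exposed by the appliance to confirm the exact metric name.
az monitor metrics list-definitions --resource "$RESOURCE_ID" --output table

# Pull hourly averages of the space-utilization metric (metric name assumed).
az monitor metrics list \
  --resource "$RESOURCE_ID" \
  --metric "NexusStorageArraySpaceUtilization" \
  --aggregation Average \
  --interval PT1H \
  --output table
```

Values that stay above 80% after you reduce the load indicate the appliance is still overloaded.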
@@ -62,6 +68,8 @@ ticket with Microsoft.
select all of the boxes. You will then see a summary of the alert, as well as the vendor alert code. You
can use this information to search your vendor documentation for further details of the alert.

:::image type="content" source="media/storage-metrics-alerts.png" alt-text="Metric showing an active alert on a Storage Appliance":::
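As a command-line alternative to the portal view above, a sketch like the following can surface the same alert metric with its dimensions. The metric name `NexusStorageAlerts` and the dimension names in the filter are illustrative assumptions; check the `Metric` dropdown and the `az monitor metrics list-definitions` output for the actual names.

```azurecli
RESOURCE_ID="<storage-appliance-resource-id>"  # same placeholder as in the earlier sketch

# Split the (assumed) alerts metric across all values of its dimensions so that
# each active alert shows up as its own series, including the vendor alert code.
az monitor metrics list \
  --resource "$RESOURCE_ID" \
  --metric "NexusStorageAlerts" \
  --aggregation Count \
  --filter "Summary eq '*' and VendorAlertCode eq '*'" \
  --output table
```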

Once you have this information, you should be able to tell if you can fix the issue yourself, or if
you need to raise a ticket with your Storage Appliance vendor or with us. If you need to raise a
ticket with us, please include the Storage Appliance name and "Availability Impacting Reason" for
@@ -72,19 +80,25 @@ quicker issue triage.
This will have an "Availability Impacting Reason" of:

- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance
  has exceeded 3ms.
- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance
  has exceeded 100ms.

The expected latency for Pure X-series is 1ms or less.

Latency issues could be caused by an issue with the appliance, or by high load. First, check whether high load
is the cause:

- Navigate to the Storage Appliance on the portal.
- Navigate to the `Monitoring > Metrics` tab.
- Select the `Nexus Storage Array Latency` metric, and click `Apply splitting`, selecting `Dimension` as
  the dimension to split on.
- Click `+ New Chart`, and select the `Nexus Storage Array Performance Throughput Iops (Avg)` metric.
  Click `Apply splitting`, and select `Dimension` as the dimension to split on.

:::image type="content" source="media/storage-metrics-latency-throughput.png" alt-text="Metric showing the latency and throughput on a Storage Appliance":::

By comparing the resulting graphs, you can determine whether high load is the cause. If so, reducing the
load will resolve the health event.
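If you'd rather make this comparison from the CLI, a rough equivalent is sketched below. The programmatic metric names are assumptions mirroring the portal display names, and the time window is an example; adjust it so it starts shortly before the health event appeared.

```azurecli
RESOURCE_ID="<storage-appliance-resource-id>"  # same placeholder as in the earlier sketches
START="2024-01-01T00:00:00Z"                   # example: shortly before the health event appeared
END="2024-01-01T06:00:00Z"

# Pull latency and IOPS side by side, split on the 'Dimension' dimension,
# so the per-series values can be compared like the two portal charts.
for METRIC in NexusStorageArrayLatency NexusStorageArrayPerformanceThroughputIops; do
  az monitor metrics list \
    --resource "$RESOURCE_ID" \
    --metric "$METRIC" \
    --aggregation Average \
    --start-time "$START" \
    --end-time "$END" \
    --interval PT5M \
    --filter "Dimension eq '*'" \
    --output table
done
```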
If you have ruled out high load, you should raise a ticket with your Storage Appliance vendor.
@@ -99,11 +113,13 @@ This will have an "Availability Impacting Reason" of:
To determine the unhealthy network interface(s), as well as the distribution of the errors, navigate
to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab, and select
`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then click
`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a time range
which starts shortly before the start time of the resource health alert. Once you have identified the
unhealthy network interface(s) and error types, you should raise a ticket with your Storage Appliance
vendor.

:::image type="content" source="media/storage-metrics-network-error.png" alt-text="Metric showing network interface errors on a Storage Appliance":::
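The same data can be retrieved from the CLI if that's easier to attach to a vendor ticket. This is a sketch with an assumed programmatic metric name and example timestamps, split on the `Dimension` and `Name` dimensions as in the portal steps above.

```azurecli
RESOURCE_ID="<storage-appliance-resource-id>"  # same placeholder as in the earlier sketches

# Total network interface errors, split by error type (Dimension) and interface (Name),
# over a window starting shortly before the resource health alert (times are examples).
az monitor metrics list \
  --resource "$RESOURCE_ID" \
  --metric "NexusStorageNetworkInterfacePerformanceErrors" \
  --aggregation Total \
  --start-time "2024-01-01T00:00:00Z" \
  --end-time "2024-01-01T12:00:00Z" \
  --interval PT15M \
  --filter "Dimension eq '*' and Name eq '*'" \
  --output table
```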

## Network Latency

This will have an "Availability Impacting Reason" of:
@@ -116,4 +132,4 @@ This will have an "Availability Impacting Reason" of:
This increased latency implies an underlying problem with the networking between the Bare Metal Machines
(BMMs) and the Storage Appliance. As this can result from any of the hops between BMMs and Storage Appliance,
you should raise a ticket with Microsoft, quoting the availability impacting reason and the text of this