Commit 804f96a

Mark-ups - CSS feedback
1 parent eab1696 commit 804f96a

6 files changed: 49 additions and 34 deletions

articles/operator-nexus/TOC.yml

Lines changed: 21 additions & 22 deletions
@@ -362,22 +362,6 @@
362362
- name: Troubleshooting
363363
expanded: true
364364
items:
365-
- name: Resource Health
366-
expanded: false
367-
items:
368-
- name: Troubleshoot Unhealthy or Degraded Storage Appliance
369-
href: troubleshoot-unhealthy-degraded-storage-appliance.md
370-
- name: Network Fabric
371-
expanded: false
372-
items:
373-
- name: Troubleshoot Isolation Domain
374-
href: troubleshoot-isolation-domain.md
375-
- name: Troubleshoot LACP Bonding
376-
href: troubleshoot-lacp-bonding.md
377-
- name: Troubleshoot DNS Issues
378-
href: troubleshoot-dns-issues.md
379-
- name: Troubleshoot TWAMP (UDP) not working
380-
href: troubleshoot-twamp-udp-not-working.md
381365
- name: Cluster or BMM
382366
expanded: false
383367
items:
@@ -397,6 +381,27 @@
397381
href: troubleshoot-accepted-cluster-hydration.md
398382
- name: Troubleshoot Out of Memory Pods
399383
href: troubleshoot-memory-limits.md
384+
- name: Network Fabric
385+
expanded: false
386+
items:
387+
- name: Troubleshoot Isolation Domain
388+
href: troubleshoot-isolation-domain.md
389+
- name: Troubleshoot LACP Bonding
390+
href: troubleshoot-lacp-bonding.md
391+
- name: Troubleshoot DNS Issues
392+
href: troubleshoot-dns-issues.md
393+
- name: Troubleshoot TWAMP (UDP) not working
394+
href: troubleshoot-twamp-udp-not-working.md
395+
- name: Resource Health
396+
expanded: false
397+
items:
398+
- name: Troubleshoot Unhealthy or Degraded Storage Appliance
399+
href: troubleshoot-unhealthy-degraded-storage-appliance.md
400+
- name: Storage Appliance
401+
expanded: false
402+
items:
403+
- name: Troubleshoot Multiple Storage appliances
404+
href: troubleshoot-multiple-storage-appliances.md
400405
- name: Tenant Workload
401406
expanded: false
402407
items:
@@ -428,12 +433,6 @@
428433
items:
429434
- name: Due To Bare Metal Machine Power Failure
430435
href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
431-
- name: Storage Appliance
432-
expanded: false
433-
items:
434-
- name: Troubleshoot Multiple Storage appliances
435-
href: troubleshoot-multiple-storage-appliances.md
436-
437436
- name: FAQ
438437
href: azure-operator-nexus-faq.md
439438
- name: Reference
4 binary files changed (60.4 KB, 215 KB, 118 KB, 68.3 KB); previews not shown.

articles/operator-nexus/troubleshoot-unhealthy-degraded-storage-appliance.md

Lines changed: 28 additions & 12 deletions
@@ -24,10 +24,16 @@ You can see the current usage of the appliance by navigating to the Storage Appl
2424
navigating to the `Monitoring > Metrics` tab and selecting `Nexus Storage Array Space Utilization` from
2525
the `Metric` dropdown.
2626

27+
:::image type="content" source="media/storage-metrics-utilization.png" alt-text="Metric showing the percentage utilization of a Storage Appliance":::
28+
2729
These issues can be addressed by reducing the load on the Storage Appliance. This can be achieved by:
2830

29-
- Moving some workloads to another cluster if one is available.
30-
- Activating array expansions, if those are available and unused.
31+
- Moving some workloads to another cluster if one is available, and your workload supports this.
32+
- Re-create the workload on a different cluster (Operator Nexus).
33+
- Perform steps required to migrate traffic to the new cluster (this will depend on your workload).
34+
- Delete the workload from the current cluster.
35+
- Adding array expansions, if you have empty array expansion spaces in your aggregator rack. Speak to
36+
your storage vendor for information on how to do this.
3137

3238
You can check back on the value of the utilization metric to confirm that it has returned below 80%.
3339

@@ -62,6 +68,8 @@ ticket with Microsoft.
6268
select all of the boxes. You will then see a summary of the alert, as well as the vendor alert code. You
6369
can use this information to search your vendor documentation for further details of the alert.
6470

71+
:::image type="content" source="media/storage-metrics-alerts.png" alt-text="Metric showing an active alert on a Storage Appliance":::
72+
6573
Once you have this information, you should be able to tell if you can fix the issue yourself, or if
6674
you need to raise a ticket with your Storage Appliance vendor or with us. If you need to raise a
6775
ticket with us, please include the Storage Appliance name and "Availability Impacting Reason" for
@@ -72,19 +80,25 @@ quicker issue triage.
7280
This will have an "Availability Impacting Reason" of:
7381

7482
- `StorageApplianceLatencyDegraded`, which means the self-reported latency of the Storage Appliance
75-
has exceeded 1.2ms.
83+
has exceeded 3ms.
7684
- `StorageApplianceLatencyUnavailable`, which means the self-reported latency of the Storage Appliance
7785
has exceeded 100ms.
7886

79-
<!-- TODO: needs an update after the new threshold is set (and the new threshold may need to depend on type) -->
87+
The expected latency for Pure X-series is 1ms or less.
88+
89+
Latency issues could be caused by an issue with the appliance, or high load. First, check if high load
90+
is the cause:
8091

81-
The expected latency is 1ms or less.
92+
- Navigate to the Storage Appliance on the portal.
93+
- Navigate to the `Monitoring > Metrics` tab.
94+
- Select the `Nexus Storage Array Latency` metric, and click `Apply splitting`, selecting `Dimension` as
95+
the dimension to split on.
96+
- Click `+ New Chart`, and select the `Nexus Storage Array Performance Throughput Iops (Avg)` metric.
97+
Click `Apply Splitting`, and select `Dimension` as the dimension to split on
8298

83-
Latency issues could be caused by an issue with the appliance, or high load. First, check for high
84-
load by navigating to the Storage Appliance on the portal, navigating to the `Monitoring > Metrics` tab
85-
and viewing the `Nexus Storage Array Performance Throughput Iops (Avg)` metric, and the
86-
`Nexus Storage Array Latency` metric on the same chart, starting from shortly before the health event
87-
appeared. You should be able to see from this chart whether high load is the cause. If so, reducing the
99+
:::image type="content" source="media/storage-metrics-latency-throughput.png" alt-text="Metric showing the latency and throughput on a Storage Appliance":::
100+
101+
By comparing the resulting graphs, you can determine whether high load is the cause. If so, reducing the
88102
load will resolve the health event.
89103

90104
If you have ruled out high load, you should raise a ticket with your Storage Appliance vendor.
@@ -99,11 +113,13 @@ This will have an "Availability Impacting Reason" of:
99113
To determine the unhealthy network interface(s), as well as the distribution of the errors, navigate
100114
to the Storage Appliance in the portal, navigate to the `Monitoring > Metrics` tab select
101115
`Nexus Storage Network Interface Performance Errors` in the `Metric` dropdown. Then, you should click
102-
`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a timerange
116+
`Apply splitting`, and select the `Dimension` and `Name` boxes, ensuring that you select a time range
103117
which starts shortly before the start time of the resource health alert. Once you have identified the
104118
unhealthy network interface(s), and error types, you should raise a ticket with your Storage Appliance
105119
vendor.
106120

121+
:::image type="content" source="media/storage-metrics-network-error.png" alt-text="Metric showing network interface errors on a Storage Appliance":::
122+
107123
## Network Latency
108124

109125
This will have an "Availability Impacting Reason" of:
@@ -116,4 +132,4 @@ This will have an "Availability Impacting Reason" of:
116132
This increased latency implies an underlying problem with the networking between the Bare Metal Machines
117133
(BMMs) and the Storage Appliance. As this can result from any of the hops between BMMs and Storage Appliance,
118134
you should raise a ticket with Microsoft, quoting the availability impacting reason and the text of this
119-
TSG.
135+
troubleshooting guide (TSG).
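
Reviewer note: the latency thresholds introduced by this change (3ms for `StorageApplianceLatencyDegraded`, 100ms for `StorageApplianceLatencyUnavailable`) can be summarized in a small sketch. This is only an illustration of the thresholds as worded in the guide, not how the platform computes resource health. The function name `classify_latency` and the constant names are hypothetical, and treating "exceeded" as a strict greater-than comparison is an assumption.

```python
from typing import Optional

# Thresholds taken from the updated guide text (self-reported latency, in ms).
DEGRADED_LATENCY_MS = 3.0
UNAVAILABLE_LATENCY_MS = 100.0


def classify_latency(latency_ms: float) -> Optional[str]:
    """Return the "Availability Impacting Reason" implied by a latency value.

    Hypothetical helper: mirrors the wording of the guide and assumes that
    "exceeded" means strictly greater than the threshold.
    """
    if latency_ms > UNAVAILABLE_LATENCY_MS:
        return "StorageApplianceLatencyUnavailable"
    if latency_ms > DEGRADED_LATENCY_MS:
        return "StorageApplianceLatencyDegraded"
    # At or below the degraded threshold: no latency health event expected.
    return None


if __name__ == "__main__":
    for sample in (0.8, 3.5, 150.0):
        print(sample, classify_latency(sample))
```

The higher threshold is checked first so that a value above 100ms is reported as unavailable rather than degraded.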
