 description: Learn how to troubleshoot accepted Cluster resources.
 author: matternst7258
 ms.author: matthewernst
 ms.service: azure-operator-nexus
@@ -12,23 +12,24 @@ ms.date: 10/30/2024

 # Troubleshoot accepted Cluster resources

-Operator Nexus relies on mirroring, or hydrating, resources from the on-premises cluster to Azure. When this process is interrupted, the Cluster resource can move to `Accepted` state.
+Operator Nexus relies on mirroring, or hydrating, resources from the on-premises cluster to Azure. When this process is interrupted, the Cluster resource can move to the `Accepted` state.

 ## Diagnosis

-The Cluster status is viewed via the Azure portal or via Azure CLI.
+The Cluster status is viewed via the Azure portal or the Azure CLI.

 ```bash
 az networkcloud cluster show --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
 ```

 ## Mitigation steps

-### Triggering the resource sync
+Follow these steps for mitigation.

+### Trigger the resource sync

 1. From the Cluster resource page in the Azure portal, add a tag to the Cluster resource.
-2. The resource moves out of the `Accepted` state.
+1. The resource moves out of the `Accepted` state.

 ```bash
 az login
@@ -38,17 +39,16 @@ az resource tag --tags exampleTag=exampleValue --name <CLUSTER> --resource-group

 ## Verification

-After the tag is applied, the Cluster moves to `Running` state.
+After the tag is applied, the Cluster moves to the `Running` state.

 ```bash
 az networkcloud cluster show --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
 ```
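
The verification step can also be scripted. The following is a minimal sketch, not part of the article under review: it assumes the `Running` value is reported in a `detailedStatus` field of the `az networkcloud cluster show` output (check the field name against the JSON the command returns). The function name `wait_for_running`, the `TIMEOUT_SECONDS` and `POLL_SECONDS` variables, and the 30-second poll interval are illustrative. The 300-second default mirrors the five-minute guidance in this section.

```bash
# Hypothetical helper: poll the Cluster until it reports Running, or give up
# after TIMEOUT_SECONDS (default 300, matching the five-minute guidance).
# Assumption: the state is exposed as "detailedStatus" in the CLI output.
wait_for_running() {
  wfr_resource_group="$1"
  wfr_cluster_name="$2"
  wfr_timeout="${TIMEOUT_SECONDS:-300}"
  wfr_interval="${POLL_SECONDS:-30}"
  wfr_elapsed=0

  while [ "$wfr_elapsed" -le "$wfr_timeout" ]; do
    # --query and --output are global Azure CLI arguments.
    wfr_status="$(az networkcloud cluster show \
      --resource-group "$wfr_resource_group" \
      --name "$wfr_cluster_name" \
      --query detailedStatus \
      --output tsv)"
    if [ "$wfr_status" = "Running" ]; then
      echo "Cluster is Running"
      return 0
    fi
    echo "Current status: ${wfr_status:-unknown}; retrying in ${wfr_interval}s" >&2
    sleep "$wfr_interval"
    wfr_elapsed=$((wfr_elapsed + wfr_interval))
  done

  echo "Cluster did not reach Running within ${wfr_timeout}s; contact Microsoft support" >&2
  return 1
}
```

A call such as `wait_for_running <RESOURCE_GROUP> <CLUSTER_NAME>` returns a nonzero exit code when the time limit passes, which makes it usable as a gate in a larger script.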

-If the Cluster resource maintains the state after a period of time, more than 5 minutes, contact Microsoft support.
+If the Cluster resource maintains the state after more than five minutes, contact Microsoft support.

-## Further information
+## Related content

-Learn more about how resources are hydrated with [Azure Arc-enabled Kubernetes](/azure/azure-arc/kubernetes/overview).
-
-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+- For more information about how resources are hydrated, see [Azure Arc-enabled Kubernetes](/azure/azure-arc/kubernetes/overview).
+- If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+- For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
-3. Access the iDRAC GUI using the IP in your browser to shut down impacted management servers
+1. Access the integrated iDRAC graphical user interface (GUI) by using the IP in your browser to shut down affected management servers.

-:::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::
+:::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot that shows an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::

-4. When all impacted management servers are down, turn on the servers using the iDRAC GUI
+1. When all affected management servers are down, turn on the servers by using the iDRAC GUI.

-:::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot of an iDRAC GUI and the button to perform power on command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::
+:::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot that shows an iDRAC GUI and the button to perform the power command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::

-5. The servers should now be restored.
+The servers should now be restored.

-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
articles/operator-nexus/troubleshoot-hardware-validation-failure.md (5 additions & 5 deletions)
@@ -74,7 +74,7 @@ This section discusses troubleshooting for problems you might encounter.
 * To troubleshoot a memory problem, contact the vendor.

 * CPU-related failure (`cpu_sockets`)
-* CPU specs are defined in the SKU. A failed `cpu_sockets` check indicates a failed CPU or CPU count mismatch. The following example shows a failed CPU check.
+* CPU specs are defined in the version. A failed `cpu_sockets` check indicates a failed CPU or CPU count mismatch. The following example shows a failed CPU check.

 ```yaml
 {
@@ -521,11 +521,11 @@ This section discusses troubleshooting for problems you might encounter.
@@ -696,5 +696,5 @@ This section discusses troubleshooting for problems you might encounter.

 After the hardware is fixed, run the BMM `replace` action by following the instructions in [Manage the lifecycle of bare metal machines](howto-baremetal-functions.md).

-If you still have questions, [contact Support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
-description: Checking LACP Bonding on Physical Hosts.
+description: Learn how to check LACP bonding on physical hosts.
 author: keithritchie73
 ms.author: keithritchie
 ms.service: azure-operator-nexus
@@ -9,44 +9,44 @@ ms.topic: troubleshooting
 ms.date: 11/15/2024
 ---

-# Checking LACP Bonding on Physical Hosts
+# Check LACP bonding on physical hosts

-On physical host startup, the two Mellanox cards are LACP bonded to a pair of Arista switches. If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or loadbalancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic and is due to the hashing/loadbalancing nature of LACP.
+On physical host startup, the two Mellanox cards are bonded to a pair of Arista switches by the Link Aggregation Control Protocol (LACP). If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or load-balancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic. They occur because of the hashing/load-balancing nature of LACP.

 ## Diagnosis

-If, LACP isn't negotiated correctly traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a vm that can't get on the network, or even oam/storage outages.
+If LACP isn't negotiated correctly, traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a virtual machine that can't get on the network, or even as oam/storage outages.

-## Checking LACP Bonding
+## Check LACP bonding

-To check the LACP bonding status on a physical host run the following command. For control plane hosts, use file 8a_pf_bond as there's only one Mellanox card on those hosts. For worker hosts, use either 4b_pf_bond or 98_pf_bond to check its two cards.
+To check the LACP bonding status on a physical host, run the following command. For control plane hosts, use file `8a_pf_bond` because there's only one Mellanox card on those hosts. For worker hosts, use either `4b_pf_bond` or `98_pf_bond` to check two cards.

 ```bash
 # cat /proc/net/bonding/8a_pf_bond
 ```

-### Interpreting the results
+### Interpret the results

-Key validations to check in the /proc/net/bonding/ output are:
+Key validations to check in the `/proc/net/bonding/` output are:

-For Bond level (the top part):
+For the bond level (the top part):

-1.MII Status: up - Is the entire bond up
-2.LACP active: on - Is LACP active
-3.Aggregator ID: 1 - The toplevel aggregator ID should match both replicas. See each port for its aggregator ID.
-4.System MAC address: 42:56:86:9c:81:89 - Is there a System MAC defined. If a bond isn't negotiated this will be undefined or all zeros, e.g 00:00:00:00:00:00
+- **MII status**: Up. Is the entire bond up?
+- **LACP active**: On. Is LACP active?
+- **Aggregator ID**: 1. The top-level aggregator ID should match both replicas. See each port for its aggregator ID.
+- **System MAC address**: 42:56:86:9c:81:89. Is there a System MAC defined? If a bond isn't negotiated, it's undefined or all zeros, for example, 00:00:00:00:00:00.
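
The bond-level checks in this list can be automated. The following is a rough sketch, not part of the article under review. The function name `check_bond` is hypothetical, and the sketch assumes that the per-port blocks in the bonding file start with a `Slave Interface:` line and that the field labels match the ones quoted in the list (`MII Status: up`, `LACP active: on`, `Aggregator ID:`, `System MAC address:`). Verify both assumptions against the output on your hosts.

```bash
# Hypothetical helper: run the bond-level checks against a bonding status file.
# Usage: check_bond /proc/net/bonding/8a_pf_bond
check_bond() {
  cb_file="$1"
  cb_failures=0

  # Bond-level section: everything before the first per-port block.
  # Assumption: per-port blocks start with a "Slave Interface:" line.
  cb_top="$(sed '/^Slave Interface:/,$d' "$cb_file")"

  if ! printf '%s\n' "$cb_top" | grep -q '^[[:space:]]*MII Status: up'; then
    echo "FAIL: bond MII status is not up"
    cb_failures=$((cb_failures + 1))
  fi

  if ! printf '%s\n' "$cb_top" | grep -q '^[[:space:]]*LACP active: on'; then
    echo "FAIL: LACP is not active"
    cb_failures=$((cb_failures + 1))
  fi

  # An unnegotiated bond reports no system MAC or all zeros.
  cb_mac="$(printf '%s\n' "$cb_top" |
    sed -n 's/^[[:space:]]*System MAC address:[[:space:]]*//p' | head -n 1)"
  if [ -z "$cb_mac" ] || [ "$cb_mac" = "00:00:00:00:00:00" ]; then
    echo "FAIL: system MAC address is undefined or all zeros"
    cb_failures=$((cb_failures + 1))
  fi

  # The top-level aggregator ID and every port's aggregator ID should match,
  # so the file should contain exactly one distinct value.
  cb_ids="$(grep 'Aggregator ID:' "$cb_file" | awk '{print $NF}' | sort -u | wc -l | tr -d ' ')"
  if [ "$cb_ids" -ne 1 ]; then
    echo "FAIL: aggregator IDs do not match"
    cb_failures=$((cb_failures + 1))
  fi

  [ "$cb_failures" -eq 0 ]
}
```

The function prints one line per failed check and returns a nonzero exit code if any check fails, so it can be used as `check_bond /proc/net/bonding/8a_pf_bond && echo "bond looks healthy"`.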

 For each port:

-1.MII Status: up - Is the interface up
-2.Aggregator ID: 1 - Both replicas should have the same aggregator ID
-3. details partner lacp pdu: port state 61 - The value is a bit mask that represents the LACP negotiation state on that port. Generally 61 and 63 are what we want. [See](https://movingpackets.net/2017/10/17/decoding-lacp-port-state)
+- **MII status**: Up. Is the interface up?
+- **Aggregator ID**: 1. Both replicas should have the same aggregator ID.
+- **Details partner LACP protocol data unit (PDU)**: Port state 61. The value is a bit mask that represents the LACP negotiation state on that port. Generally, 61 and 63 are what we want. For more information, see [Decoding LACP Port State](https://movingpackets.net/2017/10/17/decoding-lacp-port-state).
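
To see why 61 and 63 are the expected port state values, it helps to expand the bit mask. The following sketch is an illustration, not part of the article under review. The function name `decode_lacp_state` is hypothetical, and the flag names are lowercase labels for the bits of the LACP state field as defined in IEEE 802.1AX (bit 0 is activity, bit 7 is expired).

```bash
# Hypothetical helper: decode an LACP port state value into its flag names.
# Bit order follows the LACP state field in IEEE 802.1AX, lowest bit first.
decode_lacp_state() {
  dls_state="$1"
  dls_flags=""
  dls_bit=0
  for dls_name in activity timeout aggregation synchronization \
                  collecting distributing defaulted expired; do
    # Test whether the bit at position dls_bit is set.
    if [ $(( (dls_state >> dls_bit) & 1 )) -eq 1 ]; then
      dls_flags="${dls_flags}${dls_flags:+ }${dls_name}"
    fi
    dls_bit=$((dls_bit + 1))
  done
  echo "$dls_flags"
}

decode_lacp_state 61   # activity aggregation synchronization collecting distributing
decode_lacp_state 63   # activity timeout aggregation synchronization collecting distributing
```

Both values have the synchronization, collecting, and distributing bits set, which is what a fully negotiated port reports. The only difference is the timeout bit, which is set in 63 (short timeout) and clear in 61 (long timeout). A value with the defaulted or expired bit set suggests that the port isn't receiving LACP PDUs from its partner.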

-### Fixing the issue
+### Fix the issue

-The most common causes for these LACP issues are host/switch miswiring or mismatched LACP/MLAG configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, then determine if the switch LACP/MLAG configuration is incorrect.
+The most common causes for these LACP issues are host or switch miswiring or mismatched LACP/Multi-Chassis Link Aggregation (MLAG) configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, determine if the switch LACP/MLAG configuration is incorrect.

-## Further information
+## Related content

-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+- If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+- For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
 description: Learn how to troubleshoot Kubernetes container limits.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
@@ -11,23 +11,25 @@ author: matternst7258

 # Troubleshoot container memory limits

-## Alerting for memory limits
+Learn about troubleshooting for container memory limits in this article.

-It's recommended to have alerts set up for the Operator Nexus cluster to look for Kubernetes pods restarting from OOMKill errors. These alerts allow customers to know if a component on a server is working appropriately.
+## Alerts for memory limits

-Metrics exposed to identify memory limits:
+We recommend that you have alerts set up for the Operator Nexus cluster to look for Kubernetes pods that restart from `OOMKill` errors. These alerts let you know if a component on a server is working appropriately.

-| Metric Name | Description |
+The following table lists the metrics that are exposed to identify memory limits.