Commit a588ef6

edit pass: azure-operator-nexus-cluster-and-bmm
1 parent 3e5de75 commit a588ef6

5 files changed: +93 -93 lines changed
Lines changed: 12 additions & 12 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Azure Operator Nexus: Accepted Cluster"
-description: Troubleshoot accepted Cluster resource.
+description: Learn how to troubleshoot accepted Cluster resources.
 author: matternst7258
 ms.author: matthewernst
 ms.service: azure-operator-nexus
@@ -12,23 +12,24 @@ ms.date: 10/30/2024

 # Troubleshoot accepted Cluster resources

-Operator Nexus relies on mirroring, or hydrating, resources from the on-premises cluster to Azure. When this process is interrupted, the Cluster resource can move to `Accepted` state.
+Operator Nexus relies on mirroring, or hydrating, resources from the on-premises cluster to Azure. When this process is interrupted, the Cluster resource can move to the `Accepted` state.

 ## Diagnosis

-The Cluster status is viewed via the Azure portal or via Azure CLI.
+The Cluster status is viewed via the Azure portal or the Azure CLI.

 ```bash
 az networkcloud cluster show --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
 ```

 ## Mitigation steps

-### Triggering the resource sync
+Follow these steps for mitigation.

+### Trigger the resource sync

 1. From the Cluster resource page in the Azure portal, add a tag to the Cluster resource.
-2. The resource moves out of the `Accepted` state.
+1. The resource moves out of the `Accepted` state.

 ```bash
 az login
@@ -38,17 +39,16 @@ az resource tag --tags exampleTag=exampleValue --name <CLUSTER> --resource-group

 ## Verification

-After the tag is applied, the Cluster moves to `Running` state.
+After the tag is applied, the Cluster moves to the `Running` state.

 ```bash
 az networkcloud cluster show --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
 ```

-If the Cluster resource maintains the state after a period of time, more than 5 minutes, contact Microsoft support.
+If the Cluster resource maintains the state after more than five minutes, contact Microsoft support.

-## Further information
+## Related content

-Learn more about how resources are hydrated with [Azure Arc-enabled Kubernetes](/azure/azure-arc/kubernetes/overview).
-
-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+- For more information about how resources are hydrated, see [Azure Arc-enabled Kubernetes](/azure/azure-arc/kubernetes/overview).
+- If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+- For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
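
As a quick illustration of the mitigation this file documents, here's a minimal sketch of the tag-and-verify flow. The placeholder values, the tag key and value, and the `Microsoft.NetworkCloud/clusters` resource type are assumptions for illustration.

```bash
# Hedged sketch of the documented mitigation: add a tag to trigger the resource
# sync, then re-check the Cluster. Placeholder values are illustrative.
az login

RG="<RESOURCE_GROUP>"
CLUSTER="<CLUSTER_NAME>"

# Adding or updating any tag nudges the Cluster resource to re-sync.
# The resource type below is an assumption for the Operator Nexus Cluster resource.
az resource tag --tags exampleTag=exampleValue \
  --name "$CLUSTER" \
  --resource-group "$RG" \
  --resource-type "Microsoft.NetworkCloud/clusters"

# Inspect the Cluster again; it should leave the Accepted state within a few minutes.
az networkcloud cluster show --resource-group "$RG" --name "$CLUSTER"
```
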
Lines changed: 38 additions & 40 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot control plane quorum loss
-description: Document how to restore control plane quorum loss
+description: Learn how to restore control plane quorum loss.
 ms.topic: article
 ms.date: 01/18/2024
 author: matthewernst
@@ -10,7 +10,7 @@ ms.service: azure-operator-nexus

 # Troubleshoot control plane quorum loss

-Follow this troubleshooting guide when multiple control plane nodes are offline or unavailable:
+Follow the steps in this troubleshooting article when multiple control plane nodes are offline or unavailable.

 ## Prerequisites

@@ -19,56 +19,54 @@ Follow this troubleshooting guide when multiple control plane nodes are offline
 - Gather the following information:
   - Subscription ID
   - Cluster name and resource group
-  - Bare metal machine name
-- Ensure you're logged using `az login`
-
+  - Bare-metal machine name
+- Ensure that you're signed in by using `az login`.

 ## Symptoms

-- Kubernetes API isn't available
-- Multiple control plane nodes are offline or unavailable
+- The Kubernetes API isn't available.
+- Multiple control plane nodes are offline or unavailable.

 ## Procedure

-1. Identify the Nexus Management Node
-   - To identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>`
-   - Log in to the identified server
-   - Ensure the ironic-conductor service is present on this node using `crictl ps -a |grep -i ironic-conductor`
-   Example output:
+1. Identify the Nexus Management Node:
+   - To identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>`.
+   - Sign in to the identified server.
+   - Ensure that the ironic-conductor service is present on this node by using `crictl ps -a |grep -i ironic-conductor`. Here's example output:

-   ~~~
-   testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
-   <id> <id> 6 hours ago Running ironic-conductor 0 <id>
-   ~~~
+   ~~~
+   testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
+   <id> <id> 6 hours ago Running ironic-conductor 0 <id>
+   ~~~

-2. Determine the iDRAC IP of the server
-   - Run the command `az networkcloud cluster list -g <RG_Name>`
-   - The output of the command is a JSON with the iDRAC IP
+1. Determine the Dell remote access controller (iDRAC) IP of the server:
+   - Run the command `az networkcloud cluster list -g <RG_Name>`.
+   - The output of the command is JSON with the iDRAC IP.

-   ~~~
-   {
-     "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
-     "bmcCredentials": {
-       "username": "<username>"
-     },
-     "bmcMacAddress": "<bmcMacAddress>",
-     "bootMacAddress": "<bootMacAddress",
-     "machineDetails": "extraDetails",
-     "machineName": "<machineName>",
-     "rackSlot": <rackSlot>,
-     "serialNumber": "<serialNumber>"
-   },
-   ~~~
+   ~~~
+   {
+     "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
+     "bmcCredentials": {
+       "username": "<username>"
+     },
+     "bmcMacAddress": "<bmcMacAddress>",
+     "bootMacAddress": "<bootMacAddress",
+     "machineDetails": "extraDetails",
+     "machineName": "<machineName>",
+     "rackSlot": <rackSlot>,
+     "serialNumber": "<serialNumber>"
+   },
+   ~~~

-3. Access the iDRAC GUI using the IP in your browser to shut down impacted management servers
+1. Access the integrated iDRAC graphical user interface (GUI) by using the IP in your browser to shut down affected management servers.

-   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::
+   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot that shows an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::

-4. When all impacted management servers are down, turn on the servers using the iDRAC GUI
+1. When all affected management servers are down, turn on the servers by using the iDRAC GUI.

-   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot of an iDRAC GUI and the button to perform power on command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::
+   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot that shows an iDRAC GUI and the button to perform the power command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::

-5. The servers should now be restored.
+The servers should now be restored.

-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
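
For reference, here's a minimal sketch of the identification steps this file walks through. It assumes the `az` commands run from a workstation and `crictl` runs on the identified server; placeholder names are illustrative.

```bash
# Hedged sketch of the identification steps; placeholder values are illustrative.
RG="<ResourceGroup_Name>"

# 1. List the bare metal machines to find the management nodes.
az networkcloud baremetalmachine list -g "$RG" -o table

# 2. On the identified server, confirm that the ironic-conductor container is present.
sudo crictl ps -a | grep -i ironic-conductor

# 3. Find the iDRAC (BMC) address in the Cluster resource output.
az networkcloud cluster list -g "$RG" -o json | grep -i bmcConnectionString
```
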

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 5 additions & 5 deletions
@@ -74,7 +74,7 @@ This section discusses troubleshooting for problems you might encounter.
   * To troubleshoot a memory problem, contact the vendor.

 * CPU-related failure (`cpu_sockets`)
-  * CPU specs are defined in the SKU. A failed `cpu_sockets` check indicates a failed CPU or CPU count mismatch. The following example shows a failed CPU check.
+  * CPU specs are defined in the version. A failed `cpu_sockets` check indicates a failed CPU or CPU count mismatch. The following example shows a failed CPU check.

   ```yaml
   {
@@ -521,11 +521,11 @@ This section discusses troubleshooting for problems you might encounter.
   ]
   ```

-  * To power a server on in the BMC web UI:
+  * To power on a server in the BMC web UI:

    `BMC` -> `Dashboard` -> `Power On System`

-  * To power a server on with `racadm`:
+  * To power on a server with `racadm`:

   ```bash
   racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD serveraction powerup
@@ -696,5 +696,5 @@ This section discusses troubleshooting for problems you might encounter.

 After the hardware is fixed, run the BMM `replace` action by following the instructions in [Manage the lifecycle of bare metal machines](howto-baremetal-functions.md).

-If you still have questions, [contact Support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
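
To round out the power-on guidance above, here's a small sketch that checks the power state before powering the server on. The `powerstatus` subcommand and the placeholder variables are assumptions; the `powerup` command comes from the doc itself.

```bash
# Hedged sketch: check the current power state, then power the server on.
IP="<iDRAC_IP>"          # placeholder
BMC_USR="<bmc_username>" # placeholder
BMC_PWD="<bmc_password>" # placeholder

# Assumed subcommand: report the current power state.
racadm --nocertwarn -r "$IP" -u "$BMC_USR" -p "$BMC_PWD" serveraction powerstatus

# Power the server on, as documented.
racadm --nocertwarn -r "$IP" -u "$BMC_USR" -p "$BMC_PWD" serveraction powerup
```
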
Lines changed: 21 additions & 21 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Azure Operator Nexus: Networking"
-description: Checking LACP Bonding on Physical Hosts.
+description: Learn how to check LACP bonding on physical hosts.
 author: keithritchie73
 ms.author: keithritchie
 ms.service: azure-operator-nexus
@@ -9,44 +9,44 @@ ms.topic: troubleshooting
 ms.date: 11/15/2024
 ---

-# Checking LACP Bonding on Physical Hosts
+# Check LACP bonding on physical hosts

-On physical host startup, the two Mellanox cards are LACP bonded to a pair of Arista switches. If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or load balancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic and is due to the hashing/load balancing nature of LACP.
+On physical host startup, the two Mellanox cards are bonded to a pair of Arista switches by the Link Aggregation Control Protocol (LACP). If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or load-balancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic. They occur because of the hashing/load-balancing nature of LACP.

 ## Diagnosis

-If, LACP isn't negotiated correctly traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a vm that can't get on the network, or even oam/storage outages.
+If LACP isn't negotiated correctly, traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a virtual machine that can't get on the network, or even as oam/storage outages.

-## Checking LACP Bonding
+## Check LACP bonding

-To check the LACP bonding status on a physical host run the following command. For control plane hosts, use file 8a_pf_bond as there's only one Mellanox card on those hosts. For worker hosts, use either 4b_pf_bond or 98_pf_bond to check its two cards.
+To check the LACP bonding status on a physical host, run the following command. For control plane hosts, use file `8a_pf_bond` because there's only one Mellanox card on those hosts. For worker hosts, use either `4b_pf_bond` or `98_pf_bond` to check two cards.

 ```bash
 # cat /proc/net/bonding/8a_pf_bond
 ```

-### Interpreting the results
+### Interpret the results

-Key validations to check in the /proc/net/bonding/ output are:
+Key validations to check in the `/proc/net/bonding/` output are:

-For Bond level (the top part):
+For the bond level (the top part):

-1. MII Status: up - Is the entire bond up
-2. LACP active: on - Is LACP active
-3. Aggregator ID: 1 - The top level aggregator ID should match both replicas. See each port for its aggregator ID.
-4. System MAC address: 42:56:86:9c:81:89 - Is there a System MAC defined. If a bond isn't negotiated this will be undefined or all zeros, e.g 00:00:00:00:00:00
+- **MII status**: Up. Is the entire bond up?
+- **LACP active**: On. Is LACP active?
+- **Aggregator ID**: 1. The top-level aggregator ID should match both replicas. See each port for its aggregator ID.
+- **System MAC address**: 42:56:86:9c:81:89. Is there a System MAC defined? If a bond isn't negotiated, it's undefined or all zeros, for example, 00:00:00:00:00:00.

 For each port:

-1. MII Status: up - Is the interface up
-2. Aggregator ID: 1 - Both replicas should have the same aggregator ID
-3. details partner lacp pdu: port state 61 - The value is a bit mask that represents the LACP negotiation state on that port. Generally 61 and 63 are what we want. [See](https://movingpackets.net/2017/10/17/decoding-lacp-port-state)
+- **MII status**: Up. Is the interface up?
+- **Aggregator ID**: 1. Both replicas should have the same aggregator ID.
+- **Details partner LACP protocol data unit (PDU)**: Port state 61. The value is a bit mask that represents the LACP negotiation state on that port. Generally, 61 and 63 are what we want. For more information, see [Decoding LACP Port State](https://movingpackets.net/2017/10/17/decoding-lacp-port-state).

-### Fixing the issue
+### Fix the issue

-The most common causes for these LACP issues are host/switch miswiring or mismatched LACP/MLAG configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, then determine if the switch LACP/MLAG configuration is incorrect.
+The most common causes for these LACP issues are host or switch miswiring or mismatched LACP/Multi-Chassis Link Aggregation (MLAG) configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, determine if the switch LACP/MLAG configuration is incorrect.

-## Further information
+## Related content

-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+- If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+- For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
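
As a convenience for the checks this file describes, here's a minimal sketch that pulls only the key fields from the bond status file. The field names come from the doc; worker hosts would substitute `4b_pf_bond` or `98_pf_bond`.

```bash
# Hedged sketch: show only the bond fields called out above.
BOND=/proc/net/bonding/8a_pf_bond   # control plane bond; worker hosts use 4b_pf_bond or 98_pf_bond

grep -iE 'MII Status|LACP active|Aggregator ID|System MAC address|port state' "$BOND"
```
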

articles/operator-nexus/troubleshoot-memory-limits.md

Lines changed: 17 additions & 15 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot container memory limits
-description: Troubleshooting Kubernetes container limits
+description: Learn how to troubleshoot Kubernetes container limits.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
@@ -11,23 +11,25 @@ author: matternst7258

 # Troubleshoot container memory limits

-## Alerting for memory limits
+Learn about troubleshooting for container memory limits in this article.

-It's recommended to have alerts set up for the Operator Nexus cluster to look for Kubernetes pods restarting from OOMKill errors. These alerts allow customers to know if a component on a server is working appropriately.
+## Alerts for memory limits

-Metrics exposed to identify memory limits:
+We recommend that you have alerts set up for the Operator Nexus cluster to look for Kubernetes pods that restart from `OOMKill` errors. These alerts let you know if a component on a server is working appropriately.

-| Metric Name | Description |
+The following table lists the metrics that are exposed to identify memory limits.
+
+| Metric name | Description |
 | ------------------------------------ | ------------------------------------------------ |
 | Container Restarts | `kube_pod_container_status_restarts_total` |
 | Container Status Terminated Reason | `kube_pod_container_status_terminated_reason` |
 | Container Resource Limits | `kube_pod_container_resource_limits` |

-`Container Status Terminated Reason` displays the OOMKill reason for impacted pods.
+The `Container Status Terminated Reason` metric displays the `OOMKill` reason for pods that are affected.

-## Identifying Out of Memory (OOM) pods
+## Identify Out of Memory (OOM) pods

-Start by identifying any components that are restarting or show OOMKill.
+Start by identifying any components that are restarting or show `OOMKill`.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
@@ -37,7 +39,7 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>
 --subscription "<subscription>"
 ```

-Once identified, a `describe pod` command can determine the status and restart count.
+When components are identified, a `describe pod` command can determine the status and restart count.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
@@ -47,7 +49,7 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>
 --subscription "<subscription>"
 ```

-At the same time, a `get events` command can provide history to see the frequency of pod restarts.
+At the same time, a `get events` command can provide history so that you can see the frequency of pod restarts.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
@@ -57,20 +59,20 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>
 --subscription "<subscription>"
 ```

-The data from these commands identify whether a pod is restarting due to `OOMKill`.
+The data from these commands identifies whether a pod is restarting because of `OOMKill`.

-## Patching memory limits
+## Patch memory limits

 Raise a Microsoft support request for all memory limit changes for adjustments and support.

 > [!WARNING]
-> Patching memory limits to a pod are not permanent and can be overwritten if the pod restarts.
+> Patched memory limits on a pod aren't permanent and can be overwritten if the pod restarts.

 ## Confirm memory limit changes

-When memory limits change, the pods should return to `Ready` state and stop restarting.
+When memory limits change, the pods should return to the `Ready` state and stop restarting.

-The following commands can be used to confirm the behavior.
+Use the following commands to confirm the behavior.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
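
For orientation, here's a minimal sketch of the identification flow these hunks describe. It assumes you can run `kubectl` (directly or through the `run-read-command` wrapper shown above); the jsonpath expression and placeholder names are illustrative.

```bash
# Hedged sketch: list pods whose last termination reason was OOMKilled,
# then inspect one of them. Placeholder names are illustrative.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep -i OOMKilled

# Restart counts and status for an affected pod.
kubectl describe pod <POD_NAME> -n <NAMESPACE>

# Event history to see how often the pod restarts.
kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp
```
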
