Commit a588ef6

edit pass: azure-operator-nexus-cluster-and-bmm
1 parent 3e5de75 commit a588ef6

5 files changed: +93 -93 lines changed
Lines changed: 12 additions & 12 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Azure Operator Nexus: Accepted Cluster"
-description: Troubleshoot accepted Cluster resource.
+description: Learn how to troubleshoot accepted Cluster resources.
 author: matternst7258
 ms.author: matthewernst
 ms.service: azure-operator-nexus
@@ -12,23 +12,24 @@ ms.date: 10/30/2024

 # Troubleshoot accepted Cluster resources

-Operator Nexus relies on mirroring, or hydrating, resources from the on-premises cluster to Azure. When this process is interrupted, the Cluster resource can move to `Accepted` state.
+Operator Nexus relies on mirroring, or hydrating, resources from the on-premises cluster to Azure. When this process is interrupted, the Cluster resource can move to the `Accepted` state.

 ## Diagnosis

-The Cluster status is viewed via the Azure portal or via Azure CLI.
+The Cluster status is viewed via the Azure portal or the Azure CLI.

 ```bash
 az networkcloud cluster show --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
 ```

 ## Mitigation steps

-### Triggering the resource sync
+Follow these steps for mitigation.

+### Trigger the resource sync

 1. From the Cluster resource page in the Azure portal, add a tag to the Cluster resource.
-2. The resource moves out of the `Accepted` state.
+1. The resource moves out of the `Accepted` state.

 ```bash
 az login
@@ -38,17 +39,16 @@ az resource tag --tags exampleTag=exampleValue --name <CLUSTER> --resource-group

 ## Verification

-After the tag is applied, the Cluster moves to `Running` state.
+After the tag is applied, the Cluster moves to the `Running` state.

 ```bash
 az networkcloud cluster show --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
 ```

-If the Cluster resource maintains the state after a period of time, more than 5 minutes, contact Microsoft support.
+If the Cluster resource maintains the state after more than five minutes, contact Microsoft support.

-## Further information
+## Related content

-Learn more about how resources are hydrated with [Azure Arc-enabled Kubernetes](/azure/azure-arc/kubernetes/overview).
-
-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+- For more information about how resources are hydrated, see [Azure Arc-enabled Kubernetes](/azure/azure-arc/kubernetes/overview).
+- If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+- For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
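
As a quick illustration of the mitigation this file documents, here's a minimal sketch of the tag-and-verify flow. The placeholder values, the tag key and value, and the `Microsoft.NetworkCloud/clusters` resource type are assumptions for illustration.

```bash
# Hedged sketch of the documented mitigation: add a tag to trigger the resource
# sync, then re-check the Cluster. Placeholder values are illustrative.
az login

RG="<RESOURCE_GROUP>"
CLUSTER="<CLUSTER_NAME>"

# Adding or updating any tag nudges the Cluster resource to re-sync.
# The resource type below is an assumption for the Operator Nexus Cluster resource.
az resource tag --tags exampleTag=exampleValue \
  --name "$CLUSTER" \
  --resource-group "$RG" \
  --resource-type "Microsoft.NetworkCloud/clusters"

# Inspect the Cluster again; it should leave the Accepted state within a few minutes.
az networkcloud cluster show --resource-group "$RG" --name "$CLUSTER"
```
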
Lines changed: 38 additions & 40 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot control plane quorum loss
-description: Document how to restore control plane quorum loss
+description: Learn how to restore control plane quorum loss.
 ms.topic: article
 ms.date: 01/18/2024
 author: matthewernst
@@ -10,7 +10,7 @@ ms.service: azure-operator-nexus

 # Troubleshoot control plane quorum loss

-Follow this troubleshooting guide when multiple control plane nodes are offline or unavailable:
+Follow the steps in this troubleshooting article when multiple control plane nodes are offline or unavailable.

 ## Prerequisites

@@ -19,56 +19,54 @@ Follow this troubleshooting guide when multiple control plane nodes are offline
 - Gather the following information:
   - Subscription ID
   - Cluster name and resource group
-  - Bare metal machine name
-- Ensure you're logged using `az login`
-
+  - Bare-metal machine name
+- Ensure that you're signed in by using `az login`.

 ## Symptoms

-- Kubernetes API isn't available
-- Multiple control plane nodes are offline or unavailable
+- The Kubernetes API isn't available.
+- Multiple control plane nodes are offline or unavailable.

 ## Procedure

-1. Identify the Nexus Management Node
-   - To identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>`
-   - Log in to the identified server
-   - Ensure the ironic-conductor service is present on this node using `crictl ps -a |grep -i ironic-conductor`
-   Example output:
+1. Identify the Nexus Management Node:
+   - To identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>`.
+   - Sign in to the identified server.
+   - Ensure that the ironic-conductor service is present on this node by using `crictl ps -a |grep -i ironic-conductor`. Here's example output:

-   ~~~
-   testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
-   <id> <id> 6 hours ago Running ironic-conductor 0 <id>
-   ~~~
+   ~~~
+   testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
+   <id> <id> 6 hours ago Running ironic-conductor 0 <id>
+   ~~~

-2. Determine the iDRAC IP of the server
-   - Run the command `az networkcloud cluster list -g <RG_Name>`
-   - The output of the command is a JSON with the iDRAC IP
+1. Determine the Dell remote access controller (iDRAC) IP of the server:
+   - Run the command `az networkcloud cluster list -g <RG_Name>`.
+   - The output of the command is JSON with the iDRAC IP.

-   ~~~
-   {
-     "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
-     "bmcCredentials": {
-       "username": "<username>"
-     },
-     "bmcMacAddress": "<bmcMacAddress>",
-     "bootMacAddress": "<bootMacAddress",
-     "machineDetails": "extraDetails",
-     "machineName": "<machineName>",
-     "rackSlot": <rackSlot>,
-     "serialNumber": "<serialNumber>"
-   },
-   ~~~
+   ~~~
+   {
+     "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
+     "bmcCredentials": {
+       "username": "<username>"
+     },
+     "bmcMacAddress": "<bmcMacAddress>",
+     "bootMacAddress": "<bootMacAddress",
+     "machineDetails": "extraDetails",
+     "machineName": "<machineName>",
+     "rackSlot": <rackSlot>,
+     "serialNumber": "<serialNumber>"
+   },
+   ~~~

-3. Access the iDRAC GUI using the IP in your browser to shut down impacted management servers
+1. Access the integrated iDRAC graphical user interface (GUI) by using the IP in your browser to shut down affected management servers.

-   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::
+   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot that shows an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::

-4. When all impacted management servers are down, turn on the servers using the iDRAC GUI
+1. When all affected management servers are down, turn on the servers by using the iDRAC GUI.

-   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot of an iDRAC GUI and the button to perform power on command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::
+   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot that shows an iDRAC GUI and the button to perform the power command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::

-5. The servers should now be restored.
+The servers should now be restored.

-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
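
For reference, here's a minimal sketch of the identification steps this file walks through. It assumes the `az` commands run from a workstation and `crictl` runs on the identified server; placeholder names are illustrative.

```bash
# Hedged sketch of the identification steps; placeholder values are illustrative.
RG="<ResourceGroup_Name>"

# 1. List the bare metal machines to find the management nodes.
az networkcloud baremetalmachine list -g "$RG" -o table

# 2. On the identified server, confirm that the ironic-conductor container is present.
sudo crictl ps -a | grep -i ironic-conductor

# 3. Find the iDRAC (BMC) address in the Cluster resource output.
az networkcloud cluster list -g "$RG" -o json | grep -i bmcConnectionString
```
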

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 5 additions & 5 deletions
@@ -74,7 +74,7 @@ This section discusses troubleshooting for problems you might encounter.
   * To troubleshoot a memory problem, contact the vendor.

 * CPU-related failure (`cpu_sockets`)
-  * CPU specs are defined in the SKU. A failed `cpu_sockets` check indicates a failed CPU or CPU count mismatch. The following example shows a failed CPU check.
+  * CPU specs are defined in the version. A failed `cpu_sockets` check indicates a failed CPU or CPU count mismatch. The following example shows a failed CPU check.

   ```yaml
   {
@@ -521,11 +521,11 @@ This section discusses troubleshooting for problems you might encounter.
   ]
   ```

-  * To power a server on in the BMC web UI:
+  * To power on a server in the BMC web UI:

    `BMC` -> `Dashboard` -> `Power On System`

-  * To power a server on with `racadm`:
+  * To power on a server with `racadm`:

   ```bash
   racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD serveraction powerup
@@ -696,5 +696,5 @@ This section discusses troubleshooting for problems you might encounter.

 After the hardware is fixed, run the BMM `replace` action by following the instructions in [Manage the lifecycle of bare metal machines](howto-baremetal-functions.md).

-If you still have questions, [contact Support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
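
To round out the power-on guidance above, here's a small sketch that checks the power state before powering the server on. The `powerstatus` subcommand and the placeholder variables are assumptions; the `powerup` command comes from the doc itself.

```bash
# Hedged sketch: check the current power state, then power the server on.
IP="<iDRAC_IP>"          # placeholder
BMC_USR="<bmc_username>" # placeholder
BMC_PWD="<bmc_password>" # placeholder

# Assumed subcommand: report the current power state.
racadm --nocertwarn -r "$IP" -u "$BMC_USR" -p "$BMC_PWD" serveraction powerstatus

# Power the server on, as documented.
racadm --nocertwarn -r "$IP" -u "$BMC_USR" -p "$BMC_PWD" serveraction powerup
```
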
Lines changed: 21 additions & 21 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Azure Operator Nexus: Networking"
-description: Checking LACP Bonding on Physical Hosts.
+description: Learn how to check LACP bonding on physical hosts.
 author: keithritchie73
 ms.author: keithritchie
 ms.service: azure-operator-nexus
@@ -9,44 +9,44 @@ ms.topic: troubleshooting
 ms.date: 11/15/2024
 ---

-# Checking LACP Bonding on Physical Hosts
+# Check LACP bonding on physical hosts

-On physical host startup, the two Mellanox cards are LACP bonded to a pair of Arista switches. If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or load balancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic and is due to the hashing/load balancing nature of LACP.
+On physical host startup, the two Mellanox cards are bonded to a pair of Arista switches by the Link Aggregation Control Protocol (LACP). If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or load-balancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic. They occur because of the hashing/load-balancing nature of LACP.

 ## Diagnosis

-If, LACP isn't negotiated correctly traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a vm that can't get on the network, or even oam/storage outages.
+If LACP isn't negotiated correctly, traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a virtual machine that can't get on the network, or even as oam/storage outages.

-## Checking LACP Bonding
+## Check LACP bonding

-To check the LACP bonding status on a physical host run the following command. For control plane hosts, use file 8a_pf_bond as there's only one Mellanox card on those hosts. For worker hosts, use either 4b_pf_bond or 98_pf_bond to check its two cards.
+To check the LACP bonding status on a physical host, run the following command. For control plane hosts, use file `8a_pf_bond` because there's only one Mellanox card on those hosts. For worker hosts, use either `4b_pf_bond` or `98_pf_bond` to check two cards.

 ```bash
 # cat /proc/net/bonding/8a_pf_bond
 ```

-### Interpreting the results
+### Interpret the results

-Key validations to check in the /proc/net/bonding/ output are:
+Key validations to check in the `/proc/net/bonding/` output are:

-For Bond level (the top part):
+For the bond level (the top part):

-1. MII Status: up - Is the entire bond up
-2. LACP active: on - Is LACP active
-3. Aggregator ID: 1 - The top level aggregator ID should match both replicas. See each port for its aggregator ID.
-4. System MAC address: 42:56:86:9c:81:89 - Is there a System MAC defined. If a bond isn't negotiated this will be undefined or all zeros, e.g 00:00:00:00:00:00
+- **MII status**: Up. Is the entire bond up?
+- **LACP active**: On. Is LACP active?
+- **Aggregator ID**: 1. The top-level aggregator ID should match both replicas. See each port for its aggregator ID.
+- **System MAC address**: 42:56:86:9c:81:89. Is there a System MAC defined? If a bond isn't negotiated, it's undefined or all zeros, for example, 00:00:00:00:00:00.

 For each port:

-1. MII Status: up - Is the interface up
-2. Aggregator ID: 1 - Both replicas should have the same aggregator ID
-3. details partner lacp pdu: port state 61 - The value is a bit mask that represents the LACP negotiation state on that port. Generally 61 and 63 are what we want. [See](https://movingpackets.net/2017/10/17/decoding-lacp-port-state)
+- **MII status**: Up. Is the interface up?
+- **Aggregator ID**: 1. Both replicas should have the same aggregator ID.
+- **Details partner LACP protocol data unit (PDU)**: Port state 61. The value is a bit mask that represents the LACP negotiation state on that port. Generally, 61 and 63 are what we want. For more information, see [Decoding LACP Port State](https://movingpackets.net/2017/10/17/decoding-lacp-port-state).

-### Fixing the issue
+### Fix the issue

-The most common causes for these LACP issues are host/switch miswiring or mismatched LACP/MLAG configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, then determine if the switch LACP/MLAG configuration is incorrect.
+The most common causes for these LACP issues are host or switch miswiring or mismatched LACP/Multi-Chassis Link Aggregation (MLAG) configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, determine if the switch LACP/MLAG configuration is incorrect.

-## Further information
+## Related content

-If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
-For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
+- If you still have questions, [contact Azure support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
+- For more information about support plans, see [Azure support plans](https://azure.microsoft.com/support/plans/response/).
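
As a convenience for the checks this file describes, here's a minimal sketch that pulls only the key fields from the bond status file. The field names come from the doc; worker hosts would substitute `4b_pf_bond` or `98_pf_bond`.

```bash
# Hedged sketch: show only the bond fields called out above.
BOND=/proc/net/bonding/8a_pf_bond   # control plane bond; worker hosts use 4b_pf_bond or 98_pf_bond

grep -iE 'MII Status|LACP active|Aggregator ID|System MAC address|port state' "$BOND"
```
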

articles/operator-nexus/troubleshoot-memory-limits.md

Lines changed: 17 additions & 15 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot container memory limits
-description: Troubleshooting Kubernetes container limits
+description: Learn how to troubleshoot Kubernetes container limits.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
@@ -11,23 +11,25 @@ author: matternst7258

 # Troubleshoot container memory limits

-## Alerting for memory limits
+Learn about troubleshooting for container memory limits in this article.

-It's recommended to have alerts set up for the Operator Nexus cluster to look for Kubernetes pods restarting from OOMKill errors. These alerts allow customers to know if a component on a server is working appropriately.
+## Alerts for memory limits

-Metrics exposed to identify memory limits:
+We recommend that you have alerts set up for the Operator Nexus cluster to look for Kubernetes pods that restart from `OOMKill` errors. These alerts let you know if a component on a server is working appropriately.

-| Metric Name | Description |
+The following table lists the metrics that are exposed to identify memory limits.
+
+| Metric name | Description |
 | ------------------------------------ | ------------------------------------------------ |
 | Container Restarts | `kube_pod_container_status_restarts_total` |
 | Container Status Terminated Reason | `kube_pod_container_status_terminated_reason` |
 | Container Resource Limits | `kube_pod_container_resource_limits` |

-`Container Status Terminated Reason` displays the OOMKill reason for impacted pods.
+The `Container Status Terminated Reason` metric displays the `OOMKill` reason for pods that are affected.

-## Identifying Out of Memory (OOM) pods
+## Identify Out of Memory (OOM) pods

-Start by identifying any components that are restarting or show OOMKill.
+Start by identifying any components that are restarting or show `OOMKill`.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
@@ -37,7 +39,7 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>
 --subscription "<subscription>"
 ```

-Once identified, a `describe pod` command can determine the status and restart count.
+When components are identified, a `describe pod` command can determine the status and restart count.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
@@ -47,7 +49,7 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>
 --subscription "<subscription>"
 ```

-At the same time, a `get events` command can provide history to see the frequency of pod restarts.
+At the same time, a `get events` command can provide history so that you can see the frequency of pod restarts.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
@@ -57,20 +59,20 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>
 --subscription "<subscription>"
 ```

-The data from these commands identify whether a pod is restarting due to `OOMKill`.
+The data from these commands identifies whether a pod is restarting because of `OOMKill`.

-## Patching memory limits
+## Patch memory limits

 Raise a Microsoft support request for all memory limit changes for adjustments and support.

 > [!WARNING]
-> Patching memory limits to a pod are not permanent and can be overwritten if the pod restarts.
+> Patched memory limits on a pod aren't permanent and can be overwritten if the pod restarts.

 ## Confirm memory limit changes

-When memory limits change, the pods should return to `Ready` state and stop restarting.
+When memory limits change, the pods should return to the `Ready` state and stop restarting.

-The following commands can be used to confirm the behavior.
+Use the following commands to confirm the behavior.

 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
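
For orientation, here's a minimal sketch of the identification flow these hunks describe. It assumes you can run `kubectl` (directly or through the `run-read-command` wrapper shown above); the jsonpath expression and placeholder names are illustrative.

```bash
# Hedged sketch: list pods whose last termination reason was OOMKilled,
# then inspect one of them. Placeholder names are illustrative.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep -i OOMKilled

# Restart counts and status for an affected pod.
kubectl describe pod <POD_NAME> -n <NAMESPACE>

# Event history to see how often the pod restarts.
kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp
```
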
