Commit 7dc58af

Merge pull request #273229 from eak13/main
Updates for Feature 758537 Prevent Simultaneous Disruptive Actions Against KCP nodes
2 parents 9eb5b57 + 121dc80 commit 7dc58af

2 files changed

+64
-47
lines changed

articles/operator-nexus/howto-baremetal-functions.md

Lines changed: 42 additions & 35 deletions
@@ -1,36 +1,45 @@
11
---
2-
title: "Azure Operator Nexus: Platform Functions for Bare Metal Machines"
3-
description: Learn how to manage Bare Metal Machines (BMM).
4-
author: harish6724
5-
ms.author: harishrao
2+
title: "Azure Operator Nexus: Platform functions for bare metal machines"
3+
description: Learn how to manage bare metal machines (BMM).
4+
author: eak13
5+
ms.author: ekarandjeff
66
ms.service: azure-operator-nexus
77
ms.topic: how-to
8-
ms.date: 05/26/2023
8+
ms.date: 04/30/2024
99
ms.custom: template-how-to, devx-track-azurecli
1010
---
1111

12-
# Manage lifecycle of Bare Metal Machines
12+
# Manage the lifecycle of bare metal machines
1313

14-
This article describes how to perform lifecycle management operations on Bare Metal Machines (BMM). These steps should be used for troubleshooting purposes to recover from failures or when taking maintenance actions. The commands to manage the lifecycle of the BMM include:
14+
This article describes how to perform lifecycle management operations on bare metal machines (BMM). These steps should be used for troubleshooting purposes to recover from failures or when taking maintenance actions. The commands to manage the lifecycle of the BMM include:
1515

16-
- Power off the BMM
16+
> [!CAUTION]
17+
> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
18+
19+
- **Power off the BMM**
1720
- Start the BMM
18-
- Restart the BMM
19-
- Make the BMM unschedulable or schedulable
20-
- Reimage the BMM
21-
- Replace the BMM
21+
- **Restart the BMM**
22+
- Make the BMM unschedulable (cordon without evacuate)
23+
- **Make the BMM unschedulable (cordon with evacuate)**
24+
- Make the BMM schedulable (uncordon)
25+
- **Reimage the BMM**
26+
- **Replace the BMM**
27+
28+
> [!IMPORTANT]
29+
> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available. This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
30+
>
31+
> The bolded actions in the above list are considered disruptive (Power off, Restart, Reimage, Replace). Cordon without evacuate is not considered disruptive. Cordon with evacuate is considered disruptive.
32+
>
33+
> As noted in the cautionary statement, running actions against management servers, especially KCP nodes, should only be done in consultation with Microsoft support personnel.
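
The classification in the note above can be sketched as a small shell helper. This is only an illustration: the function name `is_disruptive_action` and the action labels it accepts are hypothetical and are not part of the `az networkcloud` CLI. The mapping follows the bolded list in this article.

```shell
# Hypothetical helper (not part of the az CLI): classify a BMM lifecycle
# action as disruptive or not, following the list in this article.
# Usage: is_disruptive_action <action> [evacuate]
is_disruptive_action() {
  local action="$1"
  local evacuate="${2:-False}"
  case "$action" in
    power-off|restart|reimage|replace)
      echo "yes" ;;
    cordon)
      # Cordon is disruptive only when workloads are evacuated.
      if [ "$evacuate" = "True" ]; then echo "yes"; else echo "no"; fi ;;
    start|uncordon)
      echo "no" ;;
    *)
      # Unrecognized label: report it and signal failure to the caller.
      echo "unknown"; return 1 ;;
  esac
}
```

A wrapper script could call this before issuing a command against a KCP node and wait, rather than submit a request that the platform would reject.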
2234
2335
## Prerequisites
2436

2537
1. Install the latest version of the
26-
[appropriate CLI extensions](./howto-install-cli-extensions.md)
27-
1. Get the name of the resource group for the BMM
28-
1. Get the name of the bare metal machine that requires a lifecycle management operation
29-
1. Ensure that the target bare metal machine `poweredState` set to `On` and `readyState` set to `True`
30-
1. This prerequisite is not applicable for the `start` command
31-
32-
> [!CAUTION]
33-
> Actions against management servers should not be run without consultation with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
38+
[appropriate CLI extensions](./howto-install-cli-extensions.md).
39+
1. Get the name of the resource group for the BMM.
40+
1. Get the name of the bare metal machine that requires a lifecycle management operation.
41+
1. Ensure that the target bare metal machine has its `poweredState` set to `On` and its `readyState` set to `True`.
42+
1. This prerequisite isn't applicable for the `start` command.
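
The state prerequisite above can be expressed as a small check. The sketch below is a hypothetical helper (the name `bmm_meets_prerequisites` is an assumption, not an Azure CLI feature) that takes the two state values as arguments instead of querying Azure, so it makes no claim about how those values are retrieved.

```shell
# Hypothetical helper: decide whether a lifecycle action may proceed, given
# the poweredState and readyState values already read from the BMM resource.
# Usage: bmm_meets_prerequisites <action> <poweredState> <readyState>
bmm_meets_prerequisites() {
  local action="$1"
  local powered="$2"
  local ready="$3"
  # The state prerequisite does not apply to the start command.
  if [ "$action" = "start" ]; then
    return 0
  fi
  # Every other action expects the machine to be powered on and ready.
  [ "$powered" = "On" ] && [ "$ready" = "True" ]
}
```

The two values could be read, for example, from the output of `az networkcloud baremetalmachine show` for the machine. The exact property names in that output are not confirmed by this article, so verify them before scripting against it.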
3443

3544
## Power off the BMM
3645

@@ -81,9 +90,9 @@ az networkcloud baremetalmachine cordon \
8190

8291
The `evacuate "True"` removes workloads from that node while `evacuate "False"` only prevents the scheduling of new workloads.
8392

84-
## Make a BMM schedulable (uncordon)
93+
## Make a BMM "schedulable" (uncordon)
8594

86-
You can make a BMM `schedulable` (usable) by executing the [`uncordon`](#make-a-bmm-schedulable-uncordon) command. All workloads in a `pending`
95+
You can make a BMM "schedulable" (usable) by executing the [`uncordon`](#make-a-bmm-schedulable-uncordon) command. All workloads in a `pending`
8796
state on the BMM are `restarted` when the BMM is `uncordoned`.
8897

8998
```azurecli
@@ -94,15 +103,14 @@ az networkcloud baremetalmachine uncordon \
94103

95104
## Reimage a BMM
96105

97-
You can restore the runtime version on a BMM by executing `reimage` command. This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers. This action doesn't impact the tenant workload files on this BMM.
106+
You can restore the runtime version on a BMM by executing the `reimage` command. This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers. This action doesn't affect the tenant workload files on this BMM.
98107
As a best practice, make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon)
99-
command, with `evacuate "True"`, prior to executing the `reimage` command.
108+
command, with `evacuate "True"`, before executing the `reimage` command.
100109

101-
> [!Warning]
102-
> Running more than one baremetalmachine replace or reimage command at the same time, or running a replace
103-
> at the same time as a reimage, will leave servers in a nonworking state. Make sure one replace / reimage
104-
> has fully completed before starting another one. In a future release, we plan to either add the ability
105-
> to replace multiple servers at once or have the command return an error when attempting to do so.
110+
> [!WARNING]
111+
> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
112+
> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one `replace`/`reimage`
113+
> has fully completed before starting another one.
106114
107115
```azurecli
108116
az networkcloud baremetalmachine reimage \
@@ -112,13 +120,12 @@ az networkcloud baremetalmachine reimage \
112120

113121
## Replace BMM
114122

115-
Use `Replace BMM` command when a server has encountered hardware issues requiring a complete or partial hardware replacement. After replacement of components such as motherboard or NIC replacement, the MAC address of BMM will change, however the IDrac IP address and hostname will remain the same.
123+
Use the `replace` command when a server encounters hardware issues requiring a complete or partial hardware replacement. After components such as the motherboard or Network Interface Card (NIC) are replaced, the MAC address of the BMM changes; however, the iDRAC IP address and hostname remain the same.
116124

117-
> [!Warning]
118-
> Running more than one baremetalmachine replace or reimage command at the same time, or running a replace
119-
> at the same time as a reimage, will leave servers in a nonworking state. Make sure one replace / reimage
120-
> has fully completed before starting another one. In a future release, we plan to either add the ability
121-
> to replace multiple servers at once or have the command return an error when attempting to do so.
125+
> [!WARNING]
126+
> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
127+
> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one `replace`/`reimage`
128+
> has fully completed before starting another one.
122129
123130
```azurecli
124131
az networkcloud baremetalmachine replace \

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 22 additions & 12 deletions
@@ -1,32 +1,42 @@
11
---
22
title: Troubleshoot Azure Operator Nexus server problems
3-
description: Troubleshoot cluster bare-metal machines with three Rs for Azure Operator Nexus.
3+
description: Troubleshoot cluster bare metal machines with Restart, Reimage, Replace for Azure Operator Nexus.
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
7-
ms.date: 06/12/2023
8-
author: JAC0BSMITH
9-
ms.author: jacobsmith
7+
ms.date: 04/24/2024
8+
author: eak13
9+
ms.author: ekarandjeff
1010
---
1111

1212
# Troubleshoot Azure Operator Nexus server problems
1313

14-
This article describes how to troubleshoot server problems by using restart, reimage, and replace (three Rs) actions on Azure Operator Nexus bare-metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
14+
This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
1515

1616
The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
1717

18+
> [!CAUTION]
19+
> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
20+
1821
## Prerequisites
1922

2023
- Familiarize yourself with the capabilities referenced in this article by reviewing the [BMM actions](howto-baremetal-functions.md).
2124
- Gather the following information:
2225
- Name of the resource group for the BMM
2326
- Name of the BMM that requires a lifecycle management operation
2427

28+
> [!IMPORTANT]
29+
> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available.
30+
>
31+
> Restart, reimage and replace are all considered disruptive actions.
32+
>
33+
> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
34+
2535
## Identify the corrective action
2636

27-
When you're troubleshooting a BMM for failures and determining the best corrective action, it's important to understand the available options. Restarting or reimaging a BMM can be an efficient and effective way to fix problems or simply restore the software to a known-good place. This article provides direction on the best practices for each of the three Rs.
37+
When you're troubleshooting a BMM for failures and determining the best corrective action, it's important to understand the available options. Restarting or reimaging a BMM can be an efficient and effective way to fix problems or restore the software to a known-good place. Replacing a BMM might be required when one or more hardware components fail on the server. This article provides direction on the best practices for each of the three actions.
2838

29-
Troubleshooting technical problems requires a systematic approach. One effective method is to start with the simplest and least invasive solution and work your way up to more complex and drastic measures, if necessary.
39+
Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and work your way up to more complex and drastic measures, if necessary.
3040

3141
The first step in troubleshooting is often to try restarting the device or system. Restarting can help to clear any temporary glitches or errors that might be causing the problem. If restarting doesn't solve the problem, the next step might be to try reimaging the device or system.
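
The escalation order described above can be sketched as a lookup from the last action tried to the next one to consider. The helper name `next_corrective_action` is hypothetical. The step from reimage to replace is an assumption based on the ordering of the three actions in this article, which describes replace as the option for failed hardware components.

```shell
# Hypothetical helper: suggest the next corrective action to try, moving from
# the least invasive option toward the most invasive one.
# Usage: next_corrective_action [last-action-tried]
next_corrective_action() {
  case "${1:-none}" in
    none)    echo "restart" ;;  # nothing tried yet: start with a restart
    restart) echo "reimage" ;;  # restart did not help: redeploy the OS image
    reimage) echo "replace" ;;  # assumed last step: suspected hardware failure
    replace) echo "none" ;;     # no further action in this sequence
    *)       echo "unknown"; return 1 ;;
  esac
}
```

For example, `next_corrective_action restart` prints `reimage`.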
3242

@@ -42,7 +52,7 @@ The restart typically is the starting point for mitigating a problem.
4252

4353
## Troubleshoot with a reimage action
4454

45-
Reimaging a BMM is a process that you use to redeploy the image on the OS disk, without impact to the tenant data. This action executes the steps to rejoin the cluster with the same identifiers.
55+
Reimaging a BMM is a process that you use to redeploy the image on the OS disk, without affecting the tenant data. This action executes the steps to rejoin the cluster with the same identifiers.
4656

4757
The reimage action can be useful for troubleshooting problems by restoring the OS to a known-good working state. Common causes that can be resolved through reimaging include recovery due to doubt of host integrity, suspected or confirmed security compromise, or "break glass" write activity.
4858

@@ -54,14 +64,14 @@ Servers contain many physical components that can fail over time. It's important
5464

5565
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.
5666

57-
As a best practice, cordon off and shut down the BMM in advance of physical repairs. When you're performing the following physical repair, a replace action isn't required because the BMM host will continue to function normally after the repair:
67+
As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
5868

59-
- Hot swappable power supply
69+
When you're performing a physical hot swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.
6070

6171
When you're performing the following physical repairs, we recommend a replace action, though it isn't necessary to bring the BMM back into service:
6272

6373
- CPU
64-
- DIMM
74+
- Dual In-Line Memory Module (DIMM)
6575
- Fan
6676
- Expansion board riser
6777
- Transceiver
@@ -73,7 +83,7 @@ When you're performing the following physical repairs, a replace action is requi
7383
- System board
7484
- SSD disk
7585
- PERC/RAID adapter
76-
- Mellanox NIC
86+
- Mellanox Network Interface Card (NIC)
7787
- Broadcom embedded NIC
7888

7989
## Summary
