articles/operator-nexus/howto-baremetal-functions.md
Lines changed: 22 additions & 24 deletions
@@ -1,20 +1,20 @@
 ---
-title: "Azure Operator Nexus: Platform Functions for Bare Metal Machines"
-description: Learn how to manage Bare Metal Machines (BMM).
+title: "Azure Operator Nexus: Platform functions for bare metal machines"
+description: Learn how to manage bare metal machines (BMM).
 author: eak13
 ms.author: ekarandjeff
 ms.service: azure-operator-nexus
 ms.topic: how-to
-ms.date: 04/24/2024
+ms.date: 04/30/2024
 ms.custom: template-how-to, devx-track-azurecli
 ---

-# Manage lifecycle of Bare Metal Machines
+# Manage the lifecycle of bare metal machines

-This article describes how to perform lifecycle management operations on Bare Metal Machines (BMM). These steps should be used for troubleshooting purposes to recover from failures or when taking maintenance actions. The commands to manage the lifecycle of the BMM include:
+This article describes how to perform lifecycle management operations on bare metal machines (BMM). These steps should be used for troubleshooting purposes to recover from failures or when taking maintenance actions. The commands to manage the lifecycle of the BMM include:

 > [!CAUTION]
-> Actions against management servers should not be run without consultation with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
+> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.

 - **Power off the BMM**
 - Start the BMM
@@ -26,7 +26,7 @@ This article describes how to perform lifecycle management operations on Bare Me
 - **Replace the BMM**

 > [!IMPORTANT]
-> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or the full KCP is not available. This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break healthy quorum threshold of Kubernetes Control Plane.
+> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available. This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
 >
 > The bolded actions in the above list are considered disruptive (Power off, Restart, Reimage, Replace). Cordon without evacuate is not considered disruptive. Cordon with evacuate is considered disruptive.
 >
@@ -35,11 +35,11 @@ This article describes how to perform lifecycle management operations on Bare Me
1. Get the name of the resource group for the BMM.
+1. Get the name of the bare metal machine that requires a lifecycle management operation.
+1. Ensure that the target bare metal machine has its `poweredState` set to `On` and its `readyState` set to `True`.
+   1. This prerequisite isn't applicable for the `start` command.

 ## Power off the BMM

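
The state prerequisite in the hunk above could be scripted as a pre-flight check. The following is a minimal sketch and not part of the article: `bmm_prereq_ok` is a hypothetical helper, and it assumes you have already read the machine's powered state and ready state (for example, from the output of `az networkcloud baremetalmachine show`) and pass them in as plain strings.

```shell
# Hypothetical pre-flight helper (not part of the Azure CLI).
# Usage: bmm_prereq_ok <action> <poweredState> <readyState>
# Returns 0 when the action may proceed, 1 otherwise.
bmm_prereq_ok() {
  action="$1"
  powered_state="$2"
  ready_state="$3"

  # The state prerequisite isn't applicable for the start command.
  if [ "$action" = "start" ]; then
    return 0
  fi

  # Every other action expects poweredState "On" and readyState "True".
  [ "$powered_state" = "On" ] && [ "$ready_state" = "True" ]
}

# Example with assumed values; replace them with the states read from the BMM.
if bmm_prereq_ok "restart" "On" "True"; then
  echo "prerequisites met"
else
  echo "prerequisites not met" >&2
fi
```

Keeping the check as a pure function of two strings leaves the choice of how to query the BMM open, so the exact property names in the CLI output should be verified separately.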
@@ -90,9 +90,9 @@ az networkcloud baremetalmachine cordon \

 The `evacuate "True"` removes workloads from that node while `evacuate "False"` only prevents the scheduling of new workloads.

-## Make a BMM schedulable (uncordon)
+## Make a BMM "schedulable" (uncordon)

-You can make a BMM `schedulable` (usable) by executing the [`uncordon`](#make-a-bmm-schedulable-uncordon) command. All workloads in a `pending`
+You can make a BMM "schedulable" (usable) by executing the [`uncordon`](#make-a-bmm-schedulable-uncordon) command. All workloads in a `pending`
 state on the BMM are `restarted` when the BMM is `uncordoned`.

 ```azurecli
@@ -103,15 +103,14 @@ az networkcloud baremetalmachine uncordon \

 ## Reimage a BMM

-You can restore the runtime version on a BMM by executing `reimage` command. This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers. This action doesn't affect the tenant workload files on this BMM.
+You can restore the runtime version on a BMM by executing the `reimage` command. This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers. This action doesn't affect the tenant workload files on this BMM.
 As a best practice, make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon)
 command, with `evacuate "True"`, before executing the `reimage` command.

 > [!WARNING]
-> Running more than one baremetalmachine replace or reimage command at the same time, or running a replace
-> at the same time as a reimage, will leave servers in a nonworking state. Make sure one replace / reimage
-> has fully completed before starting another one. In a future release, we plan to either add the ability
-> to replace multiple servers at once or have the command return an error when attempting to do so.
+> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
+> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one `replace`/`reimage`
+> has fully completed before starting another one.

 ```azurecli
 az networkcloud baremetalmachine reimage \
@@ -121,13 +120,12 @@ az networkcloud baremetalmachine reimage \

 ## Replace BMM

-Use `Replace BMM` command when a server encounters hardware issues requiring a complete or partial hardware replacement. After replacement of components such as motherboard or Network Interface Card (NIC) replacement, the MAC address of BMM will change, however the iDRAC IP address and hostname will remain the same.
+Use the `replace` command when a server encounters hardware issues requiring a complete or partial hardware replacement. After the replacement of components such as the motherboard or Network Interface Card (NIC), the MAC address of the BMM will change; however, the iDRAC IP address and hostname will remain the same.

 > [!WARNING]
-> Running more than one baremetalmachine replace or reimage command at the same time, or running a replace
-> at the same time as a reimage, will leave servers in a nonworking state. Make sure one replace / reimage
-> has fully completed before starting another one. In a future release, we plan to either add the ability
-> to replace multiple servers at once or have the command return an error when attempting to do so.
+> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
+> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one `replace`/`reimage`
+> has fully completed before starting another one.
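
Because the warning asks operators to let one `replace` or `reimage` finish before starting another, a small local guard may help when commands are launched from scripts. The following is a rough sketch only: `run_serialized` and `LOCK_DIR` are hypothetical names, the lock only covers commands started through this wrapper on the same workstation, and it assumes the wrapped command returns after the operation completes (that is, it isn't run with `--no-wait`). It can't detect operations started elsewhere.

```shell
# Hypothetical wrapper (not part of the Azure CLI): allow only one
# replace/reimage command at a time from this workstation.
LOCK_DIR="${TMPDIR:-/tmp}/bmm-disruptive.lock"  # assumed lock location

run_serialized() {
  # mkdir is atomic, so it fails if another invocation holds the lock.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "Another replace/reimage appears to be running; try again later." >&2
    return 1
  fi

  # Run the wrapped command, then release the lock whatever the outcome.
  "$@"
  status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}

# Example (placeholders for the BMM and resource group names):
# run_serialized az networkcloud baremetalmachine reimage \
#   --name "<bmm-name>" --resource-group "<resource-group>"
```

A directory lock is used here because `mkdir` either creates the directory or fails in one step, which avoids a separate check-then-create race between two shells.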
articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
Lines changed: 9 additions & 9 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot Azure Operator Nexus server problems
-description: Troubleshoot cluster bare-metal machines with Restart, Reimage, Replace for Azure Operator Nexus.
+description: Troubleshoot cluster baremetal machines with Restart, Reimage, Replace for Azure Operator Nexus.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
@@ -11,12 +11,12 @@ ms.author: ekarandjeff

 # Troubleshoot Azure Operator Nexus server problems

-This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare-metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
+This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus baremetal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.

 The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.

 > [!CAUTION]
-> Actions against management servers should not be run without consultation with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
+> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.

 ## Prerequisites

@@ -26,15 +26,15 @@ The time required to complete each of these actions is similar. Restarting is th
 - Name of the BMM that requires a lifecycle management operation

 > [!IMPORTANT]
-> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or the full KCP is not available.
+> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available.
 >
-> Restart, Reimage and Replace are all considered disruptive actions.
+> Restart, reimage and replace are all considered disruptive actions.
 >
-> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break healthy quorum threshold of Kubernetes Control Plane.
+> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.

 ## Identify the corrective action

-When you're troubleshooting a BMM for failures and determining the best corrective action, it's important to understand the available options. Restarting or reimaging a BMM can be an efficient and effective way to fix problems or restore the software to a known-good place. This article provides direction on the best practices for each of the three Rs.
+When you're troubleshooting a BMM for failures and determining the best corrective action, it's important to understand the available options. Restarting or reimaging a BMM can be an efficient and effective way to fix problems or restore the software to a known-good place. Replacing a BMM might be required when one or more hardware components fail on the server. This article provides direction on the best practices for each of the three actions.

 Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and work your way up to more complex and drastic measures, if necessary.

@@ -64,9 +64,9 @@ Servers contain many physical components that can fail over time. It's important

 A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.

-As a best practice, cordon off and shut down the BMM in advance of physical repairs. When you're performing the following physical repair, a replace action isn't required because the BMM host will continue to function normally after the repair:
+As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.

-- Hot swappable power supply
+When you're performing a physical hot swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.

 When you're performing the following physical repairs, we recommend a replace action, though it isn't necessary to bring the BMM back into service: