Commit a794583

Merge pull request #291624 from jmmason70/Edits-for-howto-baremetal-functions-and-troubleshoot-reboot-reimage-replace
Edits for howto baremetal functions and troubleshoot reboot reimage replace
2 parents edd83a6 + f1f0898 commit a794583

2 files changed

+138
-42
lines changed

articles/operator-nexus/howto-baremetal-functions.md

Lines changed: 40 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -16,14 +16,14 @@ This article describes how to perform lifecycle management operations on bare me
1616
> [!CAUTION]
1717
> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
1818
19-
- **Power off the BMM**
20-
- Start the BMM
21-
- **Restart the BMM**
22-
- Make the BMM unschedulable (cordon without evacuate)
23-
- **Make the BMM unschedulable (cordon with evacuate)**
24-
- Make the BMM schedulable (uncordon)
25-
- **Reimage the BMM**
26-
- **Replace the BMM**
19+
- **Power off a BMM**
20+
- Start a BMM
21+
- **Restart a BMM**
22+
- Make a BMM unschedulable (cordon without evacuate)
23+
- **Make a BMM unschedulable (cordon with evacuate)**
24+
- Make a BMM schedulable (uncordon)
25+
- **Reimage a BMM**
26+
- **Replace a BMM**
2727

2828
> [!IMPORTANT]
2929
> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available. This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't become non-operational at once due to simultaneous disruptive actions. If multiple nodes become non-operational, it will break the healthy quorum threshold of the Kubernetes Control Plane.
@@ -41,51 +41,56 @@ This article describes how to perform lifecycle management operations on bare me
4141
1. Ensure that the target bare metal machine has its `poweredState` set to `On` and its `readyState` set to `True`.
4242
1. This prerequisite isn't applicable for the `start` command.
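
The prerequisite above could be checked from a script before a disruptive action is issued. The following is a minimal sketch, not part of the article's change: the helper names `bmm_field` and `bmm_is_ready` are hypothetical, and the property names `powerState` and `readyState` are assumptions about the JSON returned by `az networkcloud baremetalmachine show` (the article calls the first one `poweredState`), so confirm them against real output first.

```bash
#!/usr/bin/env bash
# Sketch: verify the documented prerequisites before a disruptive action.
# Assumption: the resource JSON exposes `powerState` and `readyState`;
# check the actual names in the output of `az networkcloud baremetalmachine show`.

bmm_field() {
  # $1 = BMM name, $2 = managed resource group, $3 = subscription ID, $4 = property
  az networkcloud baremetalmachine show \
    --name "$1" --resource-group "$2" --subscription "$3" \
    --query "$4" --output tsv
}

bmm_is_ready() {
  # Returns 0 only when the machine reports On and True; 2 if a query fails.
  local power ready
  power=$(bmm_field "$1" "$2" "$3" powerState) || return 2
  ready=$(bmm_field "$1" "$2" "$3" readyState) || return 2
  [ "$power" = "On" ] && [ "$ready" = "True" ]
}
```

A caller might gate a restart on it, for example `bmm_is_ready "$BMM" "$MRG" "$SUB" && az networkcloud baremetalmachine restart ...`. As the article notes, this check doesn't apply before the `start` command.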
4343

44-
## Power off the BMM
44+
## Power off a BMM
4545

4646
This command will `power-off` the specified `bareMetalMachineName`.
4747

4848
```azurecli
4949
az networkcloud baremetalmachine power-off \
50-
--name "bareMetalMachineName" \
51-
--resource-group "cluster_MRG"
50+
--name <BareMetalMachineName> \
51+
--resource-group <CLUSTER_MRG> \
52+
--subscription <SUBSCRIPTION_ID>
5253
```
5354

54-
## Start the BMM
55+
## Start a BMM
5556

5657
This command will `start` the specified `bareMetalMachineName`.
5758

5859
```azurecli
5960
az networkcloud baremetalmachine start \
60-
--name "bareMetalMachineName" \
61-
--resource-group "cluster_MRG"
61+
--name <BareMetalMachineName> \
62+
--resource-group <CLUSTER_MRG> \
63+
--subscription <SUBSCRIPTION_ID>
6264
```
6365

64-
## Restart the BMM
66+
## Restart a BMM
6567

6668
This command will `restart` the specified `bareMetalMachineName`.
6769

6870
```azurecli
6971
az networkcloud baremetalmachine restart \
70-
--name "bareMetalMachineName" \
71-
--resource-group "cluster_MRG"
72+
--name <BareMetalMachineName> \
73+
--resource-group <CLUSTER_MRG> \
74+
--subscription <SUBSCRIPTION_ID>
7275
```
7376

7477
## Make a BMM unschedulable (cordon)
78+
<!--(PLACEHOLDER: We need to explain how a customer can identify if workloads are currently running on a BMM and the az cli command used to get this information. Ask NAKS team to provide.)-->
7579

7680
You can make a BMM unschedulable by executing the [`cordon`](#make-a-bmm-unschedulable-cordon) command.
7781
When `cordon` is set,
7882
Operator Nexus workloads aren't scheduled on the BMM; any attempt to create a workload on a `cordoned`
7983
BMM results in the workload being set to `pending` state. Existing workloads continue to run.
8084
The cordon command supports an `evacuate` parameter with the default `False` value.
81-
On executing the `cordon` command, with the value `True` for the `evacuate`
85+
It is a best practice to set this to `True`. On executing the `cordon` command, with the value `True` for the `evacuate`
8286
parameter, the workloads that are running on the BMM are `stopped` and the BMM is set to `pending` state.
8387

8488
```azurecli
8589
az networkcloud baremetalmachine cordon \
8690
--evacuate "True" \
87-
--name "bareMetalMachineName" \
88-
--resource-group "cluster_MRG"
91+
--name <BareMetalMachineName> \
92+
--resource-group <CLUSTER_MRG> \
93+
--subscription <SUBSCRIPTION_ID>
8994
```
9095

9196
The `evacuate "True"` removes workloads from that node while `evacuate "False"` only prevents the scheduling of new workloads.
@@ -97,8 +102,9 @@ state on the BMM are `restarted` when the BMM is `uncordoned`.
97102

98103
```azurecli
99104
az networkcloud baremetalmachine uncordon \
100-
--name "bareMetalMachineName" \
101-
--resource-group "cluster_MRG"
105+
--name <BareMetalMachineName> \
106+
--resource-group <CLUSTER_MRG> \
107+
--subscription <SUBSCRIPTION_ID>
102108
```
103109

104110
## Reimage a BMM
@@ -114,11 +120,12 @@ command, with `evacuate "True"`, before executing the `reimage` command.
114120
115121
```azurecli
116122
az networkcloud baremetalmachine reimage \
117-
–-name "bareMetalMachineName" \
118-
--resource-group "cluster_MRG"
123+
--name <BareMetalMachineName> \
124+
--resource-group <CLUSTER_MRG> \
125+
--subscription <SUBSCRIPTION_ID>
119126
```
120127

121-
## Replace BMM
128+
## Replace a BMM
122129

123130
Use the `replace` command when a server encounters hardware issues requiring a complete or partial hardware replacement. After components such as the motherboard or Network Interface Card (NIC) are replaced, the MAC address of the BMM changes; however, the iDRAC IP address and hostname remain the same.
124131

@@ -129,11 +136,12 @@ Use the `replace` command when a server encounters hardware issues requiring a c
129136
130137
```azurecli
131138
az networkcloud baremetalmachine replace \
132-
--name "bareMetalMachineName" \
133-
--resource-group "cluster_MRG" \
134-
--bmc-credentials password="{password}" username="{user}" \
135-
--bmc-mac-address "00:00:4f:00:57:ad" \
136-
--boot-mac-address "00:00:4e:00:58:af" \
137-
--machine-name "OS_hostname" \
138-
--serial-number "BM1219XXX"
139+
--name <BareMetalMachineName> \
140+
--resource-group <CLUSTER_MRG> \
141+
--bmc-credentials password=<IDRAC_PASSWORD> username=<IDRAC_USER> \
142+
--bmc-mac-address <IDRAC_MAC> \
143+
--boot-mac-address <PXE_MAC> \
144+
--machine-name <OS_HOSTNAME> \
145+
--serial-number <SERIAL_NUMBER> \
146+
--subscription <SUBSCRIPTION_ID>
139147
```
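
A mistyped MAC address is an easy mistake to make when filling in `--bmc-mac-address` and `--boot-mac-address` for the `replace` command. The sketch below shows one possible pre-flight check. The function name `is_mac` is hypothetical, and the accepted format (six colon-separated hexadecimal octets, as in the sample value `00:00:4f:00:57:ad` from the earlier version of this command) is an assumption, not a statement of what the service itself validates.

```bash
#!/usr/bin/env bash
# Sketch: reject obviously malformed MAC addresses before calling `replace`.
# Assumption: values use six colon-separated hex octets (aa:bb:cc:dd:ee:ff).

is_mac() {
  # $1 = candidate MAC address; returns 0 when it matches the assumed format.
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}
```

For example, `is_mac "$IDRAC_MAC" && is_mac "$PXE_MAC" || echo "check the MAC addresses" >&2` could run before the `replace` command is issued.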

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 98 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -22,25 +22,28 @@ The time required to complete each of these actions is similar. Restarting is th
2222

2323
- Familiarize yourself with the capabilities referenced in this article by reviewing the [BMM actions](howto-baremetal-functions.md).
2424
- Gather the following information:
25-
- Name of the resource group for the BMM
25+
- Name of the managed resource group for the BMM
2626
- Name of the BMM that requires a lifecycle management operation
27+
- Subscription ID
2728

2829
> [!IMPORTANT]
2930
> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available.
3031
>
3132
> Restart, reimage and replace are all considered disruptive actions.
3233
>
33-
> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
34+
> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes do not go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
3435
3536
## Identify the corrective action
3637

37-
When you're troubleshooting a BMM for failures and determining the best corrective action, it's important to understand the available options. Restarting or reimaging a BMM can be an efficient and effective way to fix problems or restore the software to a known-good place. Replacing a BMM might be required when one or more hardware components fail on the server. This article provides direction on the best practices for each of the three actions.
38+
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. Restarting or reimaging a BMM can be both efficient and effective for resolving issues or restoring the software to a known-good state. In cases where one or more hardware components fail on the server, it may be necessary to replace the BMM entirely. This article outlines the best practices for each of these three actions.
3839

39-
Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and works your way up to more complex and drastic measures, if necessary.
40+
Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and work your way up to more complex and drastic measures, if necessary.
4041

41-
The first step in troubleshooting is often to try restarting the device or system. Restarting can help to clear any temporary glitches or errors that might be causing the problem. If restarting doesn't solve the problem, the next step might be to try reimaging the device or system.
42+
The first step in troubleshooting is to try restarting the device or system. Restarting can help to clear up any temporary glitches or errors that might be causing the problem.
4243

43-
If reimaging doesn't solve the problem, the final step might be to replace the faulty hardware component. Replacement can be a more drastic measure, but it might be necessary if the problem is related to a hardware malfunction.
44+
If restarting does not solve the problem, the next step is to try reimaging the device or system.
45+
46+
If reimaging does not solve the problem, the final step is to replace the faulty hardware component. While replacement is a more significant measure, it may be required if the issue stems from a hardware defect.
4447

4548
Keep in mind that these troubleshooting methods might not always be effective, and other factors in play might require a different approach.
4649

@@ -50,6 +53,31 @@ Restarting a BMM is a process of restarting the server through a simple API call
5053

5154
The restart typically is the starting point for mitigating a problem.
5255

56+
***The following Azure CLI command will `power-off` the specified bareMetalMachineName.***
57+
```
58+
az networkcloud baremetalmachine power-off \
59+
--name <bareMetalMachineName> \
60+
--resource-group <CLUSTER_MRG> \
61+
--subscription <SUBSCRIPTION_ID>
62+
```
63+
64+
***The following Azure CLI command will `start` the specified bareMetalMachineName.***
65+
```
66+
az networkcloud baremetalmachine start \
67+
--name <bareMetalMachineName> \
68+
--resource-group <CLUSTER_MRG> \
69+
--subscription <SUBSCRIPTION_ID>
70+
```
71+
72+
***The following Azure CLI command will `restart` the specified bareMetalMachineName.***
73+
```
74+
az networkcloud baremetalmachine restart \
75+
--name <bareMetalMachineName> \
76+
--resource-group <CLUSTER_MRG> \
77+
--subscription <SUBSCRIPTION_ID>
78+
```
79+
80+
5381
## Troubleshoot with a reimage action
5482

5583
Reimaging a BMM is a process that you use to redeploy the image on the OS disk, without affecting the tenant data. This action executes the steps to rejoin the cluster with the same identifiers.
@@ -58,9 +86,37 @@ The reimage action can be useful for troubleshooting problems by restoring the O
5886

5987
A reimage action is the best practice for minimizing operational risk while ensuring the integrity of the BMM.
6088

89+
As a best practice, make sure the BMM's workloads are drained using the `cordon` command, with `evacuate "True"`, before executing the `reimage` command.
90+
<!--(PLACEHOLDER: We need to explain how a customer can identify if workloads are currently running on a BMM and the az cli command used to get this information. Ask NAKS team to provide.) -->
91+
92+
***The following Azure CLI command will `cordon` the specified bareMetalMachineName.***
93+
```
94+
az networkcloud baremetalmachine cordon \
95+
--evacuate "True" \
96+
--name <bareMetalMachineName> \
97+
--resource-group <CLUSTER_MRG> \
98+
--subscription <SUBSCRIPTION_ID>
99+
```
100+
101+
***The following Azure CLI command will `reimage` the specified bareMetalMachineName.***
102+
```
103+
az networkcloud baremetalmachine reimage \
104+
--name <bareMetalMachineName> \
105+
--resource-group <CLUSTER_MRG> \
106+
--subscription <SUBSCRIPTION_ID>
107+
```
108+
109+
***The following Azure CLI command will `uncordon` the specified bareMetalMachineName.***
110+
```
111+
az networkcloud baremetalmachine uncordon \
112+
--name <bareMetalMachineName> \
113+
--resource-group <CLUSTER_MRG> \
114+
--subscription <SUBSCRIPTION_ID>
115+
```
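
The three commands above (cordon with evacuate, reimage, uncordon) form one sequence, which could be wrapped in a small script. This is only a sketch under stated assumptions: `run_step` and `reimage_bmm` are hypothetical helper names, the script defaults to a dry run that prints each command instead of executing it, and it assumes each `az` call blocks until the operation finishes (the default when `--no-wait` isn't passed).

```bash
#!/usr/bin/env bash
# Sketch: cordon (with evacuate), reimage, then uncordon a BMM, in that order.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.

run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reimage_bmm() {
  # $1 = BMM name, $2 = managed resource group, $3 = subscription ID
  run_step az networkcloud baremetalmachine cordon --evacuate "True" \
    --name "$1" --resource-group "$2" --subscription "$3" || return 1
  run_step az networkcloud baremetalmachine reimage \
    --name "$1" --resource-group "$2" --subscription "$3" || return 1
  run_step az networkcloud baremetalmachine uncordon \
    --name "$1" --resource-group "$2" --subscription "$3"
}
```

Stopping at the first failed step means a BMM whose reimage fails stays cordoned, so new workloads aren't scheduled on it while the failure is investigated.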
116+
61117
## Troubleshoot with a replace action
62118

63-
Servers contain many physical components that can fail over time. It's important to understand which physical repairs require BMM replacement and when BMM replacement is recommended but not required.
119+
Servers contain many physical components that can fail over time. It is important to understand which physical repairs require BMM replacement and when BMM replacement is recommended.
64120

65121
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.
66122

@@ -69,9 +125,18 @@ A hardware validation process is invoked to ensure the integrity of the physical
69125
70126
As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
71127

72-
When you're performing a physical hot swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.
128+
***The following Azure CLI command will `cordon` the specified bareMetalMachineName.***
129+
```
130+
az networkcloud baremetalmachine cordon \
131+
--evacuate "True" \
132+
--name <bareMetalMachineName> \
133+
--resource-group <CLUSTER_MRG> \
134+
--subscription <SUBSCRIPTION_ID>
135+
```
136+
137+
When you're performing a physical repair of a hot-swappable power supply, a replace action is not required because the BMM host will continue to function normally after the repair.
73138

74-
When you're performing the following physical repairs, we recommend a replace action, though it isn't necessary to bring the BMM back into service:
139+
When you're performing the following physical repairs, we recommend a replace action, though it is not necessary to bring the BMM back into service:
75140

76141
- CPU
77142
- Dual In-Line Memory Module (DIMM)
@@ -80,7 +145,7 @@ When you're performing the following physical repairs, we recommend a replace ac
80145
- Transceiver
81146
- Ethernet or fiber cable replacement
82147

83-
When you're performing the following physical repairs, a replace action is required to bring the BMM back into service:
148+
When you're performing the following physical repairs, a replace action ***is required*** to bring the BMM back into service:
84149

85150
- Backplane
86151
- System board
@@ -89,6 +154,29 @@ When you're performing the following physical repairs, a replace action is requi
89154
- Mellanox Network Interface Card (NIC)
90155
- Broadcom embedded NIC
91156

157+
After physical repairs are completed, perform a replace action.
158+
159+
***The following Azure CLI command will `replace` the specified bareMetalMachineName.***
160+
```
161+
az networkcloud baremetalmachine replace \
162+
--name <bareMetalMachineName> \
163+
--resource-group <CLUSTER_MRG> \
164+
--bmc-credentials password=<IDRAC_PASSWORD> username=<IDRAC_USER> \
165+
--bmc-mac-address <IDRAC_MAC> \
166+
--boot-mac-address <PXE_MAC> \
167+
--machine-name <OS_HOSTNAME> \
168+
--serial-number <SERIAL_NUMBER> \
169+
--subscription <SUBSCRIPTION_ID>
170+
```
171+
172+
***The following Azure CLI command will `uncordon` the specified bareMetalMachineName.***
173+
```
174+
az networkcloud baremetalmachine uncordon \
175+
--name <bareMetalMachineName> \
176+
--resource-group <CLUSTER_MRG> \
177+
--subscription <SUBSCRIPTION_ID>
178+
```
179+
92180
## Summary
93181

94182
Restarting, reimaging, and replacing are effective troubleshooting methods that you can use to address technical problems. However, it's important to have a systematic approach and to consider other factors before you try any drastic measures.
