Commit 2d556b1

final feedback comments addressed
1 parent 95c5127 commit 2d556b1

2 files changed

+114
-26
lines changed

articles/operator-nexus/howto-baremetal-best-practices.md

Lines changed: 109 additions & 25 deletions
Original file line number | Diff line number | Diff line change
@@ -39,9 +39,86 @@ For this reason, it's essential to understand the available options well when tr
3939
- Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps.
4040
- Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions.
4141
For more information about firmware compatibility, see [Operator Nexus Platform Prerequisites](./howto-platform-prerequisites.md).
42-
- Ensure stable network connectivity to avoid interruptions during the process.
43-
Validate that there are no active network stability issues with the network fabric.
44-
Ignoring network stability could make operations fail to complete successfully and leave a BMM in an unknown state.
42+
- Check that the iDRAC credentials are correct and that the BMM is powered on.
43+
44+
#### Look at General Network Connectivity Health
45+
46+
Ensure stable network connectivity to avoid interruptions during the process.
47+
Ignoring network stability could make operations fail to complete successfully and leave a BMM in an error or degraded state.
48+
49+
A quick look at the Cluster resource's `clusterConnectionStatus` serves as one indicator of network connectivity health.
50+
51+
```azurecli
52+
az networkcloud cluster show \
53+
-g $CLUSTER_MRG \
54+
-n $CLUSTER_NAME \
55+
--subscription $SUBSCRIPTION \
56+
--query "clusterConnectionStatus" \
57+
-o table
58+
59+
Result
60+
---------
61+
Connected
62+
```
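
The connectivity check above can gate later steps in a script. The following is a minimal sketch, not an official procedure: `require_connected` is a hypothetical helper name, and `$CLUSTER_NAME` is an assumed variable holding the Cluster resource name. It relies only on the `Connected` value shown in the sample output and treats every other value as unhealthy.

```shell
# Hypothetical helper: succeeds only when the supplied status is "Connected".
require_connected() {
  local status="$1"
  if [ "$status" = "Connected" ]; then
    echo "Cluster connection is healthy: $status"
    return 0
  fi
  echo "Cluster connection is not healthy: ${status:-<empty>}" >&2
  return 1
}

# Example usage (assumes CLUSTER_NAME is set alongside the other variables).
# The -o tsv output format returns the bare value without table headers:
# status=$(az networkcloud cluster show \
#   -g "$CLUSTER_MRG" -n "$CLUSTER_NAME" --subscription "$SUBSCRIPTION" \
#   --query "clusterConnectionStatus" -o tsv)
# require_connected "$status" || exit 1
```

Keeping the comparison in a small function separates the Azure CLI call from the decision, so the gate can be reused before a `reimage` or `replace` step.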
63+
64+
Take a deeper look at the NetworkFabric resources by checking their statuses, alerts, and metrics.
65+
See related articles:
66+
- [How to monitor interface In and Out packet rate for network fabric devices]
67+
- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric].
68+
69+
Check for any BMM warnings or degraded conditions, which could indicate the need to resolve hardware, network, or server configuration problems.
70+
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
71+
72+
#### Determine if Firmware Update Jobs are Running
73+
74+
Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` or `reimage` operation.
75+
Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
76+
You can view the `jobqueue` in the iDRAC GUI or run `racadm jobqueue view` to determine whether any firmware upgrade jobs are running.
77+
78+
```azurecli
79+
az networkcloud baremetalmachine run-read-command \
80+
-g $CLUSTER_MRG \
81+
-n $BMM_NAME \
82+
--subscription $SUBSCRIPTION \
83+
--limit-time-seconds 60 \
84+
--commands "[{command:'nc-toolbox nc-toolbox-runread racadm jobqueue view'}]" \
85+
--output-directory .
86+
```
87+
88+
Here's example output from the `racadm jobqueue view` command that shows a `Firmware Update` job in progress.
89+
```
90+
[Job ID=JID_833540920066]
91+
Job Name=Firmware Update: iDRAC
92+
Status=Downloading
93+
Start Time= [Not Applicable]
94+
Expiration Time= [Not Applicable]
95+
Message= [RED001: Job in progress.]
96+
Percent Complete= [50%]
97+
```
98+
99+
Here's example output from the `racadm jobqueue view` command that shows jobs that completed successfully.
100+
```
101+
-------------------------JOB QUEUE------------------------
102+
[Job ID=JID_429400224349]
103+
Job Name=Configure: Import Server Configuration Profile
104+
Status=Completed
105+
Scheduled Start Time=[Not Applicable]
106+
Expiration Time=[Not Applicable]
107+
Actual Start Time=[Tue, 25 Mar 2025 17:00:22]
108+
Actual Completion Time=[Tue, 25 Mar 2025 17:00:32]
109+
Message=[SYS053: Successfully imported and applied Server Configuration Profile.]
110+
Percent Complete=[100]
111+
----------------------------------------------------------
112+
[Job ID=JID_429400338344]
113+
Job Name=Export: Server Configuration Profile
114+
Status=Completed
115+
Scheduled Start Time=[Not Applicable]
116+
Expiration Time=[Not Applicable]
117+
Actual Start Time=[Tue, 25 Mar 2025 17:00:33]
118+
Actual Completion Time=[Tue, 25 Mar 2025 17:00:58]
119+
Message=[SYS043: Successfully exported Server Configuration Profile]
120+
Percent Complete=[100]
121+
```
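
The job queue output can also be scanned in a script instead of by eye. The following is a minimal sketch under stated assumptions: `list_active_firmware_jobs` is a hypothetical helper, the input is the plain text of `racadm jobqueue view` in the format shown in the samples above, and any `Firmware Update` job whose status isn't `Completed` is treated as still active. Other status values are not covered by the samples, so that rule is an assumption rather than documented iDRAC behavior.

```shell
# Hypothetical helper: reads `racadm jobqueue view` text on stdin and prints
# each Firmware Update job that is not in the Completed status.
# Exits 0 when at least one such job is found, 1 otherwise.
list_active_firmware_jobs() {
  awk '
    { sub(/\r$/, "") }                      # tolerate CRLF line endings
    /^\[Job ID=/ {
      # Strip the leading "[Job ID=" (8 chars) and the trailing "]".
      id = substr($0, 9, length($0) - 9)
    }
    /^Job Name=/ { name = substr($0, 10) }  # text after "Job Name="
    /^Status=/ {
      status = substr($0, 8)                # text after "Status="
      # Assumption: only "Completed" counts as finished.
      if (name ~ /^Firmware Update/ && status != "Completed") {
        print id ": " name " (" status ")"
        found = 1
      }
    }
    END { exit !found }
  '
}

# Example usage, where jobqueue.txt is a hypothetical file holding saved output:
# if list_active_firmware_jobs < jobqueue.txt; then
#   echo "Firmware update in progress; postpone reimage or replace." >&2
# fi
```

Applied to the first sample above, the function prints `JID_833540920066: Firmware Update: iDRAC (Downloading)`. Applied to the second sample, it prints nothing and returns a nonzero exit code.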
45122

46123
## Best Practices for a BMM Reimage
47124

@@ -62,14 +139,15 @@ The `reimage` action is designed to interact with the operating system partition
62139

63140
Before initiating any `reimage` operation, ensure the following preconditions are met:
64141

65-
- Ensure the BMM is in `poweredState` set to `On` and `readyState` set to `True`.
66142
- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
67143
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
68144
- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
69145
For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
146+
If hardware issues are present on the BMM, resolve them and perform a `replace` instead of a `reimage`.
147+
See the [Best Practices for a BMM Replace](#best-practices-for-a-bmm-replace).
70148
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `reimage` operation.
71149
Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
72-
Confirm the BMM resource's `detailedStatus` isn't in the `Preparing` state.
150+
Follow the steps in the section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running).
73151

74152
## Best Practices for a BMM Replace
75153

@@ -86,37 +164,40 @@ When one or more hardware components fail on the server (multiple failures), mak
86164
> With the `2024-07-01` GA API version, the RAID controller is reset during BMM `replace`, wiping all data from the server's virtual disks.
87165
> Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM `replace` can be ignored unless there are more physical disk and/or RAID controllers alerts.
88166
89-
### Resolve Hardware Validation Issues
90-
91-
When a BMM is marked with failed hardware validation, it might indicate that physical repairs are needed.
92-
It's crucial to identify and address these repairs before performing a BMM `replace`.
93-
A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image.
94-
The BMM can't provision successfully when the BMM continues to have hardware validation failures.
95-
As a result, the BMM fails to complete the necessary setup steps to become operational and join the cluster.
96-
Ensure **all hardware validation issues** are cleared before the next `replace` action.
97-
98-
To understand hardware validation result, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
99-
100167
### Preconditions and Validations Before a BMM Replace
101168

102169
Before initiating any `replace` operation, ensure the following preconditions are met:
103170

104-
- Ensure the BMM `poweredState` is set to `On` and the `readyState` is set to `True`.
105171
- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
106172
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
107173
- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
108174
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
109175
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation.
110176
Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
111-
Confirm the BMM resource's `detailedStatus` isn't in the `Preparing` state.
177+
Follow the steps in the section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running).
112178

113-
### BMM Replace isn't Required
179+
### Resolve Hardware Validation Issues
114180

115-
A `replace` operation isn't required when you're performing a physical hot swappable power supply repair because the BMM host will continue to function normally after the repair.
181+
When a BMM is marked with failed hardware validation, it might indicate that physical repairs are needed.
182+
It's crucial to identify and address these repairs before performing a BMM `replace`.
183+
A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image.
184+
The BMM can't provision successfully when the BMM continues to have hardware validation failures.
185+
As a result, the BMM fails to complete the necessary setup steps to become operational and join the cluster.
186+
Ensure **all hardware validation issues** are cleared before the next `replace` action.
116187

117-
### BMM Replace is Optional but Recommended
188+
To understand the hardware validation results, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
118189

119-
While not strictly necessary to bring the BMM back into service, we recommend doing a `replace` operation when you're performing the following physical repairs:
190+
### BMM Replace isn't Required
191+
192+
Some repairs don't require a BMM `replace` to be executed.
193+
For example, a `replace` operation isn't required when you're repairing a hot-swappable power supply, because the BMM host continues to function normally after the repair.
194+
However, if the BMM failed hardware validation, a BMM `replace` is required even after the hot-swappable repair is done.
195+
Examine the BMM status messages to determine if hardware validation failures or other degraded conditions are present.
196+
- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
197+
- [Troubleshoot Bare Metal Machine Warning Status]
198+
- [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
199+
200+
Other repairs of this type might be:
120201

121202
- CPU
122203
- Dual In-Line Memory Module (DIMM)
@@ -127,6 +208,10 @@ While not strictly necessary to bring the BMM back into service, we recommend do
127208

128209
### BMM Replace is Required
129210

211+
After components such as the motherboard or a Network Interface Card (NIC) are replaced, the BMM MAC address changes.
212+
However, the iDRAC IP address and hostname remain the same.
213+
Motherboard changes result in MAC address changes, requiring a BMM `replace`.
214+
130215
A `replace` operation **is required** to bring the BMM back into service when you're performing the following physical repairs:
131216

132217
- Backplane
@@ -136,9 +221,6 @@ A `replace` operation **is required** to bring the BMM back into service when yo
136221
- Mellanox Network Interface Card (NIC)
137222
- Broadcom embedded NIC
138223

139-
After components such as motherboard or Network Interface Card (NIC) are replaced, the MAC address of BMM will change; however, the iDRAC IP address and hostname will remain the same.
140-
Motherboard changes result in MAC address changes, requiring a BMM `replace`.
141-
142224
### After BMM Replace
143225

144226
After the BMM `replace` operation completes successfully, ensure that the `provisioningStatus` is `Succeeded` and the `readyState` is `True`.
@@ -166,3 +248,5 @@ For more information about Support plans, see [Azure Support plans](https://azur
166248
[Troubleshoot Bare Metal Machine Warning Status]: ./troubleshoot-bare-metal-machine-warning.md
167249
[Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md
168250
[Troubleshoot Hardware Validation Failure]: ./troubleshoot-hardware-validation-failure.md
251+
[How to monitor interface In and Out packet rate for network fabric devices]: ./howto-monitor-interface-packet-rate.md
252+
[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]: ./howto-configure-diagnostic-settings-monitor-configuration-differences.md

articles/operator-nexus/howto-baremetal-functions.md

Lines changed: 5 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -77,7 +77,11 @@ Existing workloads continue to run on the BMM unless the workloads are drained.
7777
### Drain Workloads from the BMM
7878

7979
The `cordon` command supports the `evacuate` parameter. Its default value, `False`, means that the `cordon` command only prevents new workloads from being scheduled.
80-
To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`. The workloads running on the BMM are `stopped` and the BMM is set to `pending` state.
80+
To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`.
81+
The workloads running on the BMM are `stopped` and the BMM is set to the `pending` state.
82+
83+
> [!NOTE]
84+
> Nexus Management Workloads will continue to run on the BMM even when the server has been cordoned and evacuated.
8185
8286
It's a best practice to set the `evacuate` value to `True` when attempting to do any maintenance operations on the BMM server.
8387
For more best practices to follow, read through [Best Practices for BareMetal Machine Operations](./howto-baremetal-best-practices.md).
