@@ -39,9 +39,86 @@ For this reason, it's essential to understand the available options well when tr
39
39
- Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps.
40
40
- Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions.
41
41
For more information about firmware compatibility, see [Operator Nexus Platform Prerequisites](./howto-platform-prerequisites.md).
42
-
- Ensure stable network connectivity to avoid interruptions during the process.
43
-
Validate that there are no active network stability issues with the network fabric.
44
-
Ignoring network stability could make operations fail to complete successfully and leave a BMM in an unknown state.
42
+
- Check that the iDRAC credentials are correct and that the BMM is powered on.
43
+
44
+
#### Look at General Network Connectivity Health
45
+
46
+
Ensure stable network connectivity to avoid interruptions during the process.
47
+
Ignoring network stability could make operations fail to complete successfully and leave a BMM in an error or degraded state.
48
+
49
+
A quick look at the Cluster resource's `clusterConnectionStatus` serves as one indicator of network connectivity health.
50
+
51
+
```azurecli
52
+
az networkcloud cluster show \
53
+
-g $CLUSTER_MRG \
54
+
-n $CLUSTER_NAME \
55
+
--subscription $SUBSCRIPTION \
56
+
--query "clusterConnectionStatus" \
57
+
-o table
58
+
59
+
Result
60
+
---------
61
+
Connected
62
+
```
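The connectivity check above can be turned into a simple gate before running BMM actions. The following is a minimal sketch, not part of the original article: the function names `is_cluster_connected` and `check_cluster_connection` are hypothetical, `$CLUSTER_NAME` is an assumed variable holding the Cluster resource name, and `-o tsv` is used instead of `-o table` so that only the bare status value is returned.

```bash
# Pure helper: succeeds only when the reported status is exactly "Connected".
is_cluster_connected() {
  [ "$1" = "Connected" ]
}

# Hypothetical wrapper: fetches clusterConnectionStatus and reports the result.
# Assumes CLUSTER_MRG, CLUSTER_NAME, and SUBSCRIPTION are already set.
check_cluster_connection() {
  local status
  status=$(az networkcloud cluster show \
    -g "$CLUSTER_MRG" \
    -n "$CLUSTER_NAME" \
    --subscription "$SUBSCRIPTION" \
    --query "clusterConnectionStatus" \
    -o tsv)
  if is_cluster_connected "$status"; then
    echo "Cluster is connected; continue with the BMM checks."
  else
    echo "Cluster status is '${status}'; investigate before running BMM actions." >&2
    return 1
  fi
}
```

A `Connected` status is only one indicator, so a passing check doesn't replace the NetworkFabric review.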
63
+
64
+
Take a deeper look at the NetworkFabric resources by checking their statuses, alerts, and metrics.
65
+
See related articles:
66
+
- [How to monitor interface In and Out packet rate for network fabric devices]
67
+
- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]
68
+
69
+
Evaluate any BMM warnings or degraded conditions, which could indicate the need to resolve hardware, network, or server configuration problems.
70
+
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
71
+
72
+
#### Determine if Firmware Update Jobs are Running
73
+
74
+
Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` or `reimage` operation.
75
+
Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
76
+
You can view the `jobqueue` in the iDRAC GUI or run `racadm jobqueue view` to determine whether any firmware upgrade jobs are running.
77
+
78
+
```azurecli
79
+
az networkcloud baremetalmachine run-read-command \
Actual Completion Time=[Tue, 25 Mar 2025 17:00:58]
119
+
Message=[SYS043: Successfully exported Server Configuration Profile]
120
+
Percent Complete=[100]
121
+
```
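The job queue output shown above reports a `Percent Complete` value for each job. As a rough aid, the sketch below counts the jobs in saved output that haven't reached 100 percent. It isn't part of the original article: the helper name `count_incomplete_jobs` is hypothetical, and it assumes each job prints a line in the form `Percent Complete=[N]`, as in the sample output.

```bash
# Reads job queue output on stdin and prints how many jobs report a
# "Percent Complete" value below 100. Only lines containing
# "Percent Complete=" are inspected; all other lines are ignored.
count_incomplete_jobs() {
  awk '/Percent Complete=/ {
         v = $0
         gsub(/[^0-9]/, "", v)   # keep only the digits of the percentage
         if (v + 0 < 100) n++
       }
       END { print n + 0 }'
}
```

For example, `count_incomplete_jobs < jobqueue.txt` (where `jobqueue.txt` is an assumed file holding the command output) prints `0` when every listed job is at 100 percent. A job at 100 percent can still have failed, so read the `Message` field before you proceed.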
45
122
46
123
## Best Practices for a BMM Reimage
47
124
@@ -62,14 +139,15 @@ The `reimage` action is designed to interact with the operating system partition
62
139
63
140
Before initiating any `reimage` operation, ensure the following preconditions are met:
64
141
65
-
- Ensure the BMM is in `poweredState` set to `On` and `readyState` set to `True`.
66
142
- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
67
143
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
68
144
- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
69
145
For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
146
+
If hardware issues are present on the BMM, they must be resolved first, and the BMM needs a `replace` instead of a `reimage`.
147
+
See the [Best Practices for a BMM Replace](#best-practices-for-a-bmm-replace).
70
148
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `reimage` operation.
71
149
Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
72
-
Confirm the BMM resource's `detailedStatus` isn't in the `Preparing` state.
150
+
Follow steps in section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running).
73
151
74
152
## Best Practices for a BMM Replace
75
153
@@ -86,37 +164,40 @@ When one or more hardware components fail on the server (multiple failures), mak
86
164
> With the `2024-07-01` GA API version, the RAID controller is reset during BMM `replace`, wiping all data from the server's virtual disks.
87
165
> Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM `replace` can be ignored unless there are also physical disk and/or RAID controller alerts.
88
166
89
-
### Resolve Hardware Validation Issues
90
-
91
-
When a BMM is marked with failed hardware validation, it might indicate that physical repairs are needed.
92
-
It's crucial to identify and address these repairs before performing a BMM `replace`.
93
-
A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image.
94
-
The BMM can't provision successfully when the BMM continues to have hardware validation failures.
95
-
As a result, the BMM fails to complete the necessary setup steps to become operational and join the cluster.
96
-
Ensure **all hardware validation issues** are cleared before the next `replace` action.
97
-
98
-
To understand hardware validation result, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
99
-
100
167
### Preconditions and Validations Before a BMM Replace
101
168
102
169
Before initiating any `replace` operation, ensure the following preconditions are met:
103
170
104
-
- Ensure the BMM `poweredState` is set to `On` and the `readyState` is set to `True`.
105
171
- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
106
172
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
107
173
- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
108
174
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
109
175
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation.
110
176
Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
111
-
Confirm the BMM resource's `detailedStatus` isn't in the `Preparing` state.
177
+
Follow steps in section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running).
112
178
113
-
### BMM Replace isn't Required
179
+
### Resolve Hardware Validation Issues
114
180
115
-
A `replace` operation isn't required when you're performing a physical hot swappable power supply repair because the BMM host will continue to function normally after the repair.
181
+
When a BMM is marked with failed hardware validation, it might indicate that physical repairs are needed.
182
+
It's crucial to identify and address these repairs before performing a BMM `replace`.
183
+
A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image.
184
+
The BMM can't provision successfully when the BMM continues to have hardware validation failures.
185
+
As a result, the BMM fails to complete the necessary setup steps to become operational and join the cluster.
186
+
Ensure **all hardware validation issues** are cleared before the next `replace` action.
116
187
117
-
### BMM Replace is Optional but Recommended
188
+
To understand hardware validation results, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
118
189
119
-
While not strictly necessary to bring the BMM back into service, we recommend doing a `replace` operation when you're performing the following physical repairs:
190
+
### BMM Replace isn't Required
191
+
192
+
Some repairs don't require a BMM `replace` to be executed.
193
+
For example, a `replace` operation isn't required when you're performing a physical hot-swappable power supply repair because the BMM host continues to function normally after the repair.
194
+
However, if the BMM failed hardware validation, a BMM `replace` is required even if the hot-swappable repairs are done.
195
+
Examine the BMM status messages to determine if hardware validation failures or other degraded conditions are present.
196
+
- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
@@ -127,6 +208,10 @@ While not strictly necessary to bring the BMM back into service, we recommend do
127
208
128
209
### BMM Replace is Required
129
210
211
+
After components such as motherboard or Network Interface Card (NIC) are replaced, the BMM MAC address changes.
212
+
However, the iDRAC IP address and hostname remain the same.
213
+
Motherboard changes result in MAC address changes, requiring a BMM `replace`.
214
+
130
215
A `replace` operation **is required** to bring the BMM back into service when you're performing the following physical repairs:
131
216
132
217
- Backplane
@@ -136,9 +221,6 @@ A `replace` operation **is required** to bring the BMM back into service when yo
136
221
- Mellanox Network Interface Card (NIC)
137
222
- Broadcom embedded NIC
138
223
139
-
After components such as motherboard or Network Interface Card (NIC) are replaced, the MAC address of BMM will change; however, the iDRAC IP address and hostname will remain the same.
140
-
Motherboard changes result in MAC address changes, requiring a BMM `replace`.
141
-
142
224
### After BMM Replace
143
225
144
226
After the BMM `replace` operation completes successfully, ensure that the `provisioningStatus` is `Succeeded` and the `readyState` is `True`.
@@ -166,3 +248,5 @@ For more information about Support plans, see [Azure Support plans](https://azur
166
248
[Troubleshoot Bare Metal Machine Warning Status]: ./troubleshoot-bare-metal-machine-warning.md
167
249
[Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md
[How to monitor interface In and Out packet rate for network fabric devices]: ./howto-monitor-interface-packet-rate.md
252
+
[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]: ./howto-configure-diagnostic-settings-monitor-configuration-differences.md
articles/operator-nexus/howto-baremetal-functions.md (5 additions and 1 deletion)
@@ -77,7 +77,11 @@ Existing workloads continue to run on the BMM unless the workloads are drained.
77
77
### Drain Workloads from the BMM
78
78
79
79
The `cordon` command supports the `evacuate` parameter. With its default value of `False`, the `cordon` command only prevents scheduling of new workloads.
80
-
To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`. The workloads running on the BMM are `stopped` and the BMM is set to `pending` state.
80
+
To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`.
81
+
The workloads running on the BMM are `stopped` and the BMM is set to `pending` state.
82
+
83
+
> [!NOTE]
84
+
> Nexus Management Workloads will continue to run on the BMM even when the server has been cordoned and evacuated.
81
85
82
86
It's a best practice to set the `evacuate` value to `True` when attempting to do any maintenance operations on the BMM server.
83
87
For more best practices to follow, read through [Best Practices for BareMetal Machine Operations](./howto-baremetal-best-practices.md).
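As an illustration of draining a BMM before maintenance, the sketch below wraps the `cordon` call with `evacuate` set to `True`. It isn't part of the original article: the helper names `run` and `cordon_and_evacuate` and the `DRY_RUN` variable are hypothetical, added so that the command can be previewed before it's executed.

```bash
# Prints the command when DRY_RUN=1; otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Cordons the BMM and drains its workloads.
# Arguments: $1 = BMM name, $2 = resource group, $3 = subscription.
cordon_and_evacuate() {
  run az networkcloud baremetalmachine cordon \
    --evacuate "True" \
    --name "$1" \
    --resource-group "$2" \
    --subscription "$3"
}
```

Running `DRY_RUN=1 cordon_and_evacuate "$BMM_NAME" "$CLUSTER_MRG" "$SUBSCRIPTION"` prints the full command without contacting Azure, which lets you review the target before the workloads are stopped.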