articles/operator-nexus/howto-baremetal-best-practices.md
Lines changed: 21 additions & 21 deletions
@@ -27,14 +27,14 @@ The aim is to highlight common pitfalls and essential prerequisites.
 
 Troubleshooting technical problems requires a systematic approach.
 One effective method is to start with the least invasive solution and, if necessary, work your way up to more complex and drastic measures.
-Keep in mind that these troubleshooting methods might not always be effective for all scenarios and accounting for various other factors may require a different approach.
-For this reason, it is essential to understand the available options well when troubleshooting a BMM for failures to determine the most appropriate corrective action.
+Keep in mind that these troubleshooting methods might not always be effective for all scenarios and accounting for various other factors might require a different approach.
+For this reason, it's essential to understand the available options well when troubleshooting a BMM for failures to determine the most appropriate corrective action.
 
 ### General Advice while Troubleshooting
 
 - Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles.
 Always refer to the latest documentation to stay informed about best practices and updates.
-- Before retrying operations, attempt to identify the root cause of the failure to avoid repeating the same mistake.
+- Attempt to identify the root cause of the failure to avoid repeating the same mistake.
 Perform retry attempts in incremental steps to isolate and address specific issues.
 - Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps.
 - Keep an eye on system logs to detect any anomalies during the retry process.
@@ -64,14 +64,14 @@ The `reimage` action doesn't affect the tenant workload files on the BMM under n
 Before initiating any `reimage` operation, ensure the following preconditions are met:
 
 - Ensure the BMM is in `poweredState` set to `On` and `readyState` set to `True`.
-- Make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon) command with the paramater`evacuate` set to `True`.
+- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter`evacuate` set to `True`.
 - Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
-- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems prior to a `replace` operation.
-See the articles [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status] for more details.
+- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
+For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 - Ensure to resolve any BMM hardware validation failures.
 Hardware failures are an expected occurrence over the natural lifecycle of a server.
-Component replacements may be necessary to restore functionality and ensure continued operation.
+Component replacements might be necessary to restore functionality and ensure continued operation.
 In cases where one or more hardware components fail on the server, it's necessary to perform a BMM `replace` operation.
 The `replace` operation should be executed after any hardware maintenance event. Multiple maintenance events should be done as multiple `replace` operations.
 
 > [!IMPORTANT]
-> With the `2024-07-01` GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks.
-> Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are more physical disk and/or RAID controllers alerts.
+> With the `2024-07-01` GA API version, the RAID controller is reset during BMM `replace`, wiping all data from the server's virtual disks.
+> Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM `replace` can be ignored unless there are more physical disk and/or RAID controllers alerts.
 
 ### Resolve Hardware Validation Issues
 
-When a BMM is marked with failed hardware validation, it indicates that physical repairs are needed. It is crucial to identify and address these repairs before performing a BMM `replace`.
-A hardware validation process is invoked, as part of the `replace` operation, to ensure the physical host's integrity prior to deploying the OS image.
-If the BMM continues to have hardware validation failures, the BMM will not provision successfully, meaning it will fail to complete the necessary setup steps to become operational, and will not join the cluster.
-Ensure **all hardware validation issues** are cleared prior to the next `replace` action.
+When a BMM is marked with failed hardware validation, it indicates that physical repairs are needed. It's crucial to identify and address these repairs before performing a BMM `replace`.
+A hardware validation process is invoked, as part of the `replace` operation, to ensure the physical host's integrity before deploying the OS image.
+If the BMM continues to have hardware validation failures, then the BMM won't provision successfully meaning it fails to complete the necessary setup steps to become operational and won't join the cluster.
+Ensure **all hardware validation issues** are cleared before the next `replace` action.
 
-Read through the [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md) article to understand hardware validation results.
+To understand hardware validation result, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
 
 ### Preconditions and Validations Before BMM Replace
 
 Before initiating any `replace` operation, ensure the following preconditions are met:
 
-- Ensure the BMM is in `poweredState` set to `On` and `readyState` set to `True`.
-- Make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon) command with the paramater`evacuate` set to `True`.
+- Ensure the BMM `poweredState`is set to `On` and the `readyState` is set to `True`.
+- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter`evacuate` set to `True`.
 - Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
-- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems prior to a `replace` operation.
-See the articles [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status] for more details.
+- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
+For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 - Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation.
-The BMM will have`provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
+The BMM has`provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
 
 ### BMM Replace isn't Required
 
@@ -135,7 +135,7 @@ A `replace` operation **is required** to bring the BMM back into service when yo
 - Mellanox Network Interface Card (NIC)
 - Broadcom embedded NIC
 
-After replacing components such as motherboard or Network Interface Card (NIC), the MAC address of BMM will change; however, the iDRAC IP address and hostname will remain the same.
+After components such as motherboard or Network Interface Card (NIC) are replaced, the MAC address of BMM will change; however, the iDRAC IP address and hostname will remain the same.
 Motherboard changes result in MAC address changes, requiring a BMM `replace`.
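
The state-based preconditions that the changed article lists before a `reimage` or `replace` (`poweredState` set to `On`, `readyState` set to `True`, and no `provisioningStatus` of `Preparing`) could be checked with a small script before the operation is started. The following is a minimal sketch, not part of the commit. It assumes the BMM resource has already been fetched as a JSON-like dictionary and that the three fields appear as top-level keys under the names used in the article. The function name `replace_blockers` is hypothetical, and the real resource shape might differ.

```python
from typing import Any, Dict, List


def replace_blockers(bmm: Dict[str, Any]) -> List[str]:
    """Return reasons why a BMM does not meet the state-based preconditions.

    Assumption: `bmm` is a flat dictionary whose keys match the field names
    used in the article (poweredState, readyState, provisioningStatus).
    """
    blockers: List[str] = []

    # The article requires poweredState to be On.
    if bmm.get("poweredState") != "On":
        blockers.append("poweredState is not On")

    # The article requires readyState to be True. The value is compared as
    # text so that both the boolean True and the string "True" are accepted.
    if str(bmm.get("readyState")) != "True":
        blockers.append("readyState is not True")

    # The article associates provisioningStatus Preparing with a firmware
    # upgrade job that should not be interrupted.
    if bmm.get("provisioningStatus") == "Preparing":
        blockers.append("provisioningStatus is Preparing")

    return blockers


if __name__ == "__main__":
    # Hypothetical state for illustration only.
    example = {
        "poweredState": "On",
        "readyState": "True",
        "provisioningStatus": "Preparing",
    }
    for reason in replace_blockers(example):
        print(reason)
```

An empty list means only that these three state checks passed. The other preconditions in the article (workloads drained with `cordon`, warnings and degraded conditions reviewed, hardware validation failures resolved) still need to be confirmed separately.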