articles/operator-nexus/howto-baremetal-best-practices.md
10 additions & 7 deletions
@@ -34,7 +34,7 @@ For this reason, it's essential to understand the available options well when tr
 
 - Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles.
 Always refer to the latest documentation to stay informed about best practices and updates.
-- Attempt to identify the root cause of the failure to avoid repeating the same mistake.
+- Avoid repeated failed operations by first attempting to identify the root cause of the failure before retrying the operation.
 Perform retry attempts in incremental steps to isolate and address specific issues.
 - Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps.
 - Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions.
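
The advice above to wait for Az CLI commands to finish and then validate the BMM state could be scripted roughly as follows. This is a minimal sketch, not an official tool. It assumes the `networkcloud` Az CLI extension is installed, that `az networkcloud baremetalmachine show` returns JSON with a top-level `provisioningState` field, and that `Succeeded`, `Failed`, and `Canceled` are the terminal values. The helper names `fetch_bmm` and `wait_until_settled` are hypothetical.

```python
import json
import subprocess
import time
from typing import Callable

# Assumed terminal values of the resource's `provisioningState` field.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def fetch_bmm(name: str, resource_group: str) -> dict:
    """Return the BMM resource as a dict by shelling out to the Az CLI."""
    result = subprocess.run(
        ["az", "networkcloud", "baremetalmachine", "show",
         "--name", name, "--resource-group", resource_group,
         "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def wait_until_settled(fetch: Callable[[], dict],
                       timeout_s: float = 1800,
                       interval_s: float = 30,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until provisioningState is terminal, then return the last snapshot."""
    waited = 0.0
    while True:
        bmm = fetch()
        if bmm.get("provisioningState") in TERMINAL_STATES:
            return bmm
        if waited >= timeout_s:
            raise TimeoutError(
                f"BMM still in state {bmm.get('provisioningState')!r}")
        sleep(interval_s)
        waited += interval_s
```

A call such as `wait_until_settled(lambda: fetch_bmm("<bmm-name>", "<resource-group>"))` returns the final snapshot, whose `provisioningState` and `detailedStatus` can then be reviewed before the next step. A `Failed` result is a cue to look for the root cause, not to retry immediately.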
@@ -68,7 +68,8 @@ Before initiating any `reimage` operation, ensure the following preconditions ar
 - Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
 For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 - Validate that there are no running firmware upgrade jobs through the BMC before initiating a `reimage` operation.
-The BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
+Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
+Confirm the BMM resource's `detailedStatus` isn't in the `Preparing` state.
 
 ## Best Practices for a BMM Replace
 
@@ -78,8 +79,8 @@ The BMM `replace` action is explained in [BMM Lifecycle Management Commands] and
 
 Hardware failures are a normal occurrence over the life of a server.
 Component replacements might be necessary to restore functionality and ensure continued operation.
-In cases where one or more hardware components fail on the server, it's necessary to perform a BMM `replace` operation.
-The `replace` operation should be executed after any hardware maintenance event. Multiple maintenance events should be done as multiple `replace` operations.
+The `replace` operation must be executed after any hardware maintenance/repair event.
+When one or more hardware components fail on the server (multiple failures), make the necessary repairs for **all** components before executing a BMM `replace` operation.
 
 > [!IMPORTANT]
 > With the `2024-07-01` GA API version, the RAID controller is reset during BMM `replace`, wiping all data from the server's virtual disks.
@@ -89,8 +90,9 @@ The `replace` operation should be executed after any hardware maintenance event.
 
 When a BMM is marked with failed hardware validation, it might indicate that physical repairs are needed.
 It's crucial to identify and address these repairs before performing a BMM `replace`.
-A hardware validation process is invoked, as part of the `replace` operation, to ensure the physical host's integrity before deploying the OS image.
-If the BMM continues to have hardware validation failures, then the BMM can't provision successfully meaning it fails to complete the necessary setup steps to become operational and join the cluster.
+A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image.
+The BMM can't provision successfully when the BMM continues to have hardware validation failures.
+As a result, the BMM fails to complete the necessary setup steps to become operational and join the cluster.
 Ensure **all hardware validation issues** are cleared before the next `replace` action.
 
 To understand hardware validation result, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
@@ -105,7 +107,8 @@ Before initiating any `replace` operation, ensure the following preconditions ar
 - Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
 For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 - Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation.
-The BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
+Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
+Confirm the BMM resource's `detailedStatus` isn't in the `Preparing` state.
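
The `detailedStatus` precondition that this change adds for both `reimage` and `replace` could be checked with a small helper like the one below. It is only a sketch under stated assumptions: the input is assumed to be the JSON object returned by `az networkcloud baremetalmachine show --output json`, with `detailedStatus` as a top-level field, and the function name `preflight_blockers` is hypothetical. It does not query the BMC, so running firmware upgrade jobs still need to be confirmed there separately.

```python
import json
from typing import List


def preflight_blockers(bmm: dict) -> List[str]:
    """List reasons to hold off on a `reimage` or `replace` for this BMM snapshot."""
    blockers = []
    status = bmm.get("detailedStatus")
    if status is None:
        # Without the field there is no way to confirm the precondition.
        blockers.append("detailedStatus is missing; cannot confirm the BMM state")
    elif status == "Preparing":
        # Per the guidance above, do not start the operation in this state.
        blockers.append(
            "detailedStatus is Preparing; a firmware upgrade may be in progress")
    return blockers


# Example with a hand-written snapshot (illustrative values, not real output).
snapshot = json.loads('{"name": "example-bmm", "detailedStatus": "Preparing"}')
for reason in preflight_blockers(snapshot):
    print(f"Do not proceed: {reason}")
```

An empty list means only that this one precondition is met. Warnings, degraded conditions, and BMC firmware jobs still need their own review before the operation starts.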