articles/operator-nexus/howto-baremetal-best-practices.md
+14 -16 (14 additions & 16 deletions)
@@ -37,14 +37,13 @@ For this reason, it's essential to understand the available options well when tr
37
37
- Attempt to identify the root cause of the failure to avoid repeating the same mistake.
38
38
Perform retry attempts in incremental steps to isolate and address specific issues.
39
39
- Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps.
40
-
- Keep an eye on system logs to detect any anomalies during the retry process.
41
-
- Verify that the firmware and software versions are up-to-date to prevent compatibility issues and ensure compatibility between hardware and software versions.
42
-
- Always back up critical data to prevent data loss during the recovery or replacement process.
40
+
- Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions.
41
+
For more information about firmware compatibility, see [Operator Nexus Platform Prerequisites](./howto-platform-prerequisites.md).
43
42
- Ensure stable network connectivity to avoid interruptions during the process.
44
43
Validate that there are no active network stability issues with the network fabric.
45
44
Ignoring network stability could make operations fail to complete successfully and leave a BMM in an unknown state.
46
45
47
-
## Best Practices for BMM Reimage
46
+
## Best Practices for a BMM Reimage
48
47
49
48
The BMM `reimage` action is explained in [BMM Lifecycle Management Commands] and scenario procedures described in [Troubleshoot Azure Operator Nexus Server Problems].
50
49
@@ -53,33 +52,31 @@ The BMM `reimage` action is explained in [BMM Lifecycle Management Commands] and
53
52
You can restore the operating system runtime version on a BMM by executing the `reimage` operation.
54
53
A BMM `reimage` can be both time-saving and reliable for resolving issues or restoring the operating system software to a known-good state.
55
54
This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers.
56
-
The `reimage` action doesn't affect the tenant workload files on the BMM under normal circumstances.
55
+
The `reimage` action is designed to interact with the operating system partition, leaving virtual machine local storage unchanged.
57
56
58
57
> [!IMPORTANT]
59
-
> Avoid write or edit actions performed on the node via BMM access.
58
+
> Avoid manual or automated changes to the BMM's file system (also known as "break glass").
60
59
> The `reimage` action is required to restore Microsoft support and any changes done to the BMM are lost while restoring the node to its expected state.
61
60
62
-
### Preconditions and Validations Before BMM Reimage
61
+
### Preconditions and Validations Before a BMM Reimage
63
62
64
63
Before initiating any `reimage` operation, ensure the following preconditions are met:
65
64
66
65
- Ensure the BMM is in `poweredState` set to `On` and `readyState` set to `True`.
67
66
- Make sure the BMM's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
68
67
- Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
69
-
- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
68
+
- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
70
69
For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
71
-
- Ensure to resolve any BMM hardware validation failures.
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation.
70
+
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `reimage` operation.
74
71
The BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
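
The first precondition above can be checked from a script before a `reimage` is attempted. The following is a minimal sketch, not a supported procedure: the helper names `bmm_ready_for_reimage` and `check_bmm` are hypothetical, and the JMESPath fields `powerState` and `readyState` are assumed names for the values the list calls `poweredState` and `readyState`. Confirm the field names against the output of `az networkcloud baremetalmachine show` in your environment.

```shell
# Succeeds only when both reimage preconditions from the list hold:
# power state is "On" and ready state is "True".
bmm_ready_for_reimage() {
  [ "$1" = "On" ] && [ "$2" = "True" ]
}

# Hypothetical wrapper: reads both values with the Azure CLI, then evaluates them.
# The field names powerState and readyState are assumptions; verify them first.
# Returns 2 if the CLI call itself fails.
check_bmm() {
  power_state="$(az networkcloud baremetalmachine show --name "$1" \
    --resource-group "$2" --query powerState -o tsv)" || return 2
  ready_state="$(az networkcloud baremetalmachine show --name "$1" \
    --resource-group "$2" --query readyState -o tsv)" || return 2
  bmm_ready_for_reimage "$power_state" "$ready_state"
}
```

For example, `check_bmm <BMM_Name> <ResourceGroup_Name> && echo "ready"` prints `ready` only when both values match. The remaining preconditions (drained workloads, warnings, degraded conditions, and running firmware jobs) still need to be reviewed separately.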
75
72
76
-
## Best Practices for BMM Replace
73
+
## Best Practices for a BMM Replace
77
74
78
75
The BMM `replace` action is explained in [BMM Lifecycle Management Commands] and scenario procedures described in [Troubleshoot Azure Operator Nexus Server Problems].
Hardware failures are an expected occurrence over the natural lifecycle of a server.
79
+
Hardware failures are a normal occurrence over the life of a server.
83
80
Component replacements might be necessary to restore functionality and ensure continued operation.
84
81
In cases where one or more hardware components fail on the server, it's necessary to perform a BMM `replace` operation.
85
82
The `replace` operation should be executed after any hardware maintenance event. Multiple maintenance events should be done as multiple `replace` operations.
@@ -90,14 +87,15 @@ The `replace` operation should be executed after any hardware maintenance event.
90
87
91
88
### Resolve Hardware Validation Issues
92
89
93
-
When a BMM is marked with failed hardware validation, it indicates that physical repairs are needed. It's crucial to identify and address these repairs before performing a BMM `replace`.
90
+
When a BMM is marked with failed hardware validation, it might indicate that physical repairs are needed.
91
+
It's crucial to identify and address these repairs before performing a BMM `replace`.
94
92
A hardware validation process is invoked, as part of the `replace` operation, to ensure the physical host's integrity before deploying the OS image.
95
-
If the BMM continues to have hardware validation failures, then the BMM won't provision successfully meaning it fails to complete the necessary setup steps to become operational and won't join the cluster.
93
+
If the BMM continues to have hardware validation failures, then the BMM can't provision successfully, meaning it fails to complete the necessary setup steps to become operational and join the cluster.
96
94
Ensure **all hardware validation issues** are cleared before the next `replace` action.
97
95
98
96
To understand hardware validation results, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
99
97
100
-
### Preconditions and Validations Before BMM Replace
98
+
### Preconditions and Validations Before a BMM Replace
101
99
102
100
Before initiating any `replace` operation, ensure the following preconditions are met:
articles/operator-nexus/troubleshoot-bare-metal-machine-warning.md
+3 -5 (3 additions & 5 deletions)
@@ -26,15 +26,13 @@ The Detailed status message of the Bare Metal Machine (Operator Nexus) resource
26
26
27
27
## Troubleshooting
28
28
29
+
Evaluate the current status of all BMMs in the specified resource group.
30
+
Any active _Warning_ conditions are visible in the Detailed Status Message, as seen in the following example.
31
+
29
32
To check for any Bare Metal Machines (BMMs) which are reporting _Warning_ messages, run:
30
33
31
34
```azurecli
32
35
az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table
33
-
```
34
-
35
-
This command shows the current status of all BMMs in the specified resource group. Any active _Warning_ conditions are visible in the Detailed Status Message, as seen in the following example.
36
-
37
-
```shell
38
36
Name ResourceGroup DetailedStatus DetailedStatusMessage
articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
+1 -1 (1 addition & 1 deletion)
@@ -9,7 +9,7 @@ author: eak13
9
9
ms.author: ekarandjeff
10
10
---
11
11
12
-
# Troubleshoot Azure Operator Nexus Server Problems
12
+
# Troubleshoot Bare Metal Machine Server Problems
13
13
14
14
This article describes how to troubleshoot server problems by using `restart`, `reimage`, and `replace` actions on Azure Operator Nexus BareMetal Machines (BMM).
15
15
These operations are performed for maintenance on your servers and cause a disruption to the specific BMM.