Commit 0b12d29

Merge pull request #240842 from JAC0BSMITH/jac0bsmithNexusDocs
Troubleshooting with the 3 R's for BMM
2 parents f939c14 + c8e0e7b commit 0b12d29

2 files changed: +79 -0 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -61,6 +61,8 @@
       href: howto-track-async-operations-cli.md
   - name: Troubleshooting
     items:
+      - name: Troubleshoot Bare Metal Machine
+        href: troubleshoot-reboot-reimage-replace.md
       - name: Troubleshoot AKS-Hybrid
         href: troubleshoot-aks-hybrid-cluster.md
       - name: Troubleshoot Isolation Domain
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---
title: Troubleshoot cluster baremetalmachine with three Rs for Azure Operator Nexus
description: Troubleshoot cluster baremetalmachine with three Rs for Azure Operator Nexus
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 06/12/2023
author: JAC0BSMITH
ms.author: jacobsmith
---

# Troubleshooting Server Issues

This article describes how to troubleshoot server issues by using the restart (reboot), reimage, and replace actions on Azure Operator Nexus Bare Metal Machines (BMMs). You may need to take these actions on a server for maintenance reasons. Each action causes a brief disruption to that specific BMM while the server performs the operation.
The three actions take a similar amount of time to complete: reboot is the fastest and replace takes slightly longer. All three are simple and efficient troubleshooting methods.

## Prerequisites

- Familiarize yourself with the capabilities referenced in this article by reviewing [Bare Metal Machine Actions](howto-baremetal-functions.md)
- Get the name of the resource group for the BMM
- Get the name of the bare metal machine that requires a lifecycle management operation

## Identifying the corrective action

When you troubleshoot a BMM failure and need to determine the best corrective action, it's important to understand the options available. Rebooting or reimaging a BMM server can be an efficient and effective way to fix problems or simply restore the software to a known-good state. This article provides direction on the best practices to follow for each of the three Rs.

It's important to have a systematic approach when troubleshooting technical issues. One effective method is to start with the simplest and least invasive solution and work your way up to more complex and drastic measures, if necessary.

The first step in troubleshooting is often to reboot the device or system. Rebooting can clear temporary glitches or errors that may be causing the issue. If rebooting doesn't solve the problem, the next step may be to reimage the device or system.

If reimaging doesn't solve the problem, the final step may be to replace the faulty hardware component. Replacement is a more drastic measure, but it may be necessary if the issue is related to a hardware malfunction.
These troubleshooting methods may not always be effective, and other factors may be at play that require a different approach.
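
The escalation order described above (reboot, then reimage, then replace) can be sketched as a small loop. This is only an illustration of the decision flow, not an Operator Nexus API: `troubleshoot`, `run_action`, and `is_healthy` are hypothetical names, and the caller is assumed to supply its own way of triggering an action and checking whether the BMM has recovered.

```python
from typing import Callable, Optional

# Actions ordered from least to most invasive, following the text above.
ESCALATION_ORDER = ["reboot", "reimage", "replace"]


def troubleshoot(run_action: Callable[[str], None],
                 is_healthy: Callable[[], bool]) -> Optional[str]:
    """Try each action in order; return the first one that restores health.

    run_action and is_healthy are hypothetical callbacks supplied by the
    caller. Returns None if none of the three actions resolved the issue.
    """
    for action in ESCALATION_ORDER:
        run_action(action)
        if is_healthy():
            return action
    return None


if __name__ == "__main__":
    attempted = []
    health_checks = iter([False, True])  # unhealthy after reboot, healthy after reimage
    resolved_by = troubleshoot(attempted.append, lambda: next(health_checks))
    print(resolved_by, attempted)
```

A return value of `None` corresponds to the case noted above where none of the three Rs is effective and a different approach is needed.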

### Troubleshooting with Reboot action

Rebooting a BMM restarts the server through a simple API call. This action can be useful when tenant VMs on the host aren't responsive or are otherwise stuck.

A reboot is typically the starting point for mitigating a problem.

### Troubleshooting with Reimage action

Reimaging a BMM redeploys the image on the OS disk without impact to the Tenant data. This action executes the steps to rejoin the cluster with the same identifiers. The reimage action can be useful for troubleshooting because it restores the OS to a known-good working state. Common scenarios that a reimage can resolve include doubts about host integrity, a suspected or confirmed security compromise, and "break-glass" write activity performed on the host.

The reimage action is the recommended best practice for ensuring the integrity of the BMM with the lowest operational risk.

### Troubleshooting with Replace action

Servers contain many physical components that can fail over time. It's important to understand which physical repairs require a BMM replace action, which don't, and for which a replace action is recommended but not required. A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the replace action doesn't modify the Tenant data.

As a best practice, cordon and shut down the BMM in advance of physical repair.
When performing the following physical repairs, a replace action isn't required, because the BMM host continues to function normally after the repair:

- Hot swappable power supply

When performing the following physical repairs, a replace action is recommended but not necessary to bring the BMM back into service:

- CPU
- DIMM
- Fan
- Expansion board riser
- Transceiver
- Ethernet or fiber cable replacement

When performing the following physical repairs, a replace action is required to bring the BMM back into service:

- Backplane
- System board
- SSD disk
- PERC/RAID adapter
- Mellanox NIC
- Broadcom embedded NIC
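
The three lists above amount to a lookup from repaired component to replace guidance. The following sketch encodes them as a table that a runbook script could consult. The names `ReplaceAction`, `REPLACE_GUIDANCE`, and `replace_guidance` are hypothetical and chosen for illustration; only the component-to-guidance mapping comes from this article.

```python
from enum import Enum


class ReplaceAction(Enum):
    NOT_REQUIRED = "not required"
    RECOMMENDED = "recommended"
    REQUIRED = "required"


# Component names are lowercased copies of the list entries above.
REPLACE_GUIDANCE = {
    "hot swappable power supply": ReplaceAction.NOT_REQUIRED,
    "cpu": ReplaceAction.RECOMMENDED,
    "dimm": ReplaceAction.RECOMMENDED,
    "fan": ReplaceAction.RECOMMENDED,
    "expansion board riser": ReplaceAction.RECOMMENDED,
    "transceiver": ReplaceAction.RECOMMENDED,
    "ethernet or fiber cable replacement": ReplaceAction.RECOMMENDED,
    "backplane": ReplaceAction.REQUIRED,
    "system board": ReplaceAction.REQUIRED,
    "ssd disk": ReplaceAction.REQUIRED,
    "perc/raid adapter": ReplaceAction.REQUIRED,
    "mellanox nic": ReplaceAction.REQUIRED,
    "broadcom embedded nic": ReplaceAction.REQUIRED,
}


def replace_guidance(component: str) -> ReplaceAction:
    """Return the replace guidance for a repaired component.

    Raises ValueError for components this article doesn't list, rather
    than guessing a default.
    """
    key = component.strip().lower()
    if key not in REPLACE_GUIDANCE:
        raise ValueError(f"No guidance listed for component: {component!r}")
    return REPLACE_GUIDANCE[key]


if __name__ == "__main__":
    print(replace_guidance("System board").value)
    print(replace_guidance("Fan").value)
```

Raising an error for unlisted components is a deliberate choice in this sketch: the article doesn't state a default, so the lookup doesn't assume one.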

### Summary

Rebooting, reimaging, and replacing are three effective troubleshooting methods for addressing technical issues. It's important to take a systematic approach and to consider other factors before attempting any drastic measures.

If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade) to get your issue resolved quickly.
