Commit 0d03926

Merge pull request #298427 from robertstarling/icm618958784-run-read-control-node
clarify that run-read-command requires a control-plane BMM
2 parents 85af13f + 104b6f2 commit 0d03926

4 files changed, +33 -22 lines changed

articles/operator-nexus/howto-bare-metal-best-practices.md

Lines changed: 14 additions & 8 deletions
@@ -1,7 +1,7 @@
 ---
 title: Best practices for Bare Metal Machine operations
 description: Steps that should be taken before executing any Bare Metal Machine replace, or reimage actions. Highlight essential prerequisites and common pitfalls to avoid.
-ms.date: 03/25/2025
+ms.date: 04/17/2025
 ms.topic: how-to
 ms.service: azure-operator-nexus
 ms.custom: template-how-to, best-practices
@@ -63,8 +63,9 @@ Connected

 Take a deeper look at the NetworkFabric resources by checking the NetworkFabric resources statuses, alerts, and metrics.
 See related articles:
-- [How to monitor interface In and Out packet rate for network fabric devices]
-- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric].
+
+- [How to monitor interface In and Out packet rate for network fabric devices]
+- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric].

 Evaluate for any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems.
 For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
@@ -73,7 +74,9 @@ For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Mac

 Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` or `reimage` operation.
 Interrupting an ongoing firmware upgrade can leave the Bare Metal Machine in an inconsistent state.
-You can view in the iDRAC GUI the `jobqueue` or use a `racadm jobqueque view` to determine if there are firmware upgrade jobs running.
+
+- You can view in the iDRAC GUI the `jobqueue` or use `run-read-command` `racadm jobqueque view` to determine if there are firmware upgrade jobs running.
+- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).

 ```azurecli
 az networkcloud baremetalmachine run-read-command \
@@ -86,6 +89,7 @@ az networkcloud baremetalmachine run-read-command \
 ```

 Here's an example output from the `racadm jobqueue view` command which shows `Firmware Update`.
+
 ```
 [Job ID=JID_833540920066]
 Job Name=Firmware Update: iDRAC
@@ -97,6 +101,7 @@ Percent Complete= [50%]
 ```

 Here's an example output from the `racadm jobqueue view` command showing common happy-path statements.
+
 ```
 -------------------------JOB QUEUE------------------------
 [Job ID=JID_429400224349]
@@ -191,9 +196,10 @@ Some repairs don't require a Bare Metal Machine `replace` to be executed.
 For example, a `replace` operation isn't required when you're performing a physical hot swappable power supply repair because the Bare Metal Machine host will continue to function normally after the repair.
 However, if the Bare Metal Machine failed hardware validation, the Bare Metal Machine `replace` is required even if the hot swappable repairs are done.
 Examine the Bare Metal Machine status messages to determine if hardware validation failures or other degraded conditions are present.
-- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
-- [Troubleshoot Bare Metal Machine Warning Status]
-- [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
+
+- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
+- [Troubleshoot Bare Metal Machine Warning Status]
+- [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).

 Other repairs of this type might be:

@@ -247,4 +253,4 @@ For more information about Support plans, see [Azure Support plans](https://azur
 [Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md
 [Troubleshoot Hardware Validation Failure]: ./troubleshoot-hardware-validation-failure.md
 [How to monitor interface In and Out packet rate for network fabric devices]: ./howto-monitor-interface-packet-rate.md
-[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]: ./howto-configure-diagnostic-settings-monitor-configuration-differences.md
+[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]: ./howto-configure-diagnostic-settings-monitor-configuration-differences.md
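
Review note: the firmware-job check that this file's diff describes (running `racadm jobqueue view` through `run-read-command` and looking for `Firmware Update` entries) could be scripted roughly as follows. This is only a sketch. It assumes the job queue text has already been saved locally, and that each record follows the layout of the sample output in the article (`[Job ID=...]`, then `Job Name=...`, then `Percent Complete=...`). The function name `has_running_firmware_job` and the file name `jobqueue.txt` are hypothetical and are not part of the article or the Azure CLI.

```shell
#!/usr/bin/env bash
# Sketch: detect an unfinished firmware update job in saved `racadm jobqueue view` output.
# Assumption: each record starts with a "[Job ID=...]" line, followed by
# "Job Name=..." and "Percent Complete=..." lines, as in the article's sample.

# Reads job queue text on stdin.
# Exit status 0: an unfinished firmware update job was found. Exit status 1: none found.
has_running_firmware_job() {
  awk '
    /^\[Job ID=/                 { fw = 0 }   # a new job record begins
    /^Job Name=Firmware Update/  { fw = 1 }   # this record is a firmware update
    /^Percent Complete=/ && fw && $0 !~ /\[100%?\]/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Hypothetical usage: jobqueue.txt holds output saved from run-read-command.
# if has_running_firmware_job < jobqueue.txt; then
#   echo "Firmware update in progress; postpone replace/reimage."
# fi
```

The check keys only on the two fields visible in the sample output. A stricter version might also inspect a job status field, but that field is not shown in the excerpt, so it is left out here.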

articles/operator-nexus/howto-baremetal-run-read.md

Lines changed: 5 additions & 3 deletions
@@ -5,7 +5,7 @@ author: eak13
 ms.author: ekarandjeff
 ms.service: azure-operator-nexus
 ms.topic: how-to
-ms.date: 2/13/2025
+ms.date: 4/17/2025
 ms.custom: template-how-to
 ---

@@ -30,7 +30,7 @@ See [Azure Operator Nexus Cluster support for managed identities and user provid

 To change the cluster from a user-assigned identity to a system-assigned identity, the CommandOutputSettings must first be cleared using the command in the next section, then set using this command.

-The CommandOutputSettings can be cleared, directing run-data-extract output back to the cluster manager's storage. However, it isn't recommended since it's less secure, and the option will be removed in a future release.
+The CommandOutputSettings can be cleared, directing run-read-command output back to the cluster manager's storage. However, it isn't recommended since it's less secure, and the option will be removed in a future release.

 However, the CommandOutputSettings do need to be cleared if switching from a user-assigned identity to a system-assigned identity.

@@ -246,17 +246,19 @@ This list shows the commands you can use. Commands in `*italics*` can't have `ar
 The command syntax for a single command with no arguments is as follows, using `hostname` as an example:

 ```azurecli
-az networkcloud baremetalmachine run-read-command --name "<machine-name>"
+az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>"
 --limit-time-seconds "<timeout>" \
 --commands "[{command:hostname}]" \
 --resource-group "<cluster_MRG>" \
 --subscription "<subscription>"
 ```

+- `--name` is the name of the BMM resource on which to execute the command.
 - The `--commands` parameter always takes a list of commands, even if there's only one command.
 - Multiple commands can be provided in json format using [Azure CLI Shorthand](https://aka.ms/cli-shorthand) notation.
 - Any whitespace must be enclosed in single quotes.
 - Any arguments for each command must also be provided as a list, as shown in the following examples.
+- Not all commands can run on any BMM. For example, `kubectl` commands can only be run from a BMM with the `control-plane` role.

 ```
 --commands "[{command:hostname},{command:'nc-toolbox nc-toolbox-runread racadm ifconfig'}]"
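
Review note: the quoting rules listed in this hunk (always a list, single quotes around anything containing whitespace) are easy to get wrong by hand. A small helper along the following lines might assemble the `--commands` value. It is a sketch, not part of the Azure CLI: the function name `build_commands_arg` is hypothetical, and it handles only commands without a separate `arguments` list.

```shell
#!/usr/bin/env bash
# Sketch: build an Azure CLI shorthand list for --commands.
# Each function argument is one complete command string.
# Assumption: no command needs a separate `arguments:[...]` list.
build_commands_arg() {
  local out="" cmd
  for cmd in "$@"; do
    case "$cmd" in
      *[[:space:]]*) cmd="'$cmd'" ;;   # whitespace must be enclosed in single quotes
    esac
    out="${out:+$out,}{command:$cmd}"  # comma-separate the entries
  done
  printf '[%s]\n' "$out"
}

build_commands_arg hostname 'nc-toolbox nc-toolbox-runread racadm ifconfig'
# prints: [{command:hostname},{command:'nc-toolbox nc-toolbox-runread racadm ifconfig'}]
```

The result could then be passed as `--commands "$(build_commands_arg ...)"`. A single command still yields a one-element list, which matches the rule that `--commands` always takes a list.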

articles/operator-nexus/troubleshoot-bare-metal-machine-warning.md

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@ description: Troubleshooting guide for Bare Metal Machines Warning status messag
 ms.service: azure-operator-nexus
 ms.custom: azure-operator-nexus
 ms.topic: troubleshooting
-ms.date: 03/03/2025
+ms.date: 04/17/2025
 author: robertstarling
 ms.author: robstarling
 ms.reviewer: ekarandjeff
@@ -46,7 +46,7 @@ For more information, use an Azure CLI Bare Metal Machine `run-read-command` com
 ```azurecli
 az networkcloud baremetalmachine run-read-command \
 -g <ResourceGroup_Name> \
--n <BareMetal Machine Name> \
+-n rack1control01 \
 --limit-time-seconds 60 \
 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack1compute01,-o,json]}]" \
 --output-directory .

articles/operator-nexus/troubleshoot-memory-limits.md

Lines changed: 12 additions & 9 deletions
@@ -4,7 +4,7 @@ description: Learn how to troubleshoot Kubernetes container limits.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
-ms.date: 11/01/2024
+ms.date: 04/17/2025
 ms.author: matthewernst
 author: matternst7258
 ---
@@ -19,18 +19,21 @@ We recommend that you have alerts set up for the Azure Operator Nexus cluster to

 The following table lists the metrics that are exposed to identify memory limits.

-| Metric name | Description |
-| ------------------------------------ | ------------------------------------------------ |
-| Container Restarts | `kube_pod_container_status_restarts_total` |
-| Container Status Terminated Reason | `kube_pod_container_status_terminated_reason` |
-| Container Resource Limits | `kube_pod_container_resource_limits` |
+| Metric name | Description |
+| ---------------------------------- | --------------------------------------------- |
+| Container Restarts | `kube_pod_container_status_restarts_total` |
+| Container Status Terminated Reason | `kube_pod_container_status_terminated_reason` |
+| Container Resource Limits | `kube_pod_container_resource_limits` |

 The `Container Status Terminated Reason` metric displays the `OOMKill` reason for pods that are affected.

 ## Identify Out of Memory (OOM) pods

 Start by identifying any components that are restarting or show `OOMKill`.

+- Replace `<bareMetalMachineName>` with the name of a healthy `control-plane` Bare Metal Machine resource on which to execute the `kubectl` command.
+- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
+
 ```azcli
 az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
 --limit-time-seconds 60 \
@@ -92,6 +95,6 @@ az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>

 ## Known services susceptible to OOM issues

-* cdi-operator
-* vulnerability-operator
-* cluster-metadata-operator
+- cdi-operator
+- vulnerability-operator
+- cluster-metadata-operator
