Commit 80b44da

address PauseRack PR review comments
1 parent 9e6d214 commit 80b44da

3 files changed

+48
-34
lines changed

articles/operator-nexus/howto-cluster-runtime-upgrade-with-pauserack-strategy.md

Lines changed: 28 additions & 17 deletions
@@ -1,54 +1,65 @@
11
---
2-
title: "Azure Operator Nexus: Runtime upgrade with rack pause strategy"
3-
description: Learn to execute a cluster runtime upgrade for Operator Nexus with a pause rack strategy
2+
title: "Azure Operator Nexus: Runtime upgrade with PauseRack strategy"
3+
description: Learn to execute a cluster runtime upgrade for Operator Nexus with a PauseRack strategy
44
author: vivekjMSFT
55
ms.author: vija
66
ms.service: azure-operator-nexus
77
ms.topic: how-to
88
ms.date: 08/16/2024
99
# ms.custom: template-include
1010
---
11-
## Upgrading cluster runtime with a pause rack strategy
11+
# Upgrading cluster runtime with a PauseRack strategy
1212

13-
This how-to guide explains the steps to execute a cluster runtime upgrade with pasue rack strategy. Executing cluster runtime upgrade with "PauseRack" strategy will update a single rack in a cluster and then pause to wait for confirmation before moving to the next rack. All existing thresholds will still be honoried with pause rack strategy.
13+
This how-to guide explains the steps to execute a cluster runtime upgrade with PauseRack strategy. Executing cluster runtime upgrade with PauseRack strategy will update a single rack in a cluster and then pause to wait for confirmation before moving to the next rack. All existing thresholds will still be honored.
1414

1515
## Prerequisites
1616

1717
> [!NOTE]
1818
> Upgrades with the PauseRack strategy are available starting with API version 2024-06-01-preview.
1919
20-
Please follow the steps mentioned in prerequistie section of [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
20+
1. The [Azure CLI][installation-instruction] must be installed.
21+
2. The `networkcloud` CLI extension is required. If the `networkcloud` extension isn't installed, it can be installed following the steps listed [here](https://github.com/MicrosoftDocs/azure-docs-pr/blob/main/articles/operator-nexus/howto-install-cli-extensions.md).
22+
3. Access to the Azure portal for the target cluster to be upgraded.
23+
4. You must be logged in to the same subscription as your target cluster via `az login`.
24+
5. The target cluster must be in a running state, with all control plane nodes healthy and at least 80% of compute nodes in a running and healthy state.
2125

2226
## Procedure
2327

24-
1. Enable Rack Pause upgrade strategy on a Nexus cluster
25-
26-
Example:
28+
1. Enable PauseRack upgrade strategy on a Nexus cluster
2729

2830
```azurecli
29-
az networkcloud cluster update --name "clusterName" --resource-group "resourceGroupName" --update-strategy \
30-
strategy-type="PauseRack" \
31-
wait-time-minutes=0
31+
az networkcloud cluster update \
32+
--name $CLUSTER_NAME \
33+
--resource-group $RESOURCE_GROUP \
34+
--update-strategy strategy-type="PauseRack" wait-time-minutes=0
3235
```
3336
34-
2. Confirm that the cluster resource JSON in the JSON View reflects the rack pause upgrade strategy.
37+
2. Confirm that the cluster resource JSON in the JSON View reflects the PauseRack upgrade strategy.
3538
3639
```azurecli
3740
az networkcloud cluster show --cluster-name "clusterName" --resource-group "resourceGroupName"
3841
```
3942
40-
:::image type="content" source="media/runtime-upgrade-cluster-pause-rack-strategy.png" alt-text="Runtime upgrade strategy property details":::
43+
```
44+
"updateStrategy": {
45+
"maxUnavailable": 2,
46+
"strategyType": "PauseAfterRack",
47+
"thresholdType": "PercentSuccess",
48+
"thresholdValue": 70,
49+
"waitTimeMinutes": 15,
50+
}
51+
```
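The check in step 2 can also be scripted. The following is a minimal sketch, assuming the Azure CLI and the `networkcloud` extension are installed and you are logged in. `check_update_strategy` is a hypothetical helper name, not part of the CLI. The expected value is supplied by the caller, because the sample output above shows `PauseAfterRack` while the update command in step 1 uses `PauseRack`.

```shell
# Hypothetical helper: compare the cluster's strategy type with an expected value.
# Usage: check_update_strategy <cluster-name> <resource-group> <expected-strategy>
check_update_strategy() {
  local cluster="$1" rg="$2" expected="$3" actual
  # --query and --output are global Azure CLI arguments; the JMESPath
  # expression selects the strategyType field shown in the JSON above.
  actual="$(az networkcloud cluster show \
    --name "$cluster" \
    --resource-group "$rg" \
    --query "updateStrategy.strategyType" \
    --output tsv)" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK: strategyType is $actual"
  else
    echo "MISMATCH: strategyType is '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

For example, `check_update_strategy "$CLUSTER_NAME" "$RESOURCE_GROUP" PauseAfterRack` returns a non-zero exit code when the strategy was not applied, which makes it usable as a guard before triggering the upgrade.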
4152
42-
3.Trigger runtime bundle upgrade as usual from Azure portal / CLI. for reference [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
53+
3. Trigger the runtime bundle upgrade as usual from the Azure portal or CLI. For reference, see [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md).
4354
44-
4.Once Rack 1 has completed, the runtime upgrade will pause, awaiting user action to resume the runtime upgrade for Rack 2.
55+
4. Once Rack 1 completes, the runtime upgrade will be paused, awaiting user action to resume the upgrade for Rack 2.
4556
4657
:::image type="content" source="media/runtime-upgrade-cluster-paused.png" alt-text="Paused Runtime Upgrade":::
4758
4859
> [!NOTE]
4960
> This message will be available in logs for programmatic access. For more details, see [List of logs available for streaming in Azure Operator Nexus](list-logs-available.md).
5061
51-
5.To resume the runtime upgrade, execute the following `az networkcloud` cli command to trigger the continue upgrade version action.
62+
5. To resume the runtime upgrade, execute the following `az networkcloud` CLI command.
5263
5364
```shell
5465
az networkcloud cluster continue-update-version \
@@ -57,7 +68,7 @@ az networkcloud cluster continue-update-version \
5768
--cluster-name=$CLUSTER_NAME
5869
```
5970

60-
6.Continue repeating step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
71+
6. Repeat step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
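Steps 5 and 6 can be wrapped in a small loop. This is a sketch under stated assumptions, not a supported tool: `continue_paused_upgrade` is a hypothetical helper, the `--resource-group` argument is assumed to be required alongside `--cluster-name`, and the function does not detect the paused state itself. Confirm each prompt only after the portal or logs show that the upgrade has paused.

```shell
# Hypothetical helper: resume a paused upgrade once per remaining rack.
# Usage: continue_paused_upgrade <cluster-name> <resource-group> <remaining-racks>
continue_paused_upgrade() {
  local cluster="$1" rg="$2" remaining="$3" i answer
  i=1
  while [ "$i" -le "$remaining" ]; do
    # Ask for explicit confirmation so each rack can be validated first.
    printf 'Resume upgrade (%s of %s)? [y/N] ' "$i" "$remaining"
    read -r answer || return 1
    case "$answer" in
      y|Y) ;;
      *) echo "Stopped before resume $i"; return 1 ;;
    esac
    # Assumption: --resource-group is accepted together with --cluster-name.
    az networkcloud cluster continue-update-version \
      --cluster-name "$cluster" \
      --resource-group "$rg" || return 2
    i=$((i + 1))
  done
  echo "Issued $remaining resume request(s)"
}
```

Keeping the confirmation prompt preserves the intent of the PauseRack strategy: an operator still decides when each rack is healthy enough to proceed.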
6172

6273
## Related content
6374

articles/operator-nexus/howto-cluster-runtime-upgrade.md

Lines changed: 20 additions & 17 deletions
@@ -98,15 +98,20 @@ For more detailed insights on the upgrade progress, the individual BMM in each R
9898
The following Azure CLI command is used to configure the compute threshold parameters for a runtime upgrade:
9999

100100
```azurecli
101-
az networkcloud cluster update --name "<clusterName>" --resource-group "<resourceGroup>" --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value="<thresholdValue>" max-unavailable=<maxNodesOffline> wait-time-minutes=<waitTimeBetweenRacks>
101+
az networkcloud cluster update \
102+
--name "<clusterName>" \
103+
--resource-group "<resourceGroup>" \
104+
--update-strategy strategy-type="Rack" threshold-type="PercentSuccess" \
105+
threshold-value="<thresholdValue>" max-unavailable=<maxNodesOffline> \
106+
wait-time-minutes=<waitTimeBetweenRacks>
102107
```
103108

104-
Required arguments:
105-
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is "Rack"
109+
Required parameters:
110+
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is "Rack".
106111
- threshold-type: Determines how the threshold should be evaluated, applied in the units defined by the strategy. The default value is "PercentSuccess".
107112
- threshold-value: The numeric threshold value used to evaluate an update. The default value is 80.
108113

109-
Optional arguments:
114+
Optional parameters:
110115
- max-unavailable: The maximum number of worker nodes that can be offline, that is, being upgraded in a rack at one time. The default value is 32767.
111116
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is 15.
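To make the threshold parameters concrete, here is a small arithmetic sketch. It assumes that `PercentSuccess` is evaluated as the share of compute nodes in a rack that upgrade successfully; that reading, and the helper name `min_successful_nodes`, are assumptions for illustration rather than documented behavior.

```shell
# Hypothetical helper: minimum number of successful nodes needed for a rack
# of <nodes> compute nodes to meet a PercentSuccess threshold of <percent>.
# Assumption: the threshold is the percentage of nodes in the rack that succeed.
min_successful_nodes() {
  local nodes="$1" percent="$2"
  # Integer ceiling of nodes * percent / 100.
  echo $(( (nodes * percent + 99) / 100 ))
}

min_successful_nodes 16 80   # prints 13
min_successful_nodes 16 70   # prints 12
```

Under this assumption, a 16-node rack with the default threshold of 80 tolerates up to three failed nodes before the rack is considered unsuccessful.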
112117

@@ -118,21 +123,22 @@ az networkcloud cluster update --name "cluster01" --resource-group "cluster01-rg
118123

119124
Upon successful execution of the command, the updateStrategy values specified will be applied to the cluster:
120125

121-
``` "updateStrategy": {
126+
```
127+
"updateStrategy": {
122128
"maxUnavailable": 16,
123129
"strategyType": "Rack",
124130
"thresholdType": "PercentSuccess",
125131
"thresholdValue": 70,
126132
"waitTimeMinutes": 15,
127-
},
133+
}
128134
```
129135

130136
> [!NOTE]
131-
> When a threshold value below 100% is set, it’s possible that any unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that upgrade was sucessfull. For troubleshooting issues with bare metal machines, please refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md)
137+
> When a threshold value below 100% is set, it’s possible that any unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that the upgrade was successful. For troubleshooting issues with bare metal machines, refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md).
132138
133-
## Upgrade with PauseRack Strategy
139+
## Upgrade with Pause Rack Strategy
134140

135-
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a cluster runtime upgrade with the PauseRack" strategy, it will update one rack at a time in the cluster and then pause, awaiting confirmation before proceeding to the next rack. All existing thresholds will continue to be respected with the "PauseRack" strategy. To carry out a cluster runtime upgrade using the "PauseRack" strategy follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md)
141+
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a cluster runtime upgrade with the "PauseRack" strategy, it updates one rack at a time in the cluster and then pauses, awaiting confirmation before proceeding to the next rack. All existing thresholds continue to be respected with the "PauseRack" strategy. To carry out a cluster runtime upgrade using the "PauseRack" strategy, follow the steps outlined in [Upgrading cluster runtime with a PauseRack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md).
136142

137143
## Frequently Asked Questions
138144

@@ -162,15 +168,12 @@ Once the cordon and drain process of the tenant cluster node is completed, the u
162168

163169
It's important to note that the Nexus Kubernetes cluster node won't be shut down after the cordon and drain process. The BMH is rebooted with the new image as soon as all the Nexus Kubernetes cluster nodes are cordoned and drained, or after 10 minutes if the drain process isn't completed. Additionally, the cordon and drain isn't initiated for power-off or restart actions of the BMH; it's activated only during a runtime upgrade.
164170

165-
It is important to note that following the runtime upgrade, there could be instance where a Nexus Kubernetes Cluster node remains cordoned. For such scenario, you can manually uncordon the node by executing the following commands via(./includes/kubernetes-cluster/cluster-connect.md)
171+
It is important to note that following the runtime upgrade, there could be instances where a Nexus Kubernetes Cluster node remains cordoned. In such a scenario, you can manually uncordon the node by executing the following command:
166172

167-
```kubectl get nodes | grep SchedulingDisabled > /dev/null
168-
if [ $? -eq 0 ]; then
169-
for node in $(kubectl get nodes | grep SchedulingDisabled | awk '{print $1}'); do
170-
kubectl uncordon $node
171-
done
172-
fi
173-
```
173+
```azurecli
174+
az networkcloud baremetalmachine list -g $mrg --subscription $sub --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,powerState:powerState,tags:tags.Status,machineRoles:join(', ', machineRoles),cordonStatus:cordonStatus,createdAt:systemData.createdAt}, &name)" \
175+
--output table
174176
177+
```
175178
<!-- LINKS - External -->
176179
[installation-instruction]: https://aka.ms/azcli
Binary file not shown.
