Commit d4438b3

Merge pull request #286646 from vivekjMSFT/main
[operator-nexus] Add PauseRack update Strategy Documentation
2 parents 5106178 + 0e4fde0 commit d4438b3

File tree

4 files changed

+106
-17
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -146,6 +146,8 @@
146146
href: howto-run-instance-readiness-testing.md
147147
- name: Cluster Upgrades
148148
href: howto-cluster-runtime-upgrade.md
149+
- name: Cluster Upgrades With PauseRack Strategy
150+
href: howto-cluster-runtime-upgrade-with-pauserack-strategy.md
149151
- name: Credential Rotation
150152
href: howto-credential-rotation.md
151153
- name: Credential Manager Key Vault

articles/operator-nexus/howto-cluster-runtime-upgrade-with-pauserack-strategy.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
1+
---
2+
title: "Azure Operator Nexus: Runtime upgrade with PauseRack strategy"
3+
description: Learn to execute a cluster runtime upgrade for Operator Nexus with a PauseRack strategy
4+
author: vivekjMSFT
5+
ms.author: vija
6+
ms.service: azure-operator-nexus
7+
ms.topic: how-to
8+
ms.date: 08/16/2024
9+
# ms.custom: template-include
10+
---
11+
# Upgrading cluster runtime with a PauseRack strategy
12+
13+
This how-to guide explains the steps to execute a cluster runtime upgrade with the PauseRack strategy. With this strategy, the upgrade updates a single rack in the cluster and then pauses, waiting for confirmation before moving to the next rack. All existing thresholds are still honored.
14+
15+
## Prerequisites
16+
17+
> [!NOTE]
18+
> Upgrades with the PauseRack strategy are available starting with API version 2024-06-01-preview.
19+
20+
1. The [Azure CLI][installation-instruction] must be installed.
21+
2. The `networkcloud` CLI extension is required. If the `networkcloud` extension isn't installed, it can be installed following the steps listed [here](https://github.com/MicrosoftDocs/azure-docs-pr/blob/main/articles/operator-nexus/howto-install-cli-extensions.md).
22+
3. Access to the Azure portal for the target cluster to be upgraded.
23+
4. You must be logged in to the same subscription as your target cluster via `az login`.
24+
5. The target cluster must be in a running state, with all control plane nodes healthy and at least 80% of compute nodes in a running and healthy state.
25+
26+
## Procedure
27+
28+
1. Enable the PauseRack upgrade strategy on a Nexus cluster:
29+
30+
```azurecli
31+
az networkcloud cluster update \
32+
--name $CLUSTER_NAME \
33+
--resource-group $RESOURCE_GROUP \
34+
--update-strategy strategy-type="PauseAfterRack" wait-time-minutes=0
35+
```
36+
37+
2. Confirm that the cluster resource JSON in the JSON View reflects the PauseRack upgrade strategy.
38+
39+
```azurecli
40+
az networkcloud cluster show --cluster-name "clusterName" --resource-group "resourceGroupName"
41+
```
42+
43+
```
44+
"updateStrategy": {
45+
"maxUnavailable": 2,
46+
"strategyType": "PauseAfterRack",
47+
"thresholdType": "PercentSuccess",
48+
"thresholdValue": 70,
49+
"waitTimeMinutes": 15
50+
}
51+
```
52+
53+
3. Trigger the runtime bundle upgrade as usual from the Azure portal or the CLI. For reference, see [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md).
54+
55+
4. Once Rack 1 completes, the runtime upgrade will be paused, awaiting user action to resume the upgrade for Rack 2.
56+
57+
:::image type="content" source="media/runtime-upgrade-cluster-paused.png" alt-text="Screenshot showing Paused Runtime Upgrade.":::
58+
59+
> [!NOTE]
60+
> This message is also available in logs for programmatic access. For more details, see [List of logs available for streaming in Azure Operator Nexus](list-logs-available.md).
61+
62+
5. To resume the runtime upgrade, execute the following `az networkcloud` CLI command:
63+
64+
```shell
65+
az networkcloud cluster continue-update-version \
66+
--subscription=$SUBSCRIPTION \
67+
--resource-group=$RESOURCE_GROUP \
68+
--cluster-name=$CLUSTER_NAME
69+
```
70+
71+
6. Repeat step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
72+
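The check in step 2 and the resume in step 5 can be combined in a small script. The following is a minimal sketch, not an official tool: the function names `get_strategy_type` and `resume_next_rack` are hypothetical, `--query` and `--output tsv` are standard Azure CLI global arguments, and the sketch assumes `$SUBSCRIPTION`, `$RESOURCE_GROUP`, and `$CLUSTER_NAME` are already set.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the az commands themselves come from this guide.

# Print the cluster's configured update strategy type (for example, PauseAfterRack).
get_strategy_type() {
  az networkcloud cluster show \
    --cluster-name "$CLUSTER_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --query "updateStrategy.strategyType" \
    --output tsv
}

# Resume the paused runtime upgrade, but only if the pause strategy is configured.
resume_next_rack() {
  local strategy
  strategy="$(get_strategy_type)" || return 1
  if [ "$strategy" != "PauseAfterRack" ]; then
    echo "Unexpected strategy type: $strategy" >&2
    return 1
  fi
  az networkcloud cluster continue-update-version \
    --subscription "$SUBSCRIPTION" \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME"
}
```

Calling `resume_next_rack` once after each rack pauses mirrors the manual procedure; the strategy check only guards against resuming a cluster that isn't configured to pause.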
73+
## Related content
74+
75+
- [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)

articles/operator-nexus/howto-cluster-runtime-upgrade.md

Lines changed: 29 additions & 17 deletions
@@ -94,36 +94,52 @@ The output should be the target cluster's information and the cluster's detailed
9494
For more detailed insights on the upgrade progress, the individual BMM in each Rack can be checked for status. An example of this is provided in the reference section under [BareMetal Machine roles](./reference-near-edge-baremetal-machine-roles.md).
9595

9696
## Configure compute threshold parameters for runtime upgrade using cluster updateStrategy
97+
9798
The following Azure CLI command is used to configure the compute threshold parameters for a runtime upgrade:
9899

99100
```azurecli
100-
az networkcloud cluster update --name "<clusterName>" --resource-group "<resourceGroup>" --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value="<thresholdValue>" max-unavailable=<maxNodesOffline> wait-time-minutes=<waitTimeBetweenRacks>
101+
az networkcloud cluster update \
102+
--name "<clusterName>" \
103+
--resource-group "<resourceGroup>" \
104+
--update-strategy strategy-type="Rack" threshold-type="PercentSuccess" \
105+
threshold-value="<thresholdValue>" max-unavailable=<maxNodesOffline> \
106+
wait-time-minutes=<waitTimeBetweenRacks>
101107
```
102108

103-
Required arguments:
104-
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is "Rack"
109+
Required parameters:
110+
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is "Rack".
105111
- threshold-type: Determines how the threshold should be evaluated, applied in the units defined by the strategy. The default value is "PercentSuccess".
106112
- threshold-value: The numeric threshold value used to evaluate an update. The default value is 80.
107113

108-
Optional arguments:
114+
Optional parameters:
109115
- max-unavailable: The maximum number of worker nodes that can be offline (that is, being upgraded) in a rack at a time. The default value is 32767.
110116
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is 15.
111117

112118
An example usage of the command is as below:
119+
113120
```azurecli
114121
az networkcloud cluster update --name "cluster01" --resource-group "cluster01-rg" --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=70 max-unavailable=16 wait-time-minutes=15
115122
```
123+
116124
Upon successful execution of the command, the updateStrategy values specified will be applied to the cluster:
117-
```
118-
"updateStrategy": {
125+
126+
```
127+
"updateStrategy": {
119128
"maxUnavailable": 16,
120129
"strategyType": "Rack",
121130
"thresholdType": "PercentSuccess",
122131
"thresholdValue": 70,
123132
"waitTimeMinutes": 15
124-
},
133+
}
125134
```
126135

136+
> [!NOTE]
137+
> When a threshold value below 100% is set, unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that the upgrade was successful. To troubleshoot issues with bare metal machines, see [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md).
138+
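As a worked illustration of why an upgrade can succeed while some nodes are left behind: with `threshold-type="PercentSuccess"` and `threshold-value=70`, a rack of 16 compute nodes would need 12 successful nodes, so up to 4 nodes could remain on the old runtime. The helper below is hypothetical and assumes the required count is rounded up to a whole node; the service's exact rounding behavior isn't stated in this article.

```shell
# Hypothetical helper: minimum number of nodes that must upgrade successfully
# for a PercentSuccess threshold, assuming the result is rounded up.
min_successful_nodes() {
  total_nodes=$1
  threshold_percent=$2
  # Integer ceiling of total_nodes * threshold_percent / 100.
  echo $(( (total_nodes * threshold_percent + 99) / 100 ))
}

min_successful_nodes 16 70   # prints 12
min_successful_nodes 16 100  # prints 16
```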
139+
## Upgrade with PauseRack strategy
140+
141+
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a Cluster runtime upgrade with the "PauseRack" strategy, it updates one rack at a time in the Cluster and then stops, awaiting confirmation before proceeding to the next rack. All existing thresholds continue to be respected with the "PauseRack" strategy. To carry out a Cluster runtime upgrade using the "PauseRack" strategy, follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md).
142+
127143
## Frequently Asked Questions
128144

129145
### Identifying Cluster Upgrade Stalled/Stuck
@@ -146,22 +162,18 @@ During a runtime upgrade, the cluster enters a state of `Upgrading`. In the even
146162

147163
### Impact on Nexus Kubernetes tenant workloads during cluster runtime upgrade
148164

149-
During a runtime upgrade, impacted Nexus Kubernetes cluster nodes are cordoned and drained before the Bare Metal Hosts (BMH) are upgraded. Cordoning the cluster node prevents new pods from being scheduled on it and draining the cluster node allows pods that are running tenant workloads a chance to shift to another available cluster node, which helps to reduce the impact on services. The draining mechanism's effectiveness is contingent on the available capacity within the Nexus Kubernetes cluster. If the cluster is nearing full capacity and lacks space for the pods to relocate, they transition into a Pending state following the draining process.
165+
During a runtime upgrade, impacted Nexus Kubernetes Cluster nodes are cordoned and drained before the Bare Metal Hosts (BMH) are upgraded. Cordoning the Kubernetes Cluster node prevents new pods from being scheduled on it and draining the Kubernetes Cluster node allows pods that are running tenant workloads a chance to shift to another available Kubernetes Cluster node, which helps to reduce the impact on services. The draining mechanism's effectiveness is contingent on the available capacity within the Nexus Kubernetes Cluster. If the Kubernetes Cluster is nearing full capacity and lacks space for the pods to relocate, they transition into a Pending state following the draining process.
150166

151167
Once the cordon and drain process of the tenant cluster node is completed, the upgrade of the BMH proceeds. Each tenant cluster node is allowed up to 10 minutes for the draining process to complete, after which the BMH upgrade will begin. This guarantees the BMH upgrade will make progress. BMHs are upgraded one rack at a time, and upgrades are performed in parallel within the same rack. The BMH upgrade does not wait for tenant resources to come online before continuing with the runtime upgrade of BMHs in the rack being upgraded. The benefit of this is that the maximum overall wait time for a rack upgrade is kept at 10 minutes regardless of how many nodes are available. This maximum wait time is specific to the cordon and drain procedure and is not applied to the overall upgrade procedure. Upon completion of each BMH upgrade, the Nexus Kubernetes cluster node starts, rejoins the cluster, and is uncordoned, allowing pods to be scheduled on the node once again.
152168

153169
It's important to note that the Nexus Kubernetes cluster node won't be shut down after the cordon and drain process. The BMH is rebooted with the new image as soon as all the Nexus Kubernetes cluster nodes are cordoned and drained, or after 10 minutes if the drain process isn't completed. Additionally, the cordon and drain is not initiated for power-off or restart actions of the BMH; it's activated only during a runtime upgrade.
154170

155-
It is important to note that following the runtime upgrade, there could be instance where a Nexus Kubernetes Cluster node remains cordoned. For such scenario, you can manually uncordon the node by executing the following commands via(./includes/kubernetes-cluster/cluster-connect.md)
171+
It is important to note that following the runtime upgrade, there could be instances where a Nexus Kubernetes Cluster node remains cordoned. In such a scenario, you can list the bare metal machines together with their `cordonStatus` by executing the following command, and then uncordon any node that is still cordoned:
156172

157-
```
158-
kubectl get nodes | grep SchedulingDisabled > /dev/null
159-
if [ $? -eq 0 ]; then
160-
for node in $(kubectl get nodes | grep SchedulingDisabled | awk '{print $1}'); do
161-
kubectl uncordon $node
162-
done
163-
fi
164-
```
173+
```azurecli
174+
az networkcloud baremetalmachine list -g $mrg --subscription $sub --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,powerState:powerState,tags:tags.Status,machineRoles:join(', ', machineRoles),cordonStatus:cordonStatus,createdAt:systemData.createdAt}, &name)" \
175+
--output table
165176
177+
```
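
To narrow the listing to machines that are still cordoned, a JMESPath filter can be applied to the same command. This is a minimal sketch under stated assumptions: the function name `list_cordoned_machines` is hypothetical, `$mrg` and `$sub` are set as in the command above, and cordoned machines are assumed to report a `cordonStatus` of `Cordoned`.

```shell
# Hypothetical helper: print the names of bare metal machines that are still cordoned.
# Assumes $mrg (managed resource group) and $sub (subscription) are already set,
# and that the cordonStatus value for a cordoned machine is "Cordoned".
list_cordoned_machines() {
  az networkcloud baremetalmachine list \
    -g "$mrg" \
    --subscription "$sub" \
    --query "[?cordonStatus=='Cordoned'].name" \
    --output tsv
}
```

An empty result suggests that no bare metal machine is reporting a cordoned status.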
166178
<!-- LINKS - External -->
167179
[installation-instruction]: https://aka.ms/azcli