articles/operator-nexus/howto-cluster-runtime-upgrade-with-pauserack-strategy.md
Lines changed: 28 additions & 17 deletions
@@ -1,54 +1,65 @@
1
1
---
2
-
title: "Azure Operator Nexus: Runtime upgrade with rack pause strategy"
3
-
description: Learn to execute a cluster runtime upgrade for Operator Nexus with a pause rack strategy
2
+
title: "Azure Operator Nexus: Runtime upgrade with PauseRack strategy"
3
+
description: Learn to execute a cluster runtime upgrade for Operator Nexus with a PauseRack strategy
4
4
author: vivekjMSFT
5
5
ms.author: vija
6
6
ms.service: azure-operator-nexus
7
7
ms.topic: how-to
8
8
ms.date: 08/16/2024
9
9
# ms.custom: template-include
10
10
---
11
-
##Upgrading cluster runtime with a pause rack strategy
11
+
# Upgrading cluster runtime with a PauseRack strategy
12
12
13
-
This how-to guide explains the steps to execute a cluster runtime upgrade with pasue rack strategy. Executing cluster runtime upgrade with "PauseRack" strategy will update a single rack in a cluster and then pause to wait for confirmation before moving to the next rack. All existing thresholds will still be honoried with pause rack strategy.
13
+
This how-to guide explains the steps to execute a cluster runtime upgrade with the PauseRack strategy. A cluster runtime upgrade with the PauseRack strategy updates a single rack in a cluster and then pauses to wait for confirmation before moving to the next rack. All existing thresholds are still honored.
14
14
15
15
## Prerequisites
16
16
17
17
> [!NOTE]
18
18
> Upgrades with the PauseRack strategy are available starting with API version 2024-06-01-preview.
19
19
20
-
Please follow the steps mentioned in prerequistie section of [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
20
+
1. The Azure CLI must be installed. See [Install Azure CLI][installation-instruction].
21
+
2. The `networkcloud` CLI extension is required. If the `networkcloud` extension isn't installed, it can be installed following the steps listed [here](https://github.com/MicrosoftDocs/azure-docs-pr/blob/main/articles/operator-nexus/howto-install-cli-extensions.md).
22
+
3. Access to the Azure portal for the target cluster to be upgraded.
23
+
4. You must be logged in to the same subscription as your target cluster via `az login`.
24
+
5. The target cluster must be in a running state, with all control plane nodes healthy and at least 80% of compute nodes in a running and healthy state.
21
25
22
26
## Procedure
23
27
24
-
1. Enable Rack Pause upgrade strategy on a Nexus cluster
25
-
26
-
Example:
28
+
1. Enable the PauseRack upgrade strategy on a Nexus cluster.
27
29
28
30
```azurecli
29
-
az networkcloud cluster update --name "clusterName" --resource-group "resourceGroupName" --update-strategy \
3.Trigger runtime bundle upgrade as usual from Azure portal / CLI. for reference [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
53
+
3. Trigger the runtime bundle upgrade as usual from the Azure portal or CLI. For reference, see [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md).
43
54
44
-
4.Once Rack 1 has completed, the runtime upgrade will pause, awaiting user action to resume the runtime upgrade for Rack 2.
55
+
4. Once Rack 1 completes, the runtime upgrade pauses and awaits user action to resume the upgrade for Rack 2.
> This message will be available in logs for programmatic access. For more details, see [List of logs available for streaming in Azure Operator Nexus](list-logs-available.md).
50
61
51
-
5.To resume the runtime upgrade, execute the following `az networkcloud` cli command to trigger the continue upgrade version action.
62
+
5. To resume the runtime upgrade, run the following `az networkcloud` CLI command.
52
63
53
64
```shell
54
65
az networkcloud cluster continue-update-version \
@@ -57,7 +68,7 @@ az networkcloud cluster continue-update-version \
57
68
--cluster-name=$CLUSTER_NAME
58
69
```
59
70
60
-
6.Continue repeating step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
71
+
6. Repeat step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
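Steps 5 and 6 can be sketched as a small loop that asks for confirmation before resuming each remaining rack. This is only an illustration under stated assumptions: `resume_remaining_racks` and `AZ_CMD` are hypothetical names introduced here, `RESOURCE_GROUP` and `CLUSTER_NAME` are assumed to be set in your shell, and the `--resource-group` argument is assumed because part of the command in step 5 is elided in this diff. Confirm each prompt only after the current rack has finished and the upgrade reports that it is paused.

```shell
# AZ_CMD is a hypothetical override for dry runs (for example AZ_CMD=echo);
# it defaults to the real az CLI.
AZ_CMD=${AZ_CMD:-az}

# Hypothetical helper, not part of the az CLI: resumes the paused upgrade
# once per remaining rack, asking for confirmation before each resume.
# Usage: resume_remaining_racks <remaining_rack_count>
resume_remaining_racks() {
  remaining=$1
  i=1
  while [ "$i" -le "$remaining" ]; do
    printf 'Resume upgrade for the next rack (%s of %s)? [y/N] ' "$i" "$remaining"
    read -r answer
    case "$answer" in
      y|Y) ;;
      *) echo "Stopping; the upgrade stays paused."; return 1 ;;
    esac
    # --resource-group is an assumption; --cluster-name comes from step 5.
    $AZ_CMD networkcloud cluster continue-update-version \
      --resource-group "$RESOURCE_GROUP" \
      --cluster-name "$CLUSTER_NAME" || return 1
    i=$((i + 1))
  done
}
```

For a cluster with eight racks, where Rack 1 has already completed, `resume_remaining_racks 7` would walk through the remaining racks. Setting `AZ_CMD=echo` first prints the commands instead of running them.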
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is "Rack"
109
+
Required parameters:
110
+
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is "Rack".
106
111
- threshold-type: Determines how the threshold should be evaluated, applied in the units defined by the strategy. The default value is "PercentSuccess".
107
112
- threshold-value: The numeric threshold value used to evaluate an update. The default value is 80.
108
113
109
-
Optional arguments:
114
+
Optional parameters:
110
115
- max-unavailable: The maximum number of worker nodes that can be offline (that is, being upgraded) per rack at a time. The default value is 32767.
111
116
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is 15.
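To make the threshold parameters concrete, here is a minimal sketch of how a `PercentSuccess` threshold could be evaluated for a single rack. The helper name `rack_meets_threshold` and the integer-only comparison are assumptions made for illustration; this article doesn't describe how the service evaluates the threshold internally.

```shell
# Hypothetical helper, not part of the az CLI: succeeds (exit status 0) when
# the share of successfully upgraded nodes in a rack meets the threshold.
# Usage: rack_meets_threshold <succeeded_nodes> <total_nodes> <threshold_percent>
rack_meets_threshold() {
  succeeded=$1
  total=$2
  threshold=$3
  # An empty rack can't meet any threshold; this also avoids dividing by zero.
  if [ "$total" -le 0 ]; then
    return 1
  fi
  # Test succeeded/total >= threshold/100 using integer arithmetic only.
  [ $((succeeded * 100)) -ge $((threshold * total)) ]
}

# Example: 13 of 16 nodes upgraded, checked against the default threshold-value of 80.
if rack_meets_threshold 13 16 80; then
  echo "threshold met"
else
  echo "threshold not met"
fi
```

With 13 of 16 nodes (81.25%) the check passes, while 12 of 16 (75%) would fall below the default value of 80.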
Upon successful execution of the command, the updateStrategy values specified will be applied to the cluster:
120
125
121
-
```"updateStrategy": {
126
+
```
127
+
"updateStrategy": {
122
128
"maxUnavailable": 16,
123
129
"strategyType": "Rack",
124
130
"thresholdType": "PercentSuccess",
125
131
"thresholdValue": 70,
126
132
"waitTimeMinutes": 15
127
-
},
133
+
}
128
134
```
129
135
130
136
> [!NOTE]
131
-
> When a threshold value below 100% is set, it’s possible that any unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that upgrade was sucessfull. For troubleshooting issues with bare metal machines, please refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md)
137
+
> When a threshold value below 100% is set, some unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that the upgrade was successful. To troubleshoot issues with bare metal machines, refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md).
132
138
133
-
## Upgrade with PauseRack Strategy
139
+
## Upgrade with Pause Rack Strategy
134
140
135
-
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a cluster runtime upgrade with the PauseRack" strategy, it will update one rack at a time in the cluster and then pause, awaiting confirmation before proceeding to the next rack. All existing thresholds will continue to be respected with the "PauseRack" strategy. To carry out a cluster runtime upgrade using the "PauseRack" strategy follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md)
141
+
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a cluster runtime upgrade with the "PauseRack" strategy, it updates one rack at a time in the cluster and then pauses, awaiting confirmation before proceeding to the next rack. All existing thresholds continue to be respected with the "PauseRack" strategy. To carry out a cluster runtime upgrade using the "PauseRack" strategy, follow the steps outlined in [Upgrading cluster runtime with a PauseRack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md).
136
142
137
143
## Frequently Asked Questions
138
144
@@ -162,15 +168,12 @@ Once the cordon and drain process of the tenant cluster node is completed, the u
162
168
163
169
It's important to note that the Nexus Kubernetes cluster node won't be shut down after the cordon and drain process. The BMH is rebooted with the new image as soon as all the Nexus Kubernetes cluster nodes are cordoned and drained, or after 10 minutes if the drain process isn't completed. Additionally, the cordon and drain isn't initiated for power-off or restart actions of the BMH; it's activated only during a runtime upgrade.
164
170
165
-
It is important to note that following the runtime upgrade, there could be instance where a Nexus Kubernetes Cluster node remains cordoned. For such scenario, you can manually uncordon the node by executing the following commands via(./includes/kubernetes-cluster/cluster-connect.md)
171
+
It is important to note that following the runtime upgrade, there could be instances where a Nexus Kubernetes cluster node remains cordoned. In such a scenario, you can manually uncordon the node by executing the following command:
166
172
167
-
```kubectl get nodes | grep SchedulingDisabled > /dev/null
168
-
if [ $? -eq 0 ]; then
169
-
for node in $(kubectl get nodes | grep SchedulingDisabled | awk '{print $1}'); do
170
-
kubectl uncordon $node
171
-
done
172
-
fi
173
-
```
173
+
```azurecli
174
+
az networkcloud baremetalmachine list -g $mrg --subscription $sub --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,powerState:powerState,tags:tags.Status,machineRoles:join(', ', machineRoles),cordonStatus:cordonStatus,createdAt:systemData.createdAt}, &name)"