Commit 9e6d214

committed
rack pause update docs
1 parent ff889d4 commit 9e6d214

File tree

3 files changed

+25
-23
lines changed

articles/operator-nexus/howto-cluster-runtime-upgrade-with-pauserack-strategy.md

Lines changed: 14 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -8,25 +8,21 @@ ms.topic: how-to
88
ms.date: 08/16/2024
99
# ms.custom: template-include
1010
---
11-
12-
# Upgrading cluster runtime with a pause rack strategy
11+
## Upgrading cluster runtime with a pause rack strategy
1312

1413
This how-to guide explains the steps to execute a cluster runtime upgrade with the pause rack strategy. Executing a cluster runtime upgrade with the "PauseRack" strategy updates a single rack in a cluster and then pauses to wait for confirmation before moving to the next rack. All existing thresholds are still honored with the pause rack strategy.
1514

1615
## Prerequisites
1716

18-
Please follow the steps mentioned in the prerequisites section of [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
19-
20-
> **Note:**
17+
> [!NOTE]
2118
> Upgrades with the PauseRack strategy are available starting with API version 2024-06-01-preview.
2219
20+
Please follow the steps mentioned in the prerequisites section of [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
21+
2322
## Procedure
2423

2524
1. Enable Rack Pause upgrade strategy on a Nexus cluster
2625

27-
> **Note:**
28-
> Below is just a reference command, please choose threshold values as desired.
29-
3026
Example:
3127

3228
```azurecli
@@ -37,19 +33,22 @@ Please follow the steps mentioned in prerequistie section of [Upgrading cluster
3733
3834
2. Confirm that the cluster resource JSON in the JSON View reflects the rack pause upgrade strategy.
3935
40-
```shell
41-
az networkcloud cluster show --cluster-name "clusterName" --resource-group "resourceGroupName"
42-
```
36+
```azurecli
37+
az networkcloud cluster show --cluster-name "clusterName" --resource-group "resourceGroupName"
38+
```
4339
4440
:::image type="content" source="media/runtime-upgrade-cluster-pause-rack-strategy.png" alt-text="Runtime upgrade strategy property details":::
4541
46-
3. Trigger runtime bundle upgrade as usual from Azure portal / CLI. for reference [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
42+
3.Trigger runtime bundle upgrade as usual from Azure portal / CLI. for reference [Upgrading cluster runtime from Azure CLI](./howto-cluster-runtime-upgrade.md)
4743
48-
4. Once Rack 1 has completed, the runtime upgrade will pause, awaiting user action to resume the runtime upgrade for Rack 2.
44+
4.Once Rack 1 has completed, the runtime upgrade will pause, awaiting user action to resume the runtime upgrade for Rack 2.
4945
5046
:::image type="content" source="media/runtime-upgrade-cluster-paused.png" alt-text="Paused Runtime Upgrade":::
5147
52-
5. To resume the runtime upgrade, execute the following `az networkcloud` cli command to trigger the continue upgrade version action.
48+
> [!NOTE]
49+
> This message will be available in logs for programmatic access. For more details, see [List of logs available for streaming in Azure Operator Nexus](list-logs-available.md)
50+
51+
5.To resume the runtime upgrade, execute the following `az networkcloud` cli command to trigger the continue upgrade version action.
5352
5453
```shell
5554
az networkcloud cluster continue-update-version \
@@ -58,7 +57,7 @@ az networkcloud cluster continue-update-version \
5857
--cluster-name=$CLUSTER_NAME
5958
```
6059

61-
6. Continue repeating step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
60+
6.Continue repeating step 5 for each rack until all racks have been upgraded to the latest runtime bundle.
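
As a rough illustration of steps 5 and 6, the resume call could be wrapped in a small helper that prints the command by default (a dry run) instead of running it. This is a sketch and not part of the commit: the function name `resume_next_rack`, the `DRY_RUN` variable, and the argument values are hypothetical, and the `--resource-group` flag is assumed from the `az networkcloud cluster show` example in step 2, because the diff elides the middle of the original command.

```shell
# Sketch only: resume a paused runtime upgrade for the next rack.
# Assumptions (not from the guide): the helper name, the DRY_RUN switch,
# and the use of --resource-group alongside --cluster-name.
resume_next_rack() {
  # $1 = resource group, $2 = cluster name (placeholder values)
  # Rebuild the positional parameters as the full command line.
  set -- az networkcloud cluster continue-update-version \
    --resource-group "$1" --cluster-name "$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command so it can be reviewed first.
    echo "$*"
  else
    # DRY_RUN=0: actually invoke the Azure CLI.
    "$@"
  fi
}

# Example (dry run): prints the command for one rack.
# resume_next_rack "resourceGroupName" "clusterName"
```

Calling the helper once per paused rack, after confirming the previous rack looks healthy, mirrors repeating step 5 until every rack is upgraded.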
6261

6362
## Related content
6463

articles/operator-nexus/howto-cluster-runtime-upgrade.md

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -94,6 +94,7 @@ The output should be the target cluster's information and the cluster's detailed
9494
For more detailed insights on the upgrade progress, the individual BMM in each Rack can be checked for status. Example of this is provided in the reference section under [BareMetal Machine roles](./reference-near-edge-baremetal-machine-roles.md).
9595

9696
## Configure compute threshold parameters for runtime upgrade using cluster updateStrategy
97+
9798
The following Azure CLI command is used to configure the compute threshold parameters for a runtime upgrade:
9899

99100
```azurecli
@@ -110,25 +111,28 @@ Optional arguments:
110111
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is 15.
111112

112113
An example usage of the command is as below:
114+
113115
```azurecli
114116
az networkcloud cluster update --name "cluster01" --resource-group "cluster01-rg" --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=70 max-unavailable=16 wait-time-minutes=15
115117
```
118+
116119
Upon successful execution of the command, the updateStrategy values specified will be applied to the cluster:
117-
```
118-
"updateStrategy": {
120+
121+
``` "updateStrategy": {
119122
"maxUnavailable": 16,
120123
"strategyType": "Rack",
121124
"thresholdType": "PercentSuccess",
122125
"thresholdValue": 70,
123126
"waitTimeMinutes": 15,
124127
},
125128
```
126-
> [!WARNING]
127-
> When a threshold value below 100% is set, it’s possible that any unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that the upgrade was successful. For troubleshooting issues with bare metal machines, please refer to the troubleshooting guide titled [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md)
129+
130+
> [!NOTE]
131+
> When a threshold value below 100% is set, it’s possible that any unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that the upgrade was successful. For troubleshooting issues with bare metal machines, please refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md)
128132
129133
## Upgrade with PauseRack Strategy
130134

131-
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a cluster runtime upgrade with the "PauseRack" strategy, it will update one rack at a time in the cluster and then pause, awaiting confirmation before proceeding to the next rack. All existing thresholds will continue to be respected with the "PauseRack" strategy. To carry out a cluster runtime upgrade using the "PauseRack" strategy, please follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md)
135+
Starting with API version 2024-06-01-preview, runtime upgrades can be triggered using a "PauseRack" strategy. When you execute a cluster runtime upgrade with the "PauseRack" strategy, it will update one rack at a time in the cluster and then pause, awaiting confirmation before proceeding to the next rack. All existing thresholds will continue to be respected with the "PauseRack" strategy. To carry out a cluster runtime upgrade using the "PauseRack" strategy follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md)
132136

133137
## Frequently Asked Questions
134138

@@ -152,16 +156,15 @@ During a runtime upgrade, the cluster enters a state of `Upgrading`. In the even
152156

153157
### Impact on Nexus Kubernetes tenant workloads during cluster runtime upgrade
154158

155-
During a runtime upgrade, impacted Nexus Kubernetes cluster nodes are cordoned and drained before the Bare Metal Hosts (BMH) are upgraded. Cordoning the cluster node prevents new pods from being scheduled on it and draining the cluster node allows pods that are running tenant workloads a chance to shift to another available cluster node, which helps to reduce the impact on services. The draining mechanism's effectiveness is contingent on the available capacity within the Nexus Kubernetes cluster. If the cluster is nearing full capacity and lacks space for the pods to relocate, they transition into a Pending state following the draining process.
159+
During a runtime upgrade, impacted Nexus Kubernetes cluster nodes are cordoned and drained before the Bare Metal Hosts (BMH) are upgraded. Cordoning the cluster node prevents new pods from being scheduled on it and draining the cluster node allows pods that are running tenant workloads a chance to shift to another available cluster node, which helps to reduce the impact on services. The draining mechanism's effectiveness is contingent on the available capacity within the Nexus Kubernetes cluster. If the cluster is nearing full capacity and lacks space for the pods to relocate, they transition into a Pending state following the draining process.
156160

157161
Once the cordon and drain process of the tenant cluster node is completed, the upgrade of the BMH proceeds. Each tenant cluster node is allowed up to 10 minutes for the draining process to complete, after which the BMH upgrade will begin. This guarantees the BMH upgrade will make progress. BMHs are upgraded one rack at a time, and upgrades are performed in parallel within the same rack. The BMH upgrade does not wait for tenant resources to come online before continuing with the runtime upgrade of BMHs in the rack being upgraded. The benefit of this is that the maximum overall wait time for a rack upgrade is kept at 10 minutes regardless of how many nodes are available. This maximum wait time is specific to the cordon and drain procedure and is not applied to the overall upgrade procedure. Upon completion of each BMH upgrade, the Nexus Kubernetes cluster node starts, rejoins the cluster, and is uncordoned, allowing pods to be scheduled on the node once again.
158162

159163
It's important to note that the Nexus Kubernetes cluster node won't be shut down after the cordon and drain process. The BMH is rebooted with the new image as soon as all the Nexus Kubernetes cluster nodes are cordoned and drained, or after 10 minutes if the drain process isn't completed. Additionally, the cordon and drain is not initiated for power-off or restart actions of the BMH; it's activated only during a runtime upgrade.
160164

161165
It is important to note that following the runtime upgrade, there could be instances where a Nexus Kubernetes cluster node remains cordoned. In such a scenario, you can manually uncordon the node by executing the following commands via(./includes/kubernetes-cluster/cluster-connect.md)
162166

163-
```
164-
kubectl get nodes | grep SchedulingDisabled > /dev/null
167+
```kubectl get nodes | grep SchedulingDisabled > /dev/null
165168
if [ $? -eq 0 ]; then
166169
for node in $(kubectl get nodes | grep SchedulingDisabled | awk '{print $1}'); do
167170
kubectl uncordon $node