Commit 8603acc

authored
Update howto-cluster-runtime-upgrade.md
1 parent 0911b16 commit 8603acc

1 file changed

+38
-71
lines changed

articles/operator-nexus/howto-cluster-runtime-upgrade.md

Lines changed: 38 additions & 71 deletions
@@ -17,7 +17,7 @@ This how-to guide explains the steps for installing the required Azure CLI and e
1717
## Prerequisites
1818

1919
1. The [Azure CLI](/cli/azure/install-azure-cli) must be installed.
20-
2. The `networkcloud` CLI extension is required. If the `networkcloud` extension isn't installed, it can be installed following the steps listed [here](./howto-install-cli-extensions.md)
20+
2. The `networkcloud` CLI extension is required. If the `networkcloud` extension isn't installed, it can be installed following the steps listed [here](./howto-install-cli-extensions.md).
2121
3. Access to the Azure portal for the target cluster to be upgraded.
2222
4. You must be logged in to the same subscription as your target cluster via `az login`.
2323
5. Target cluster must be in a running state, with all control plane nodes healthy and 80+% of compute nodes in a running and healthy state.
@@ -59,42 +59,46 @@ In the output, you can find the `availableUpgradeVersions` property and look at
5959

6060
If there are no available cluster upgrades, the list is empty.
6161

62-
### Set Deployment Threshold
63-
64-
***--update-strategy - The strategy for updating the cluster indicating the allowable compute node failures during bootstrap provisioning.***
62+
## Configure compute threshold parameters for runtime upgrade using cluster updateStrategy
6563

66-
If the customer requests an `update-strategy` threshold that is different from the default of 80%, you can run the following cluster update command.
64+
The following Azure CLI command is used to configure the compute threshold parameters for a runtime upgrade:
6765

6866
```azurecli
69-
az networkcloud cluster update -n <CLUSTER_NAME> -g <CLUSTER_RG> --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=<DEPLOYMENT_THRESHOLD> wait-time-minutes=<DEPLOYMENT_PAUSE_MINS> --subscription <SUBSCRIPTION_ID>
70-
```
71-
72-
strategy-type can be "Rack" (Rack by Rack) OR "PauseAfterRack" (Wait for customer response to continue)
73-
74-
threshold-type can be "PercentSuccess" OR "CountSuccess"
75-
76-
If updateStrategy isn't set, the defaults are as follows:
77-
78-
```
79-
"strategyType": "Rack",
80-
"thresholdType": "PercentSuccess",
81-
"thresholdValue": 80,
82-
"waitTimeMinutes": 1
67+
az networkcloud cluster update \
68+
--name "<clusterName>" \
69+
--resource-group "<resourceGroup>" \
70+
--update-strategy strategy-type="Rack" threshold-type="PercentSuccess" \
71+
threshold-value="<thresholdValue>" max-unavailable=<maxNodesOffline> \
72+
wait-time-minutes=<waitTimeBetweenRacks> \
73+
--subscription <SUBSCRIPTION_ID>
8374
```
8475

76+
Required parameters:
77+
- strategy-type: Defines the update strategy. This can be `"Rack"` (rack by rack) or `"PauseAfterRack"` (upgrade one rack at a time, then wait for confirmation before proceeding to the next rack). The default value is `Rack`. To carry out a cluster runtime upgrade using the `"PauseAfterRack"` strategy, follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md).
78+
- threshold-type: Determines how the threshold should be evaluated, applied in the units defined by the strategy. This can be `"PercentSuccess"` OR `"CountSuccess"`. The default value is `PercentSuccess`.
79+
- threshold-value: The numeric threshold value used to evaluate an update. The default value is `80`.
8580

81+
Optional parameters:
82+
- max-unavailable: The maximum number of worker nodes that can be offline (that is, being upgraded) at one time within a rack. The default value is `32767`.
83+
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is `15`.
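
As a minimal sketch of how the parameters above fit together, the following Bash helper checks the values against the options listed in this section and prints the `key=value` arguments for `--update-strategy`. The function name `build_update_strategy` is hypothetical and not part of the Azure CLI. The 0-100 range check for `PercentSuccess` is an assumption, and the helper doesn't call the service.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Azure CLI): validates the values
# described above and prints the arguments for --update-strategy.
build_update_strategy() {
  local strategy_type="$1" threshold_type="$2" threshold_value="$3"
  local wait_minutes="${4:-15}"   # documented default for wait-time-minutes

  case "$strategy_type" in
    Rack|PauseAfterRack) ;;
    *) echo "invalid strategy-type: $strategy_type" >&2; return 1 ;;
  esac

  case "$threshold_type" in
    PercentSuccess|CountSuccess) ;;
    *) echo "invalid threshold-type: $threshold_type" >&2; return 1 ;;
  esac

  # threshold-value must be a non-negative integer.
  case "$threshold_value" in
    ''|*[!0-9]*) echo "invalid threshold-value: $threshold_value" >&2; return 1 ;;
  esac

  # Assumption: a percentage threshold is only meaningful between 0 and 100.
  if [ "$threshold_type" = "PercentSuccess" ] && [ "$threshold_value" -gt 100 ]; then
    echo "PercentSuccess threshold-value must be between 0 and 100" >&2
    return 1
  fi

  printf 'strategy-type=%s threshold-type=%s threshold-value=%s wait-time-minutes=%s\n' \
    "$strategy_type" "$threshold_type" "$threshold_value" "$wait_minutes"
}
```

The output could then be passed unquoted to the command, for example `--update-strategy $(build_update_strategy Rack PercentSuccess 60 1)`, so that each `key=value` pair becomes a separate argument.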
8684

8785
The following example is for a customer using Rack by Rack strategy with a Percent Success of 60% and a 1-minute pause.
8886

8987
```azurecli
90-
az networkcloud cluster update -n <CLUSTER_NAME> -g <CLUSTER_RG> --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=60 wait-time-minutes=1 --subscription <SUBSCRIPTION_ID>
88+
az networkcloud cluster update --name "<clusterName>" \
89+
--resource-group "<resourceGroup>" \
90+
--update-strategy strategy-type="Rack" threshold-type="PercentSuccess" \
91+
threshold-value=60 wait-time-minutes=1 \
92+
--subscription <SUBSCRIPTION_ID>
9193
```
9294

93-
9495
Verify update:
9596

9697
```
97-
az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <SUBSCRIPTION_ID>| grep -a5 updateStrategy
98+
az networkcloud cluster show --resource-group "<resourceGroup>" \
99+
--name "<clusterName>" \
100+
--subscription <SUBSCRIPTION_ID> | grep -a5 updateStrategy
101+
98102
"strategyType": "Rack",
99103
"thresholdType": "PercentSuccess",
100104
"thresholdValue": 60,
@@ -106,14 +110,20 @@ In this example, if less than 60% of the compute nodes being provisioned in a ra
106110
The following example is for a customer using Rack by Rack strategy with a threshold type CountSuccess of 10 nodes per rack and a 1-minute pause.
107111

108112
```azurecli
109-
az networkcloud cluster update -n <CLUSTER_NAME> -g <CLUSTER_RG> --update-strategy strategy-type="Rack" threshold-type="CountSuccess" threshold-value=10 wait-time-minutes=1 --subscription <SUBSCRIPTION_ID>
113+
az networkcloud cluster update --name "<clusterName>" \
114+
--resource-group "<resourceGroup>" \
115+
--update-strategy strategy-type="Rack" threshold-type="CountSuccess" \
116+
threshold-value=10 wait-time-minutes=1 \
117+
--subscription <SUBSCRIPTION_ID>
110118
```
111119

112-
113120
Verify update:
114121

115122
```
116-
az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <SUBSCRIPTION_ID>| grep -a5 updateStrategy
123+
az networkcloud cluster show --resource-group "<resourceGroup>" \
124+
--name "<clusterName>" \
125+
--subscription <SUBSCRIPTION_ID> | grep -a5 updateStrategy
126+
117127
"strategyType": "Rack",
118128
"thresholdType": "CountSuccess",
119129
"thresholdValue": 10,
@@ -122,7 +132,9 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <S
122132

123133
In this example, if fewer than 10 compute nodes in a rack are successfully provisioned (evaluated rack by rack), the cluster deployment fails. If 10 or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
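
The threshold check described in the two examples can be sketched in Bash as follows. This is an illustration only: `rack_meets_threshold` is a hypothetical function, not an Azure CLI command, and it assumes that a rack passes when the number (or percentage) of successfully provisioned nodes is at least the threshold value. How the service rounds percentages isn't stated in this article.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Azure CLI): returns 0 if a rack meets
# the configured threshold, 1 if it doesn't, and 2 for an unknown type.
# Arguments: threshold type, threshold value, succeeded nodes, total nodes.
rack_meets_threshold() {
  local threshold_type="$1" threshold_value="$2" succeeded="$3" total="$4"

  case "$threshold_type" in
    PercentSuccess)
      # Integer form of: succeeded / total >= threshold_value / 100
      [ $(( succeeded * 100 )) -ge $(( threshold_value * total )) ]
      ;;
    CountSuccess)
      [ "$succeeded" -ge "$threshold_value" ]
      ;;
    *)
      echo "unknown threshold type: $threshold_type" >&2
      return 2
      ;;
  esac
}
```

For a rack of 16 compute nodes, `rack_meets_threshold CountSuccess 10 9 16` returns a nonzero status (9 is fewer than 10), while `rack_meets_threshold PercentSuccess 60 10 16` returns 0, because 10 of 16 nodes is 62.5%.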
124134

125-
***NOTE: `update-strategy` cannot be changed after the cluster runtime upgrade has started.***
135+
> [!NOTE]
136+
> ***`update-strategy` cannot be changed after the cluster runtime upgrade has started.***
137+
> When a threshold value below 100% is set, unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that the upgrade was successful. To troubleshoot issues with bare metal machines, see [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md).
126138
127139
## Upgrading cluster runtime using CLI
128140

@@ -158,52 +170,7 @@ az networkcloud cluster show --cluster-name "clusterName" --resource-group "reso
158170
The output should show the target cluster's information, including the cluster's detailed status and detailed status message.
159171
For more detailed insights on the upgrade progress, the individual node in each Rack can be checked for status. An example of checking the status is provided in the reference section under [BareMetal Machine roles](./reference-near-edge-baremetal-machine-roles.md).
160172

161-
## Configure compute threshold parameters for runtime upgrade using cluster updateStrategy
162-
163-
The following Azure CLI command is used to configure the compute threshold parameters for a runtime upgrade:
164-
165-
```azurecli
166-
az networkcloud cluster update /
167-
--name "<clusterName>" /
168-
--resource-group "<resourceGroup>" /
169-
--update-strategy strategy-type="Rack" threshold-type="PercentSuccess" /
170-
threshold-value="<thresholdValue>" max-unavailable=<maxNodesOffline> /
171-
wait-time-minutes=<waitTimeBetweenRacks>
172-
```
173-
174-
Required parameters:
175-
- strategy-type: Defines the update strategy. In this case, "Rack" means updates occur rack-by-rack. The default value is `Rack`.
176-
- threshold-type: Determines how the threshold should be evaluated, applied in the units defined by the strategy. The default value is `PercentSuccess`.
177-
- threshold-value: The numeric threshold value used to evaluate an update. The default value is 80.
178-
179-
Optional parameters:
180-
- max-unavailable: The maximum number of worker nodes that can be offline, that is, upgraded rack at a time. The default value is 32767.
181-
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is 15.
182-
183-
The following example shows usage of the command:
184-
185-
```azurecli
186-
az networkcloud cluster update --name "cluster01" --resource-group "cluster01-rg" --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=70 max-unavailable=16 wait-time-minutes=15
187-
```
188-
189-
Upon successful execution of the command, the updateStrategy values specified are applied to the cluster:
190-
191-
```
192-
"updateStrategy": {
193-
"maxUnavailable": 16,
194-
"strategyType": "Rack",
195-
"thresholdType": "PercentSuccess",
196-
"thresholdValue": 70,
197-
"waitTimeMinutes": 15,
198-
}
199-
```
200-
201-
> [!NOTE]
202-
> When a threshold value below 100% is set, it’s possible that any unhealthy nodes might not be upgraded, yet the “Cluster” status could still indicate that upgrade was successful. For troubleshooting issues with bare metal machines, please refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md)
203-
204-
## Upgrade with PauseRack strategy
205173

206-
Starting with API version 2024-06-01-preview, you can trigger runtime upgrades using a "PauseRack" strategy. When you execute a Cluster runtime upgrade with the PauseRack" strategy, it will update one rack at a time in the Cluster and then stop, awaiting confirmation before proceeding to the next rack. All existing thresholds continue to be respected with the "PauseRack" strategy. To carry out a Cluster runtime upgrade using the "PauseRack" strategy follow the steps outlined in [Upgrading cluster runtime with a pause rack strategy](howto-cluster-runtime-upgrade-with-pauserack-strategy.md)
207174

208175
## Frequently Asked Questions
209176
