articles/operator-nexus/howto-cluster-runtime-upgrade.md
26 additions & 26 deletions
@@ -1,6 +1,6 @@
1
1
---
2
2
title: "Azure Operator Nexus: Runtime upgrade"
3
-
description: Learn to execute a cluster runtime upgrade for Operator Nexus
3
+
description: Learn to execute a Cluster runtime upgrade for Operator Nexus
4
4
author: bartpinto
5
5
ms.author: bpinto
6
6
ms.service: azure-operator-nexus
@@ -10,7 +10,7 @@ ms.date: 02/25/2025
10
10
# ms.custom: template-include
11
11
---
12
12
13
-
# Upgrade cluster runtime from Azure CLI
13
+
# Upgrade Cluster runtime from Azure CLI
14
14
15
15
This how-to guide explains the steps for installing the Azure CLI and extensions required to interact with Operator Nexus.
16
16
@@ -23,23 +23,23 @@ This how-to guide explains the steps for installing the required Azure CLI and e
23
23
- Subscription ID (`SUBSCRIPTION`)
24
24
- Cluster name (`CLUSTER`)
25
25
- Resource group (`CLUSTER_RG`)
26
-
1. Target cluster must be healthy in a running state, with all control plane nodes healthy.
26
+
1. Target Cluster must be healthy in a running state, with all control plane nodes healthy.
27
27
28
28
## Checking current runtime version
29
-
Verify current cluster runtime version before upgrade:
30
-
[How to check current cluster runtime version.](./howto-check-runtime-version.md#check-current-cluster-runtime-version)
29
+
Verify current Cluster runtime version before upgrade:
30
+
[How to check current Cluster runtime version.](./howto-check-runtime-version.md#check-current-cluster-runtime-version)
31
31
32
32
## Finding available runtime versions
33
33
34
34
### Via Azure portal
35
35
36
-
To find available upgradeable runtime versions, navigate to the target cluster in the Azure portal. In the cluster's overview pane, navigate to the ***Available upgrade versions*** tab.
36
+
To find available upgradeable runtime versions, navigate to the target Cluster in the Azure portal. In the Cluster's overview pane, navigate to the ***Available upgrade versions*** tab.
37
37
38
-
:::image type="content" source="./media/runtime-upgrade-upgradeable-runtime-versions.png" alt-text="Screenshot of Azure portal showing correct tab to identify available cluster upgrades." lightbox="./media/runtime-upgrade-upgradeable-runtime-versions.png":::
38
+
:::image type="content" source="./media/runtime-upgrade-upgradeable-runtime-versions.png" alt-text="Screenshot of Azure portal showing correct tab to identify available Cluster upgrades." lightbox="./media/runtime-upgrade-upgradeable-runtime-versions.png":::
39
39
40
-
From the **available upgrade versions** tab, we're able to see the different cluster versions that are currently available to upgrade. The operator can select from the listed the target runtime versions. Once selected, proceed to upgrade the cluster.
40
+
From the **Available upgrade versions** tab, we can see the Cluster versions that are currently available as upgrade targets. The operator can select a target runtime version from those listed. Once a version is selected, proceed to upgrade the Cluster.
41
41
42
-
:::image type="content" source="./media/runtime-upgrade-runtime-version.png" lightbox="./media/runtime-upgrade-runtime-version.png" alt-text="Screenshot of Azure portal showing available cluster upgrades.":::
42
+
:::image type="content" source="./media/runtime-upgrade-runtime-version.png" lightbox="./media/runtime-upgrade-runtime-version.png" alt-text="Screenshot of Azure portal showing available Cluster upgrades.":::
43
43
44
44
### Via Azure CLI
45
45
@@ -66,9 +66,9 @@ In the output, you can find the `availableUpgradeVersions` property and look at
66
66
],
67
67
```
68
68
69
-
If there are no available cluster upgrades, the list is empty.
69
+
If there are no available Cluster upgrades, the list is empty.
70
70
71
-
## Configure compute threshold parameters for runtime upgrade using cluster updateStrategy
71
+
## Configure compute threshold parameters for runtime upgrade using Cluster `updateStrategy`
72
72
73
73
The following Azure CLI command is used to configure the compute threshold parameters for a runtime upgrade:
- strategy-type: Defines the update strategy. Setting used are `Rack` (Rack by Rack) OR `PauseAfterRack` (Pause for user before each Rack starts). The default value is `Rack`. To perform a cluster runtime upgrade using the `PauseAfterRack` strategy, follow the steps outlined in [Upgrade Cluster Runtime with PauseAfterRack Strategy](howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md).
85
+
- strategy-type: Defines the update strategy. Settings used are `Rack` (Rack-by-Rack) OR `PauseAfterRack` (Pause for user before each Rack starts). The default value is `Rack`. To perform a Cluster runtime upgrade using the `PauseAfterRack` strategy, follow the steps outlined in [Upgrade Cluster Runtime with PauseAfterRack Strategy](howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md).
86
86
- threshold-type: Determines how the threshold should be evaluated, applied in the units defined by the strategy. Settings used are `PercentSuccess` OR `CountSuccess`. The default value is `PercentSuccess`.
87
87
- threshold-value: The numeric threshold value used to evaluate an update. The default value is `80`.
88
88
89
89
Optional parameters:
90
90
- max-unavailable: The maximum number of worker nodes that can be offline (that is, being upgraded) in a rack at one time. The default value is `32767`.
91
91
- wait-time-minutes: The delay or waiting period before updating a rack. The default value is `15`.
92
92
93
-
The following example is for a customer using Rack by Rack strategy with a Percent Success of 60% and a 1-minute pause.
93
+
The following example is for a customer using Rack-by-Rack strategy with a Percent Success of 60% and a 1-minute pause.
94
94
95
95
```azurecli
96
96
az networkcloud cluster update --name "<CLUSTER>" \
@@ -115,9 +115,9 @@ az networkcloud cluster show --name "<CLUSTER>" \
115
115
"waitTimeMinutes": 1
116
116
```
117
117
118
-
In this example, if less than 60% of the compute nodes being provisioned in a rack fail to provision (on a Rack by Rack basis), the cluster upgrade waits indefinitely until the condition is met. If 60% or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes. If there are too many failures in the rack, the hardware must be repaired before the upgrade can continue.
118
+
In this example, if fewer than 60% of the compute nodes being provisioned in a rack provision successfully (on a Rack-by-Rack basis), the Cluster upgrade waits indefinitely until the condition is met. If 60% or more of the compute nodes are successfully provisioned, Cluster deployment moves on to the next rack of compute nodes. If there are too many failures in the rack, the hardware must be repaired before the upgrade can continue.
119
119
120
-
The following example is for a customer using Rack by Rack strategy with a threshold type CountSuccess of 10 nodes per rack and a 1-minute pause.
120
+
The following example is for a customer using Rack-by-Rack strategy with a threshold type `CountSuccess` of 10 nodes per rack and a 1-minute pause.
121
121
122
122
```azurecli
123
123
az networkcloud cluster update --name "<CLUSTER>" \
@@ -142,13 +142,13 @@ az networkcloud cluster show --name "<CLUSTER>" \
142
142
"waitTimeMinutes": 1
143
143
```
144
144
145
-
In this example, if less than 10 compute nodes being provisioned in a rack fail to provision (on a Rack by Rack basis), the cluster upgrade will wait indefinitely until the condition is met. If 10 or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes. If there are too many failures in the rack, the hardware must be repaired before the upgrade can continue.
145
+
In this example, if fewer than 10 of the compute nodes being provisioned in a rack provision successfully (on a Rack-by-Rack basis), the Cluster upgrade waits indefinitely until the condition is met. If 10 or more of the compute nodes are successfully provisioned, Cluster deployment moves on to the next rack of compute nodes. If there are too many failures in the rack, the hardware must be repaired before the upgrade can continue.
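
The two threshold examples above can be modeled with a small shell function. This is only an illustrative sketch of the rule as described in this article, not the platform's implementation; the function name `rack_threshold_met` and its argument order are hypothetical.

```shell
# Illustrative model only (hypothetical helper, not part of the Azure CLI).
# Arguments: threshold type, threshold value, succeeded node count, total node count.
# Returns 0 when the rack meets the threshold, 1 when it does not,
# and 2 for an unknown threshold type.
rack_threshold_met() {
  case "$1" in
    PercentSuccess)
      # Integer comparison without division: succeeded/total >= value/100
      [ "$4" -gt 0 ] && [ $(( $3 * 100 )) -ge $(( $2 * $4 )) ]
      ;;
    CountSuccess)
      # At least <value> nodes in the rack must have provisioned successfully
      [ "$3" -ge "$2" ]
      ;;
    *)
      echo "unknown threshold type: $1" >&2
      return 2
      ;;
  esac
}

# 5 of 8 nodes succeeded (62.5%), so a 60% threshold is met
rack_threshold_met PercentSuccess 60 5 8 && echo "proceed to next rack"
# 9 of 16 nodes succeeded, so a count threshold of 10 is not met
rack_threshold_met CountSuccess 10 9 16 || echo "wait: fewer than 10 nodes succeeded"
```

The percentage check multiplies instead of dividing so that it stays in integer arithmetic, which is all that POSIX shell arithmetic provides.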
146
146
147
147
> [!NOTE]
148
-
> ***`update-strategy` cannot be changed after the cluster runtime upgrade has started.***
148
+
> ***`update-strategy` cannot be changed after the Cluster runtime upgrade has started.***
149
149
> When a threshold value below 100% is set, unhealthy nodes might not be upgraded, yet the "Cluster" status could still indicate that the upgrade was successful. For troubleshooting issues with bare metal machines, refer to [Troubleshoot Azure Operator Nexus server problems](troubleshoot-reboot-reimage-replace.md).
150
150
151
-
## Upgrade cluster runtime using CLI
151
+
## Upgrade Cluster runtime using CLI
152
152
153
153
To perform an upgrade of the runtime, use the following Azure CLI command:
The runtime upgrade is a long process. The upgrade first upgrades the management nodes and then sequentially Rack by Rack for the worker nodes.
162
+
The runtime upgrade is a long process. The upgrade first upgrades the management nodes and then sequentially Rack-by-Rack for the worker nodes.
163
163
The upgrade is considered to be finished when 80% of worker nodes per rack and 100% of management nodes are successfully upgraded.
164
164
Workloads might be impacted while the worker nodes in a rack are being upgraded; however, workloads in all other racks aren't impacted. Consideration of workload placement in light of this implementation design is encouraged.
165
165
166
166
Upgrading all the nodes takes multiple hours, depending upon how many racks exist for the Cluster.
167
167
Due to the length of the upgrade process, the Cluster's detailed status should be checked periodically for the current state of the upgrade.
168
168
To check on the status of the upgrade, observe the detailed status of the Cluster. This check can be done via the Azure portal or the Azure CLI.
169
169
170
-
To view the upgrade status through the Azure portal, navigate to the targeted cluster resource. In the cluster's *Overview* screen, the detailed status is provided along with a detailed status message.
170
+
To view the upgrade status through the Azure portal, navigate to the targeted Cluster resource. In the Cluster's *Overview* screen, the detailed status is provided along with a detailed status message.
171
171
172
172
The Cluster upgrade is in progress when `detailedStatus` is set to `Updating` and `detailedStatusMessage` shows the progress of the upgrade. Some examples of upgrade progress shown in `detailedStatusMessage` are `Waiting for control plane upgrade to complete...`, `Waiting for nodepool "<rack-id>" to finish upgrading...`, etc.
173
173
174
174
The Cluster upgrade is complete when `detailedStatus` is set to `Running` and `detailedStatusMessage` shows the message `Cluster is up and running`.
175
175
176
-
:::image type="content" source="./media/runtime-upgrade-cluster-detail-status.png" lightbox="./media/runtime-upgrade-cluster-detail-status.png" alt-text="Screenshot of Azure portal showing in progress cluster upgrade.":::
176
+
:::image type="content" source="./media/runtime-upgrade-cluster-detail-status.png" lightbox="./media/runtime-upgrade-cluster-detail-status.png" alt-text="Screenshot of Azure portal showing in progress Cluster upgrade.":::
177
177
178
178
To view the upgrade status through the Azure CLI, use `az networkcloud cluster show`.
179
179
@@ -183,7 +183,7 @@ az networkcloud cluster show --cluster-name "<CLUSTER>" \
183
183
--subscription "<SUBSCRIPTION>"
184
184
```
185
185
186
-
The output should be the target cluster's information and the cluster's detailed status and detail status message should be present.
186
+
The output should contain the target Cluster's information, including the Cluster's detailed status and detailed status message.
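
Because the upgrade runs for hours, the status check can be scripted. The following is a minimal sketch, assuming that `detailedStatus` is a top-level key in the `az networkcloud cluster show` output so that the global `--query` and `--output tsv` arguments can extract it. The helper names `cluster_field` and `wait_for_running` are hypothetical, and the placeholder values must be replaced.

```shell
# Placeholders from the prerequisites; replace with real values.
CLUSTER="<CLUSTER>"
CLUSTER_RG="<CLUSTER_RG>"
SUBSCRIPTION="<SUBSCRIPTION>"

# Hypothetical helper: prints one field of the Cluster resource.
# Assumes the field named in $1 is a top-level key in the command output.
cluster_field() {
  az networkcloud cluster show --name "$CLUSTER" \
    --resource-group "$CLUSTER_RG" \
    --subscription "$SUBSCRIPTION" \
    --query "$1" --output tsv
}

# Hypothetical helper: polls detailedStatus up to $1 times, sleeping $2 seconds
# between checks. Returns 0 once the status is Running, 1 if the checks run out.
wait_for_running() {
  checks_left="$1"
  while [ "$checks_left" -gt 0 ]; do
    status="$(cluster_field detailedStatus)"
    echo "detailedStatus: $status"
    if [ "$status" = "Running" ]; then
      return 0
    fi
    checks_left=$((checks_left - 1))
    sleep "$2"
  done
  return 1
}
```

For example, `wait_for_running 48 300` would check every five minutes for up to four hours. A Cluster that hasn't yet started upgrading may also report `Running`, so it's safer to start polling only after the status has changed to `Updating`.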
187
187
For more detailed insights on the upgrade progress, the individual node in each Rack can be checked for status. An example of checking the status is provided in the reference section under [BareMetal Machine roles](./reference-near-edge-baremetal-machine-roles.md).
188
188
189
189
@@ -192,7 +192,7 @@ For more detailed insights on the upgrade progress, the individual node in each
192
192
### Identifying Cluster Upgrade Stalled/Stuck
193
193
194
194
During a runtime upgrade, it's possible that the upgrade fails to move forward but the detail status reflects that the upgrade is still ongoing. **Because the runtime upgrade can take a very long time to successfully finish, there's no set timeout length currently specified**.
195
-
Hence, it's advisable to also check periodically on your cluster's detail status and logs to determine if your upgrade is indefinitely attempting to upgrade.
195
+
Hence, it's advisable to also check periodically on your Cluster's detailed status and logs to determine if your upgrade is indefinitely attempting to upgrade.
196
196
197
197
We can identify an `indefinitely attempting to upgrade` situation by looking at the Cluster's logs, detailed status, and detailed status message. If the upgrade stalls, we would observe that the Cluster is continuously reconciling over the same step indefinitely and not moving forward. From here, we recommend checking the Cluster logs or the configured Log Analytics Workspace (LAW) to see if there's a failure, or a specific upgrade step that is causing the lack of progress.
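
One rough way to notice a lack of progress is to compare the detailed status message across several checks. The sketch below assumes `detailedStatusMessage` is a top-level key in the `az networkcloud cluster show` output; the helper names `status_message` and `message_unchanged_for` are hypothetical. An unchanged message is only a hint, because a single rack can legitimately report the same message for a long time.

```shell
# Hypothetical helper: prints the current detailed status message.
# Replace the placeholders with real values before use.
status_message() {
  az networkcloud cluster show --name "<CLUSTER>" \
    --resource-group "<CLUSTER_RG>" \
    --subscription "<SUBSCRIPTION>" \
    --query "detailedStatusMessage" --output tsv
}

# Hypothetical helper: returns 0 (possibly stalled) when the message is
# identical across $1 consecutive checks taken $2 seconds apart, and
# returns 1 as soon as the message changes (progress observed).
message_unchanged_for() {
  remaining="$1"
  previous="$(status_message)"
  while [ "$remaining" -gt 1 ]; do
    sleep "$2"
    current="$(status_message)"
    if [ "$current" != "$previous" ]; then
      return 1
    fi
    remaining=$((remaining - 1))
  done
  return 0
}
```

For example, `message_unchanged_for 6 1800` would flag a message that hasn't changed across six checks spaced 30 minutes apart, which is a reasonable point to start reviewing the logs.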
198
198
@@ -202,7 +202,7 @@ A guide for identifying issues with provisioning worker nodes is provided at [Tr
If a hardware failure during an upgrade occurs, the runtime upgrade continues as long as the set thresholds are met for the compute and management/control nodes. Once the machine is fixed or replaced, it gets provisioned with the current platform runtime's OS, which contains the targeted version of the runtime. If a rack was updated before a failure, then the upgraded runtime version would be used when the nodes are reprovisioned. If the rack's spec wasn't updated to the upgraded runtime version before the hardware failure, the machine would be provisioned with the previous runtime version when it's repaired. It is upgraded along with the rack when the rack starts its upgrade.
206
-
### After a runtime upgrade, the cluster shows "Failed" Provisioning State
205
+
If a hardware failure occurs during an upgrade, the runtime upgrade continues as long as the set thresholds are met for the compute and management/control nodes. Once the machine is fixed or replaced, it gets provisioned with the current platform runtime's OS, which contains the targeted version of the runtime. If a rack was updated before a failure, then the upgraded runtime version is used when the nodes are reprovisioned. If the rack's spec wasn't updated to the upgraded runtime version before the hardware failure, the machine is provisioned with the previous runtime version when the hardware is repaired. The machine is then upgraded along with the rack when the rack starts its upgrade.
206
+
### After a runtime upgrade, the Cluster shows "Failed" Provisioning State
207
207
208
-
During a runtime upgrade, the cluster enters a state of `Upgrading`. If the runtime upgrade fails, the cluster goes into a `Failed` provisioning state. Infrastructure components (e.g the Storage Appliance) may cause failures during the upgrade. In some scenarios, it may be necessary to diagnose the failure with Microsoft support.
208
+
During a runtime upgrade, the Cluster enters a state of `Upgrading`. If the runtime upgrade fails, the Cluster goes into a `Failed` provisioning state. Infrastructure components (for example, the Storage Appliance) may cause failures during the upgrade. In some scenarios, it may be necessary to diagnose the failure with Microsoft support.