Commit c2c1c84

Merge pull request #18299 from sethmanheim/rctsg626
Pull commits for RC TSG
2 parents 9f2500f + 229d090 commit c2c1c84

File tree

3 files changed

+150
-0
lines changed

AKS-Arc/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -193,6 +193,8 @@
193193
href: entra-prompts.md
194194
- name: BGP with FRR not working
195195
href: connectivity-troubleshoot.md
196+
- name: Cluster status stuck during upgrade
197+
href: cluster-upgrade-status.md
196198
- name: Reference
197199
items:
198200
- name: Azure CLI

AKS-Arc/aks-troubleshoot.md

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ The following sections describe known issues for AKS enabled by Azure Arc:
2828
| AKS steady state | [AKS Arc telemetry pod consumes too much memory and CPU](telemetry-pod-resources.md) | Active |
2929
| AKS steady state | [Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs](kube-apiserver-log-overflow.md) | Active |
3030
| AKS cluster delete | [Deleted AKS Arc cluster still visible on Azure portal](deleted-cluster-visible.md) | Active |
31+
| AKS cluster upgrade | [AKS Arc cluster stuck in "Upgrading" state](cluster-upgrade-status.md) | Fixed in 2505 release |
3132
| AKS cluster delete | [Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources](delete-cluster-pdb.md) | Fixed in 2503 release |
3233
| Azure portal | [Can't see VM SKUs on Azure portal](check-vm-sku.md) | Fixed in 2411 release |
3334
| MetalLB Arc extension | [Connectivity issues with MetalLB](load-balancer-issues.md) | Fixed in 2411 release |

AKS-Arc/cluster-upgrade-status.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
1+
---
2+
title: Troubleshoot issue in which the cluster is stuck in Upgrading state
3+
description: Learn how to troubleshoot and mitigate the issue when an AKS enabled by Arc cluster is stuck in 'Upgrading' state.
4+
ms.topic: troubleshooting
5+
author: rcheeran
6+
ms.author: rcheeran
7+
ms.date: 06/27/2025
8+
ms.reviewer: abha
9+
10+
---
11+
12+
# Troubleshoot AKS Arc cluster stuck in "Upgrading" state
13+
14+
This article describes how to fix an issue in which your Azure Kubernetes Service enabled by Arc (AKS Arc) cluster is stuck in the **Upgrading** state. This issue typically occurs when you update Azure Local to version 2503 or 2504 and then try to upgrade the Kubernetes version on your cluster.
15+
16+
## Symptoms
17+
18+
When you try to upgrade an AKS Arc cluster, the `currentState` property of the cluster remains set to **Upgrading**, as shown in the following example:
19+
20+
```azurecli
21+
az aksarc upgrade --name "cluster-name" --resource-group "rg-name"
22+
```
23+
24+
```output
25+
===> Kubernetes might be unavailable during cluster upgrades.
26+
Are you sure you want to perform this operation? (y/N): y
27+
The cluster is on version 1.28.9 and is not in a failed state.
28+
29+
===> This will upgrade the control plane AND all nodepools to version 1.30.4. Continue? (y/N): y
30+
Upgrading the AKSArc cluster. This operation might take a while...
31+
{
32+
"extendedLocation": {
33+
"name": "/subscriptions/resourceGroups/Bellevue/providers/Microsoft.ExtendedLocation/customLocations/bel-CL",
34+
"type": "CustomLocation"
35+
},
36+
"id": "/subscriptions/fbaf508b-cb61-4383-9cda-a42bfa0c7bc9/resourceGroups/Bellevue/providers/Microsoft.Kubernetes/ConnectedClusters/Bel-cluster/providers/Microsoft.HybridContainerService/ProvisionedClusterInstances/default",
37+
"name": "default",
38+
"properties": {
39+
"kubernetesVersion": "1.30.4",
40+
"provisioningState": "Succeeded",
41+
"currentState": "Upgrading",
42+
"errorMessage": null,
43+
"operationStatus": null,
44+
"agentPoolProfiles": [
45+
{
46+
...
47+
```
48+
49+
## Cause
50+
51+
- The issue is caused by a change introduced in Azure Local version 2503. Under certain conditions, transient or intermittent failures during the Kubernetes upgrade process aren't correctly detected or recovered from, which can leave the cluster in the **Upgrading** state.
52+
- You see this issue if the AKS Arc custom location extension `hybridaksextension` version is 2.1.211 or 2.1.223. To check the extension version on your cluster, run the following commands:
53+
54+
```azurecli
55+
az login --use-device-code --tenant <Azure tenant ID>
56+
az account set -s <subscription ID>
57+
$res=get-archcimgmt
58+
az k8s-extension show -g $res.HybridaksExtension.resourceGroup -c $res.ResourceBridge.name --cluster-type appliances --name hybridaksextension
59+
```
60+
61+
```output
62+
{
63+
"aksAssignedIdentity": null,
64+
"autoUpgradeMinorVersion": false,
65+
"configurationProtectedSettings": {},
66+
"currentVersion": "2.1.211",
67+
"customLocationSettings": null,
68+
"errorInfo": null,
69+
"extensionType": "microsoft.hybridaksoperator",
70+
...
71+
}
72+
```
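
The version check can also be scripted. The following is a minimal sketch, not part of the official tooling: it assumes you saved or captured the JSON printed by `az k8s-extension show` and passes it in as a string. The function name `is_affected_extension` and the constant `AFFECTED_VERSIONS` are hypothetical; only the `currentVersion` key and the two version numbers come from this article.

```python
import json

# Extension versions this article lists as affected.
AFFECTED_VERSIONS = {"2.1.211", "2.1.223"}


def is_affected_extension(extension_json: str) -> bool:
    """Return True if the extension JSON reports an affected currentVersion.

    extension_json is assumed to be the JSON text printed by
    `az k8s-extension show` for the hybridaksextension extension.
    """
    extension = json.loads(extension_json)
    # A missing currentVersion key is treated as "not affected".
    return extension.get("currentVersion") in AFFECTED_VERSIONS
```

For the sample output above, which reports `"currentVersion": "2.1.211"`, the function returns `True`.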
73+
74+
## Mitigation
75+
76+
This issue was fixed in AKS on [Azure Local, version 2505](/azure/azure-local/whats-new?view=azloc-2505&preserve-view=true#features-and-improvements-in-2505). Upgrade your Azure Local deployment to the 2505 build. After you update, [verify that the Kubernetes version was upgraded](#verification) and that the `currentState` property of the cluster shows as **Succeeded**.
77+
78+
### Workaround for Azure Local versions 2503 or 2504
79+
80+
This issue affects only clusters on Azure Local version 2503 or 2504 that run AKS Arc extension version 2.1.211 or 2.1.223. Use the workaround described here only if you can't upgrade to 2505.
81+
82+
You can resolve the issue by running the `az aksarc update` command, which restarts the upgrade flow. You can run the command with placeholder parameters that don't affect the state of the cluster. For example, you can use it to enable the NFS or SMB driver if that driver isn't already enabled. First, check whether either storage driver is already enabled:
83+
84+
```azurecli
85+
az login --use-device-code --tenant <Azure tenant ID>
86+
az account set -s <subscription ID>
87+
az aksarc show -g <resource_group_name> -n <cluster_name>
88+
```
89+
90+
Check the storage profile section:
91+
92+
```json
93+
"storageProfile": {
94+
"nfsCsiDriver": {
95+
"enabled": false
96+
},
97+
"smbCsiDriver": {
98+
99+
"enabled": true
100+
}
101+
}
102+
```
103+
104+
If one of the drivers is disabled, you can enable it using one of the following commands:
105+
106+
```azurecli
107+
az aksarc update --enable-smb-driver -g <resource_group_name> -n <cluster_name>
108+
az aksarc update --enable-nfs-driver -g <resource_group_name> -n <cluster_name>
109+
```
110+
111+
Running the `az aksarc update` command should resolve the issue, and the `currentState` property of the cluster should then show as **Succeeded**. After the status updates, if you don't want to keep the driver enabled, you can reverse the change by running one of the following commands:
112+
113+
```azurecli
114+
az aksarc update --disable-smb-driver -g <resource_group_name> -n <cluster_name>
115+
az aksarc update --disable-nfs-driver -g <resource_group_name> -n <cluster_name>
116+
```
117+
118+
If both drivers are already enabled on your cluster, you can disable the one that's not in use. If you require both drivers to remain enabled, contact Microsoft Support for further assistance.
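
The choice of flag described in this workaround can be sketched as a small helper. This is an illustration under stated assumptions, not an official tool: the function name `pick_enable_flag` is hypothetical, it takes the `storageProfile` object from the `az aksarc show` output as a dictionary, and it treats a missing driver entry as disabled. The key names and flags are the ones shown in this article.

```python
from typing import Optional


def pick_enable_flag(storage_profile: dict) -> Optional[str]:
    """Return an `az aksarc update` flag that enables a disabled driver.

    Returns None when both drivers are already enabled. In that case,
    disable the driver that's not in use, or contact Microsoft Support
    if both must stay enabled.
    """
    flags = (
        ("nfsCsiDriver", "--enable-nfs-driver"),
        ("smbCsiDriver", "--enable-smb-driver"),
    )
    for key, flag in flags:
        # Assumption: a missing entry or missing "enabled" key means disabled.
        if not storage_profile.get(key, {}).get("enabled", False):
            return flag
    return None
```

For the sample storage profile shown earlier (NFS disabled, SMB enabled), the helper returns `--enable-nfs-driver`.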
119+
120+
## Verification
121+
122+
To confirm that the Kubernetes version upgrade is complete, run the following command and check that the `currentState` property in the JSON output is set to **Succeeded**.
123+
124+
```azurecli
125+
az aksarc show -g <resource_group_name> -n <cluster_name>
126+
```
127+
128+
```output
129+
...
130+
...
131+
"provisioningState": "Succeeded",
132+
"status": {
133+
"currentState": "Succeeded",
134+
"errorMessage": null,
135+
"operationStatus": null,
136+
"controlPlaneStatus": { ...
137+
...
138+
```
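
If you want to automate this check, the following sketch searches the JSON from `az aksarc show` for the `currentState` property. Because the output above is abbreviated, the sketch doesn't assume where the property is nested and returns the first `currentState` string it finds in a depth-first walk. If your output contains more than one such key, adjust the lookup accordingly. The function names `find_current_state` and `upgrade_finished` are hypothetical.

```python
import json
from typing import Any, Optional


def find_current_state(node: Any) -> Optional[str]:
    """Return the first string-valued "currentState" found in parsed JSON."""
    if isinstance(node, dict):
        value = node.get("currentState")
        if isinstance(value, str):
            return value
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        # Scalars (strings, numbers, null) can't contain the key.
        return None
    for child in children:
        found = find_current_state(child)
        if found is not None:
            return found
    return None


def upgrade_finished(show_output: str) -> bool:
    """Return True if the JSON text reports currentState as "Succeeded"."""
    return find_current_state(json.loads(show_output)) == "Succeeded"
```

Pass the captured JSON text to `upgrade_finished`. A `False` result means the property is missing or holds another value, such as **Upgrading**.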
139+
140+
## Contact Microsoft Support
141+
142+
If the problem persists, collect the [AKS cluster logs](get-on-demand-logs.md) before you [create a support request](aks-troubleshoot.md#open-a-support-request).
143+
144+
## Next steps
145+
146+
- [Use the diagnostic checker tool to identify common environment issues](aks-arc-diagnostic-checker.md)
147+
- [Review AKS on Azure Local architecture](cluster-architecture.md)
