
Commit 416096c: Fix merge conflict

Merge commit (2 parents: 549dc74 + 6656cd4)

26 files changed: +565 −1010 lines

AKS-Arc/TOC.yml: 6 additions, 4 deletions

@@ -161,12 +161,14 @@
       href: aks-troubleshoot.md
     - name: AKS on Azure Local support policy
       href: aks-on-azure-local-support-policy.md
+    - name: Get support
+      href: help-support.md
+    - name: Use diagnostic checker
+      href: aks-arc-diagnostic-checker.md
     - name: Control plane configuration validation errors
       href: control-plane-validation-errors.md
     - name: K8sVersionValidation error
       href: cluster-k8s-version.md
-    - name: Use diagnostic checker
-      href: aks-arc-diagnostic-checker.md
     - name: KubeAPIServer unreachable error
       href: kube-api-server-unreachable.md
     - name: Can't create/scale AKS cluster due to image issues
@@ -195,6 +197,8 @@
       href: entra-prompts.md
     - name: BGP with FRR not working
       href: connectivity-troubleshoot.md
+    - name: Cluster status stuck during upgrade
+      href: cluster-upgrade-status.md
     - name: Reference
       items:
       - name: Azure CLI
@@ -607,8 +611,6 @@
       href: known-issues.yml
     - name: Support policies
       href: support-policies.md
-    - name: Get support
-      href: help-support.md
     - name: File bugs
       href: https://aka.ms/AKS-hybrid-issues
     - name: Release notes

AKS-Arc/aks-arc-diagnostic-checker.md: 8 additions, 8 deletions

@@ -4,7 +4,7 @@ description: Learn how to diagnose common causes for failures in AKS Arc.
 ms.topic: troubleshooting
 author: sethmanheim
 ms.author: sethm
-ms.date: 01/30/2025
+ms.date: 06/27/2025
 ms.reviewer: abha
 
 #Customer intent: As an AKS user, I want to use the diagnostic checker to run diagnostic checks on my AKS cluster to find out common causes for AKS cluster create failure.
@@ -13,14 +13,14 @@ ms.reviewer: abha
 
 # Use the diagnostic checker to diagnose and fix environment issues for AKS cluster creation failure (preview)
 
-It can be difficult to identify environment-related issues, such as networking configurations, that can result in an AKS cluster creation failure. The diagnostic checker is a PowerShell-based tool that can help you identify AKS cluster creation failures due to potential issues in the environment.
+It can be difficult to identify environment-related issues, such as networking configuration, that can result in an AKS cluster creation failure. The diagnostic checker is a PowerShell tool that can help you identify AKS cluster creation failures due to potential issues in the environment.
 
 > [!NOTE]
-> You can only use the diagnostic checker tool if an AKS cluster was created, but is in a failed state. You can't use the tool if you don't see an AKS cluster on the Azure portal. If the AKS cluster creation fails before an Azure Resource Manager resource is created, [file a support request](aks-troubleshoot.md#open-a-support-request).
+> You can only use the diagnostic checker tool if an AKS cluster was created, but is in a failed state. You can't use the tool if you don't see an AKS cluster on the Azure portal. If the AKS cluster creation fails before an Azure Resource Manager resource is created, [file a support request](help-support.md).
 
 ## Before you begin
 
-Before you begin, make sure you have the following prerequisites. If you don't meet the requirements for running the diagnostic checker tool, [file a support request](aks-troubleshoot.md#open-a-support-request):
+Before you begin, make sure you have the following prerequisites. If you don't meet the requirements for running the diagnostic checker tool, [file a support request](help-support.md):
 
 - Direct access to the Azure Local cluster where you created the AKS cluster. This access can be through remote desktop (RDP), or you can also sign in to one of the Azure Local physical nodes.
 - Review the [networking concepts for creating an AKS cluster](aks-hci-network-system-requirements.md) and the [AKS cluster architecture](cluster-architecture.md).
@@ -43,17 +43,17 @@ VMName IPAddresses
 <cluster-name>-XXXXXX-control-plane-XXXXXX {172.16.0.10, 172.16.0.4, fe80::ec:d3ff:fea0:1}
 ```
 
-If you don't see a control plane VM as shown in the previous output, [file a support request](aks-troubleshoot.md#open-a-support-request).
+If you don't see a control plane VM as shown in the previous output, [file a support request](help-support.md).
 
 If you see a control plane VM, and it has:
 
-- 0 IPv4 addresses: file a [support request](aks-troubleshoot.md#open-a-support-request).
+- 0 IPv4 addresses: file a [support request](help-support.md).
 - 1 IP address: use the IPv4 address as the input for the `vmIP` parameter.
 - 2 IP addresses: use either IPv4 address as the input for the `vmIP` parameter in the diagnostic checker.
 
 ## Run the diagnostic checker script
 
-Copy the following PowerShell script `run_diagnostic.ps1` into any one node of your Azure Local cluster:
+Copy the following PowerShell script named `run_diagnostic.ps1` into any one node of your Azure Local cluster:
 
 ```powershell
 <#
@@ -288,4 +288,4 @@ The following table provides a summary of each test performed by the script, inc
 
 ## Next steps
 
-If the problem persists, collect [AKS cluster logs](get-on-demand-logs.md) before you [create a support request](aks-troubleshoot.md#open-a-support-request).
+If the problem persists, collect [AKS cluster logs](get-on-demand-logs.md) before you [create a support request](help-support.md).
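The `vmIP` guidance in this diff follows a simple rule: with no IPv4 address on the control plane VM you file a support request, and with one or two IPv4 addresses any of them works as input. As an illustration only (this helper is hypothetical, not part of the shipped `run_diagnostic.ps1` script), the selection rule can be sketched in Python:

```python
import ipaddress

def pick_vm_ip(addresses):
    """Pick the vmIP input for the diagnostic checker from a control plane
    VM's address list, per the doc's rule: no IPv4 address means there is
    no usable input (return None, file a support request); with one or two
    IPv4 addresses, any works, so return the first. IPv6 (for example the
    link-local fe80:: address) is ignored."""
    ipv4 = [a for a in addresses
            if isinstance(ipaddress.ip_address(a), ipaddress.IPv4Address)]
    return ipv4[0] if ipv4 else None
```

For instance, with the sample `Get-VM` output shown in the diff, `pick_vm_ip(["172.16.0.10", "172.16.0.4", "fe80::ec:d3ff:fea0:1"])` returns `"172.16.0.10"`.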

AKS-Arc/aks-troubleshoot.md: 5 additions, 3 deletions

@@ -3,7 +3,7 @@ title: Troubleshoot common issues in AKS enabled by Azure Arc
 description: Learn about common issues and workarounds in AKS enabled by Arc.
 ms.topic: how-to
 author: sethmanheim
-ms.date: 06/18/2025
+ms.date: 06/27/2025
 ms.author: sethm
 ms.lastreviewed: 04/01/2025
 ms.reviewer: abha
@@ -16,7 +16,7 @@ This section describes how to find solutions for issues you encounter when using
 
 ## Open a support request
 
-To open a support request, see the [Get support](/azure/aks/hybrid/help-support) article for information about how to use the Azure portal to get support or open a support request for AKS Arc.
+To open a support request, see the [Get support](help-support.md) article for information about how to use the Azure portal to get support or open a support request for AKS Arc.
 
 ## Known issues
 
@@ -28,6 +28,7 @@ The following sections describe known issues for AKS enabled by Azure Arc:
 | AKS steady state | [AKS Arc telemetry pod consumes too much memory and CPU](telemetry-pod-resources.md) | Active |
 | AKS steady state | [Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs](kube-apiserver-log-overflow.md) | Active |
 | AKS cluster delete | [Deleted AKS Arc cluster still visible on Azure portal](deleted-cluster-visible.md) | Active |
+| AKS cluster upgrade | [AKS Arc cluster stuck in "Upgrading" state](cluster-upgrade-status.md) | Fixed in 2505 release |
 | AKS cluster delete | [Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources](delete-cluster-pdb.md) | Fixed in 2503 release |
 | Azure portal | [Can't see VM SKUs on Azure portal](check-vm-sku.md) | Fixed in 2411 release |
 | MetalLB Arc extension | [Connectivity issues with MetalLB](load-balancer-issues.md) | Fixed in 2411 release |
@@ -42,9 +43,10 @@ The following sections describe known issues for AKS enabled by Azure Arc:
 | Create validation | [KubeAPIServer unreachable error](kube-api-server-unreachable.md) |
 | Network configuration issues | [Use diagnostic checker](aks-arc-diagnostic-checker.md) |
 | Kubernetes steady state | [Resolve issues due to out-of-band deletion of storage volumes](delete-storage-volume.md) |
+| Kubernetes steady state | [Repeated Entra authentication prompts when running kubectl with Kubernetes RBAC](entra-prompts.md) |
 | Release validation | [Azure Advisor upgrade recommendation message](azure-advisor-upgrade.md) |
 | Network validation | [Network validation error due to .local domain](network-validation-error-local.md) |
-| BGP with FRR not working | [Troubleshoot BGP with FRR in AKS Arc environments](connectivity-troubleshoot.md) |
+| Network validation | [Troubleshoot BGP with FRR in AKS Arc environments](connectivity-troubleshoot.md) |
 
 ## Next steps
 

AKS-Arc/cluster-upgrade-status.md: 147 additions (new file)
---
title: Troubleshoot issue in which the cluster is stuck in Upgrading state
description: Learn how to troubleshoot and mitigate the issue when an AKS enabled by Arc cluster is stuck in 'Upgrading' state.
ms.topic: troubleshooting
author: rcheeran
ms.author: rcheeran
ms.date: 06/27/2025
ms.reviewer: abha
---

# Troubleshoot AKS Arc cluster stuck in "Upgrading" state

This article describes how to fix an issue in which your Azure Kubernetes Service enabled by Arc (AKS Arc) cluster is stuck in the **Upgrading** state. This issue typically occurs after you update Azure Local to version 2503 or 2504 and then try to upgrade the Kubernetes version on your cluster.

## Symptoms

When you try to upgrade an AKS Arc cluster, the `currentState` property of the cluster remains **Upgrading**.

```azurecli
az aksarc upgrade --name "cluster-name" --resource-group "rg-name"
```

```output
===> Kubernetes might be unavailable during cluster upgrades.
Are you sure you want to perform this operation? (y/N): y
The cluster is on version 1.28.9 and is not in a failed state.

===> This will upgrade the control plane AND all nodepools to version 1.30.4. Continue? (y/N): y
Upgrading the AKSArc cluster. This operation might take a while...
{
  "extendedLocation": {
    "name": "/subscriptions/resourceGroups/Bellevue/providers/Microsoft.ExtendedLocation/customLocations/bel-CL",
    "type": "CustomLocation"
  },
  "id": "/subscriptions/fbaf508b-cb61-4383-9cda-a42bfa0c7bc9/resourceGroups/Bellevue/providers/Microsoft.Kubernetes/ConnectedClusters/Bel-cluster/providers/Microsoft.HybridContainerService/ProvisionedClusterInstances/default",
  "name": "default",
  "properties": {
    "kubernetesVersion": "1.30.4",
    "provisioningState": "Succeeded",
    "currentState": "Upgrading",
    "errorMessage": null,
    "operationStatus": null,
    "agentPoolProfiles": [
      {
      ...
```

## Cause

- The issue is caused by a change introduced in Azure Local version 2503. Under certain conditions, transient or intermittent failures during the Kubernetes upgrade process aren't correctly detected or recovered from, which can leave the cluster in the **Upgrading** state.
- You see this issue if the AKS Arc custom location extension `hybridaksextension` is version 2.1.211 or 2.1.223. Run the following commands to check the extension version on your cluster:

```azurecli
az login --use-device-code --tenant <Azure tenant ID>
az account set -s <subscription ID>
$res=get-archcimgmt
az k8s-extension show -g $res.HybridaksExtension.resourceGroup -c $res.ResourceBridge.name --cluster-type appliances --name hybridaksextension
```

```output
{
  "aksAssignedIdentity": null,
  "autoUpgradeMinorVersion": false,
  "configurationProtectedSettings": {},
  "currentVersion": "2.1.211",
  "customLocationSettings": null,
  "errorInfo": null,
  "extensionType": "microsoft.hybridaksoperator",
  ...
}
```

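The affected-version check described above can be scripted. A minimal sketch (a hypothetical helper, assuming you pass it the JSON printed by `az k8s-extension show`):

```python
import json

# Extension versions this article identifies as affected.
AFFECTED_VERSIONS = {"2.1.211", "2.1.223"}

def is_affected(extension_json: str) -> bool:
    """Return True if the hybridaksextension version reported by
    `az k8s-extension show` is one of the affected releases."""
    return json.loads(extension_json).get("currentVersion") in AFFECTED_VERSIONS
```

With the sample output above (`"currentVersion": "2.1.211"`), `is_affected` returns `True`, so the workaround below applies.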
## Mitigation

This issue is fixed in AKS on [Azure Local, version 2505](/azure/azure-local/whats-new?view=azloc-2505&preserve-view=true#features-and-improvements-in-2505). Upgrade your Azure Local deployment to the 2505 build. After you update, [verify that the Kubernetes version was upgraded](#verification) and that the `currentState` property of the cluster shows **Succeeded**.

### Workaround for Azure Local versions 2503 or 2504

This issue affects only clusters on Azure Local version 2503 or 2504 with AKS Arc extension version 2.1.211 or 2.1.223. Use the following workaround only if you can't upgrade to 2505.

You can resolve the issue by running the `az aksarc update` command, which restarts the upgrade flow. Run it with placeholder parameters that don't impact the state of the cluster: for example, enable the NFS or SMB CSI driver if it isn't already enabled. First, check whether either storage driver is already enabled:

```azurecli
az login --use-device-code --tenant <Azure tenant ID>
az account set -s <subscription ID>
az aksarc show -g <resource_group_name> -n <cluster_name>
```

Check the storage profile section:

```json
"storageProfile": {
  "nfsCsiDriver": {
    "enabled": false
  },
  "smbCsiDriver": {
    "enabled": true
  }
}
```

If one of the drivers is disabled, enable it using one of the following commands:

```azurecli
az aksarc update --enable-smb-driver -g <resource_group_name> -n <cluster_name>
az aksarc update --enable-nfs-driver -g <resource_group_name> -n <cluster_name>
```

Running the `az aksarc update` command should resolve the issue, and the `currentState` property of the cluster should now show **Succeeded**. Once the status is updated, if you don't want to keep the driver enabled, you can reverse the change by running one of the following commands:

```azurecli
az aksarc update --disable-smb-driver -g <resource_group_name> -n <cluster_name>
az aksarc update --disable-nfs-driver -g <resource_group_name> -n <cluster_name>
```

If both drivers are already enabled on your cluster, you can disable the one that's not in use. If you need both drivers to remain enabled, contact Microsoft Support for further assistance.

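The workaround above boils down to one decision: find a CSI driver that is currently disabled and toggle it to restart the upgrade flow. A sketch of that decision, assuming the `storageProfile` shape shown in the snippet (the helper itself is hypothetical; the real state comes from `az aksarc show`):

```python
import json

def driver_toggle_flag(show_output: str):
    """Given the JSON printed by `az aksarc show`, return the flag to pass
    to `az aksarc update` to enable a currently disabled CSI driver, or
    None if both drivers are already enabled (contact support instead)."""
    profile = json.loads(show_output).get("storageProfile", {})
    if not profile.get("nfsCsiDriver", {}).get("enabled", False):
        return "--enable-nfs-driver"
    if not profile.get("smbCsiDriver", {}).get("enabled", False):
        return "--enable-smb-driver"
    return None
```

With the sample storage profile shown above (NFS disabled, SMB enabled), the helper would pick `--enable-nfs-driver`.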
## Verification

To confirm the Kubernetes version upgrade is complete, run the following command and check that the `currentState` property in the JSON output is **Succeeded**:

```azurecli
az aksarc show -g <resource_group> -n <cluster_name>
```

```output
...
  "provisioningState": "Succeeded",
  "status": {
    "currentState": "Succeeded",
    "errorMessage": null,
    "operationStatus": null,
    "controlPlaneStatus": { ...
...
```

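If you want to script this verification, you can parse the `az aksarc show` output and read `currentState`. A minimal, hypothetical sketch; the outputs in this article show the field both under `properties.status` and directly under `properties`, so the helper checks both shapes:

```python
import json

def current_state(show_output: str):
    """Extract the cluster's currentState from `az aksarc show` JSON.
    Handles both nestings seen in this article: properties.status.currentState
    (verification output) and properties.currentState (symptoms output)."""
    doc = json.loads(show_output)
    props = doc.get("properties", doc)
    status = props.get("status", props)
    return status.get("currentState")
```

A value of `"Succeeded"` means the upgrade completed; `"Upgrading"` means the cluster is still (or stuck) in the upgrade flow.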
## Contact Microsoft Support

If the problem persists, collect the [AKS cluster logs](get-on-demand-logs.md) before you [create a support request](aks-troubleshoot.md#open-a-support-request).

## Next steps

- [Use the diagnostic checker tool to identify common environment issues](aks-arc-diagnostic-checker.md)
- [Review AKS on Azure Local architecture](cluster-architecture.md)

AKS-Arc/entra-prompts.md: 1 addition, 1 deletion

@@ -26,7 +26,7 @@ This issue is caused by [a GitHub bug](https://github.com/Azure/kubelogin/issues
 
 To mitigate this issue, you can use one of the following two methods:
 
-- Downgrade **kubelogin** to version 1.9.0. This stable version does not have the bug that causes repeated authentication prompts. You can [download this version from the GitHub repository](https://github.com/int128/kubelogin/releases/tag/v1.9.0). Select the appropriate asset for your OS or architecture, extract it, and replace your existing **kubelogin** binary.
+- Downgrade **kubelogin** to version 0.1.9. This stable version does not have the bug that causes repeated authentication prompts. You can [download this version from the GitHub repository](https://github.com/Azure/kubelogin/releases/tag/v0.1.9). Select the appropriate asset for your OS or architecture, extract it, and replace your existing **kubelogin** binary.
 - Alternatively, if you have administrator permissions, you can use the `--admin` flag with the `az aksarc get-credentials` command. This method bypasses **kubelogin** authentication by retrieving admin credentials directly:
 
 ```azurecli
