
Commit a8e8d64

Merge pull request #17523 from abhilashaagarwala/patch-83
Update TSGs and known issues
2 parents 7b5e95e + 5abfcea commit a8e8d64

7 files changed: +251 −11 lines

AKS-Arc/TOC.yml

Lines changed: 13 additions & 6 deletions
@@ -157,18 +157,25 @@
   href: aks-arc-diagnostic-checker.md
 - name: KubeAPIServer unreachable error
   href: kube-api-server-unreachable.md
-- name: Deleted AKS Arc cluster still visible on Azure portal
-  href: deleted-cluster-visible.md
+- name: Can't create/scale AKS cluster due to image issues
+  href: gallery-image-not-usable.md
+- name: Disk space exhaustion on control plane VMs
+  href: kube-apiserver-log-overflow.md
+- name: Telemetry pod consumes too much memory and CPU
+  href: telemetry-pod-resources.md
+- name: Issues after deleting storage volumes
+  href: delete-storage-volume.md
 - name: Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources
   href: delete-cluster-pdb.md
+- name: Azure Advisor upgrade recommendation
+  href: azure-advisor-upgrade.md
+- name: Deleted AKS Arc cluster still visible on Azure portal
+  href: deleted-cluster-visible.md
 - name: Can't see VM SKUs on Azure portal
   href: check-vm-sku.md
 - name: Connectivity issues with MetalLB
   href: load-balancer-issues.md
-- name: Azure Advisor upgrade recommendation
-  href: azure-advisor-upgrade.md
-- name: Issues after deleting storage volumes
-  href: delete-storage-volume.md
+
 - name: Reference
   items:
   - name: Azure CLI

AKS-Arc/aks-troubleshoot.md

Lines changed: 5 additions & 2 deletions
@@ -6,7 +6,7 @@ author: sethmanheim
 ms.date: 04/01/2025
 ms.author: sethm
 ms.lastreviewed: 04/01/2025
-ms.reviewer: guanghu
+ms.reviewer: abha

 ---

@@ -24,6 +24,9 @@ The following sections describe known issues for AKS enabled by Azure Arc:

 | AKS Arc CRUD operation | Issue | Fix status |
 |------------------------|-------|------------|
+| AKS cluster create | [Can't create AKS cluster or scale node pool because of issues with AKS Arc images](gallery-image-not-usable.md) | Partially fixed in 2503 release |
+| AKS steady state | [AKS Arc telemetry pod consumes too much memory and CPU](telemetry-pod-resources.md) | Active |
+| AKS steady state | [Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs](kube-apiserver-log-overflow.md) | Active |
 | AKS cluster delete | [Deleted AKS Arc cluster still visible on Azure portal](deleted-cluster-visible.md) | Active |
 | AKS cluster delete | [Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources](delete-cluster-pdb.md) | Fixed in 2503 release |
 | Azure portal | [Can't see VM SKUs on Azure portal](check-vm-sku.md) | Fixed in 2411 release |

@@ -38,7 +41,7 @@ The following sections describe known issues for AKS enabled by Azure Arc:
 | Create validation | [K8sVersionValidation error](cluster-k8s-version.md)
 | Create validation | [KubeAPIServer unreachable error](kube-api-server-unreachable.md)
 | Network configuration issues | [Use diagnostic checker](aks-arc-diagnostic-checker.md)
-| Kubernetes steady state | [Issues after deleting storage volume](delete-storage-volume.md)
+| Kubernetes steady state | [Resolve issues due to out-of-band deletion of storage volumes](delete-storage-volume.md)
 | Release validation | [Azure Advisor upgrade recommendation message](azure-advisor-upgrade.md)

 ## Next steps

AKS-Arc/delete-cluster-pdb.md

Lines changed: 4 additions & 2 deletions
@@ -19,9 +19,11 @@ When you delete an AKS Arc cluster that has [PodDisruptionBudget](https://kubern

 This issue was fixed in [AKS on Azure Local, version 2503](aks-whats-new-23h2.md#release-2503).

-If you're on an older build, please update to Azure Local, version 2503. Once you update to 2503, you can retry deleting the AKS cluster. If the retry doesn't work, follow this workaround. File a support case if the retry does not delete the AKS cluster.
+- **For deleting an AKS cluster** with a PodDisruptionBudget: If you're on an older build, update to Azure Local, version 2503. After you update to 2503, retry deleting the AKS cluster. File a support case if you're on the 2503 release and your AKS cluster still isn't deleted after at least one retry.
+- **For deleting a node pool** with a PodDisruptionBudget: By design, the node pool isn't deleted if a PodDisruptionBudget exists, to protect applications. Use the following workaround to delete the PDB resources, and then retry deleting the node pool.

-## Workaround for AKS Edge Essentials and prior versions of AKS on Azure Local.
+
+## Workaround for AKS Edge Essentials and older versions of AKS on Azure Local

 Before you delete the AKS Arc cluster, access the AKS Arc cluster's **kubeconfig** and delete all PDBs:
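A minimal sketch of that cleanup with `kubectl`, assuming you've already saved the cluster's kubeconfig locally (the kubeconfig file name below is a placeholder):

```bash
# List every PodDisruptionBudget in the cluster, across all namespaces
kubectl get poddisruptionbudgets -A --kubeconfig ./aks-arc-kubeconfig

# Delete them all, then retry the cluster or node pool deletion
kubectl delete poddisruptionbudgets --all -A --kubeconfig ./aks-arc-kubeconfig
```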

AKS-Arc/gallery-image-not-usable.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
---
title: Kubernetes cluster create or nodepool scale failing due to AKS Arc image issues
description: Learn about a known issue with Kubernetes cluster create or nodepool scale failing due to AKS Arc VHD image download issues.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha

---

# Can't create AKS cluster or scale node pool because of issues with AKS Arc images

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

## Symptoms

You see the following error when you try to create the AKS cluster:

```output
Kubernetes version 1.29.4 is not ready for use on Linux. Please go to https://aka.ms/aksarccheckk8sversions for details of how to check the readiness of Kubernetes versions.
```

You might also see the following error when you try to scale a node pool:

```output
error with code NodepoolPrecheckFailed occured: AksHci nodepool creation precheck failed. Detailed message: 1 error occurred:\n\t* rpc error: code = Unknown desc = GalleryImage not usable, health state degraded: Degraded
```

When you run `az aksarc get-versions`, you see errors such as:

```output
...
              {
                "errorMessage": "failed cloud-side provisioning image linux-cblmariner-0.4.1.11203 to cloud gallery: {\n  \"code\": \"ImageProvisionError\",\n  \"message\": \"force failed to deprovision existing gallery image: failed to delete gallery image linux-cblmariner-0.4.1.11203: rpc error: code = Unknown desc = sa659p1012: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.202.244.4:45000: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\\\"\",\n  \"additionalInfo\": [\n   {\n    \"type\": \"providerImageProvisionInfo\",\n    \"info\": {\n     \"ProviderDownload\": \"True\"\n    }\n   }\n  ],\n  \"category\": \"\"\n }",
                "osSku": "CBLMariner",
                "osType": "Linux",
                "ready": false
              },
...
```
## Mitigation

- This issue was fixed in [AKS on Azure Local, version 2503](aks-whats-new-23h2.md#release-2503).
- Upgrade your Azure Local deployment to the 2503 build.
- Once updated, confirm that the images downloaded successfully by running the `az aksarc get-versions` command (see the sketch after this list).
- For new AKS clusters: cluster creation should now succeed.
- For scaling existing AKS clusters: scaling can still encounter issues. If it does, file a support case.
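A minimal sketch of that check; the resource group and custom location names are placeholders, and filtering the JSON output with `grep` is just one convenient way to spot images that aren't ready:

```bash
# Placeholder names: substitute your own resource group and custom location
az aksarc get-versions \
  --resource-group myResourceGroup \
  --custom-location myCustomLocation \
  --output json | grep -E '"ready"|"errorMessage"'

# Every image you plan to use should report "ready": true and no errorMessage
```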
## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)

AKS-Arc/kube-apiserver-log-overflow.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
---
title: Disk space exhaustion on the control plane VMs due to accumulation of kube-apiserver audit logs
description: Learn about a known issue with disk space exhaustion on the control plane VMs due to accumulation of kube-apiserver audit logs.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha

---

# Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

## Symptoms

When you run kubectl commands against the cluster, you might see errors such as:

```output
kubectl get ns
Error from server (InternalError): an error on the server ("Internal Server Error: \"/api/v1/namespaces?limit=500\": unknown") has prevented the request from succeeding (get namespaces)
```

When you SSH into the control plane VM, you might find that it has run out of disk space, specifically on the **/dev/sda2** partition. This is caused by the accumulation of kube-apiserver audit logs in the **/var/log/kube-apiserver** directory, which can consume approximately 90 GB of disk space.

```output
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.8G   84K  3.8G   1% /dev/shm
tmpfs           1.6G  179M  1.4G  12% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sda2        99G   99G     0 100% /
tmpfs           3.8G     0  3.8G   0% /tmp
tmpfs           769M     0  769M   0% /run/user/1002
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ sudo ls -l /var/log/kube-apiserver|wc -l
890
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ sudo du -h /var/log/kube-apiserver
87G     /var/log/kube-apiserver
```

The issue occurs because the `--audit-log-maxbackup` value is set to 0. This setting allows the audit logs to accumulate without limit, eventually filling up the disk.

## Mitigation

To resolve the issue temporarily, manually clean up the old audit logs:

- SSH into the control plane virtual machine (VM) of your AKS Arc cluster.
- Remove the old audit logs from the **/var/log/kube-apiserver** folder.
- If you have multiple control plane nodes, repeat this process on each control plane VM.

[SSH into the control plane VM](ssh-connect-to-windows-and-linux-worker-nodes.md) and navigate to the kube-apiserver logs directory:

```bash
cd /var/log/kube-apiserver
```

Remove the old audit log files:

```bash
rm audit-*.log
```
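If you prefer to keep the most recent rotated logs rather than removing them all, a minimal sketch (assuming the `audit-*.log` naming shown above, keeping the three newest files) is:

```bash
# Keep the three newest rotated audit logs and delete the rest
ls -1t audit-*.log | tail -n +4 | xargs -r sudo rm --
```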
Exit the SSH session:

```bash
exit
```

## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)

AKS-Arc/scale-requirements.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ This article describes the maximum and minimum supported scale count for AKS on
 | Standard_D4s_v3 | 4 | 16 |
 | Standard_D8s_v3 | 8 | 32 |
 | Standard_D16s_v3 | 16 | 64 |
-| Standard_D8s_v3 | 32 | 128 |
+| Standard_D32s_v3 | 32 | 128 |

 For more worker node sizes with GPU support, see the next section.

AKS-Arc/telemetry-pod-resources.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
---
title: AKS Arc telemetry pod consumes too much memory and CPU
description: Learn how to troubleshoot when AKS Arc telemetry pod consumes too much memory and CPU.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha

---

# AKS Arc telemetry pod consumes too much memory and CPU

## Symptoms

The **akshci-telemetry** pod in an AKS Arc cluster can, over time, consume a large amount of CPU and memory. If metrics are enabled, you can check the pod's CPU and memory usage with the following `kubectl` command:

```bash
kubectl -n kube-system top pod -l app=akshci-telemetry
```

You might see output similar to the following:

```output
NAME                              CPU(cores)   MEMORY(bytes)
akshci-telemetry-5df56fd5-rjqk4   996m         152Mi
```

## Mitigation

To resolve this issue, set default **resource limits** for the pods in the `kube-system` namespace.

### Important notes

- Verify whether any pods in the **kube-system** namespace might require more memory than the default limit setting. If so, adjust the limits accordingly.
- The **LimitRange** is applied to the **namespace**; in this case, the `kube-system` namespace. The default resource limits also apply to new pods that don't specify their own limits.
- **Existing pods**, including those that already have resource limits, aren't affected.
- **New pods** that don't specify their own resource limits are constrained by the limits set in the next section.
- After you set the resource limits and delete the telemetry pod, the new pod might eventually hit the memory limit and generate **OOM (Out-Of-Memory)** errors. This is a temporary mitigation.
To proceed with setting the resource limits, run the following script. While the script uses `az aksarc get-credentials`, you can also use `az connectedk8s proxy` to get the proxy kubeconfig and access the Kubernetes cluster.

### Define the LimitRange YAML to set default CPU and memory limits

```powershell
# Set the $cluster_name and $resource_group of the aksarc cluster
$cluster_name = ""
$resource_group = ""

# Connect to the aksarc cluster
az aksarc get-credentials -n $cluster_name -g $resource_group --admin -f "./kubeconfig-$cluster_name"

$limitRangeYaml = @'
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-resource-constraint
  namespace: kube-system
spec:
  limits:
  - default: # this section defines default limits for containers that haven't specified any limits
      cpu: 250m
      memory: 250Mi
    defaultRequest: # this section defines default requests for containers that haven't specified any requests
      cpu: 10m
      memory: 20Mi
    type: Container
'@

$limitRangeYaml | kubectl apply --kubeconfig "./kubeconfig-$cluster_name" -f -

kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
kubectl delete pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"

sleep 5
kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
```

### Validate that the resource limits were applied correctly

1. Check the resource limits in the pod's YAML configuration:

   ```powershell
   kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name" -o yaml
   ```

1. In the output, verify that the `resources` section includes the limits:

   ```yaml
   resources:
     limits:
       cpu: 250m
       memory: 250Mi
     requests:
       cpu: 10m
       memory: 20Mi
   ```
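Optionally, you can also confirm that the LimitRange itself exists in the `kube-system` namespace. A minimal check, reusing the kubeconfig from the script above:

```powershell
# Show the LimitRange and the default limits/requests it applies in kube-system
kubectl describe limitrange cpu-mem-resource-constraint -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
```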
## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)
