
Commit a8e8d64

Merge pull request #17523 from abhilashaagarwala/patch-83
Update TSGs and known issues
2 parents 7b5e95e + 5abfcea commit a8e8d64

7 files changed: +251 −11 lines

AKS-Arc/TOC.yml

Lines changed: 13 additions & 6 deletions
@@ -157,18 +157,25 @@
   href: aks-arc-diagnostic-checker.md
 - name: KubeAPIServer unreachable error
   href: kube-api-server-unreachable.md
-- name: Deleted AKS Arc cluster still visible on Azure portal
-  href: deleted-cluster-visible.md
+- name: Can't create/scale AKS cluster due to image issues
+  href: gallery-image-not-usable.md
+- name: Disk space exhaustion on control plane VMs
+  href: kube-apiserver-log-overflow.md
+- name: Telemetry pod consumes too much memory and CPU
+  href: telemetry-pod-resources.md
+- name: Issues after deleting storage volumes
+  href: delete-storage-volume.md
 - name: Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources
   href: delete-cluster-pdb.md
+- name: Azure Advisor upgrade recommendation
+  href: azure-advisor-upgrade.md
+- name: Deleted AKS Arc cluster still visible on Azure portal
+  href: deleted-cluster-visible.md
 - name: Can't see VM SKUs on Azure portal
   href: check-vm-sku.md
 - name: Connectivity issues with MetalLB
   href: load-balancer-issues.md
-- name: Azure Advisor upgrade recommendation
-  href: azure-advisor-upgrade.md
-- name: Issues after deleting storage volumes
-  href: delete-storage-volume.md
+
 - name: Reference
   items:
   - name: Azure CLI

AKS-Arc/aks-troubleshoot.md

Lines changed: 5 additions & 2 deletions
@@ -6,7 +6,7 @@ author: sethmanheim
 ms.date: 04/01/2025
 ms.author: sethm
 ms.lastreviewed: 04/01/2025
-ms.reviewer: guanghu
+ms.reviewer: abha

 ---

@@ -24,6 +24,9 @@ The following sections describe known issues for AKS enabled by Azure Arc:

 | AKS Arc CRUD operation | Issue | Fix status |
 |------------------------|-------|------------|
+| AKS cluster create | [Can't create AKS cluster or scale node pool because of issues with AKS Arc images](gallery-image-not-usable.md) | Partially fixed in 2503 release |
+| AKS steady state | [AKS Arc telemetry pod consumes too much memory and CPU](telemetry-pod-resources.md) | Active |
+| AKS steady state | [Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs](kube-apiserver-log-overflow.md) | Active |
 | AKS cluster delete | [Deleted AKS Arc cluster still visible on Azure portal](deleted-cluster-visible.md) | Active |
 | AKS cluster delete | [Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources](delete-cluster-pdb.md) | Fixed in 2503 release |
 | Azure portal | [Can't see VM SKUs on Azure portal](check-vm-sku.md) | Fixed in 2411 release |

@@ -38,7 +41,7 @@ The following sections describe known issues for AKS enabled by Azure Arc:
 | Create validation | [K8sVersionValidation error](cluster-k8s-version.md)
 | Create validation | [KubeAPIServer unreachable error](kube-api-server-unreachable.md)
 | Network configuration issues | [Use diagnostic checker](aks-arc-diagnostic-checker.md)
-| Kubernetes steady state | [Issues after deleting storage volume](delete-storage-volume.md)
+| Kubernetes steady state | [Resolve issues due to out-of-band deletion of storage volumes](delete-storage-volume.md)
 | Release validation | [Azure Advisor upgrade recommendation message](azure-advisor-upgrade.md)

 ## Next steps

AKS-Arc/delete-cluster-pdb.md

Lines changed: 4 additions & 2 deletions
@@ -19,9 +19,11 @@ When you delete an AKS Arc cluster that has [PodDisruptionBudget](https://kubern

 This issue was fixed in [AKS on Azure Local, version 2503](aks-whats-new-23h2.md#release-2503).

-If you're on an older build, please update to Azure Local, version 2503. Once you update to 2503, you can retry deleting the AKS cluster. If the retry doesn't work, follow this workaround. File a support case if the retry does not delete the AKS cluster.
+- **For deleting an AKS cluster** with a PodDisruptionBudget: If you're on an older build, update to Azure Local, version 2503. After you update to 2503, retry deleting the AKS cluster. File a support case if you're on the 2503 release and your AKS cluster still isn't deleted after at least one retry.
+- **For deleting a node pool** with a PodDisruptionBudget: By design, the node pool isn't deleted if a PodDisruptionBudget exists, to protect applications. Use the following workaround to delete the PDB resources, and then retry deleting the node pool.

-## Workaround for AKS Edge Essentials and prior versions of AKS on Azure Local.
+
+## Workaround for AKS Edge Essentials and older versions of AKS on Azure Local

 Before you delete the AKS Arc cluster, access the AKS Arc cluster's **kubeconfig** and delete all PDBs:
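A minimal sketch of that cleanup with `kubectl`, assuming you've already saved the cluster's kubeconfig locally (the kubeconfig file name below is a placeholder):

```bash
# List every PodDisruptionBudget in the cluster, across all namespaces
kubectl get poddisruptionbudgets -A --kubeconfig ./aks-arc-kubeconfig

# Delete them all, then retry the cluster or node pool deletion
kubectl delete poddisruptionbudgets --all -A --kubeconfig ./aks-arc-kubeconfig
```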

AKS-Arc/gallery-image-not-usable.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
---
title: Kubernetes cluster create or nodepool scale failing due to AKS Arc image issues
description: Learn about a known issue with Kubernetes cluster create or nodepool scale failing due to AKS Arc VHD image download issues.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha

---

# Can't create AKS cluster or scale node pool because of issues with AKS Arc images

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

## Symptoms

You see the following error when you try to create the AKS cluster:

```output
Kubernetes version 1.29.4 is not ready for use on Linux. Please go to https://aka.ms/aksarccheckk8sversions for details of how to check the readiness of Kubernetes versions.
```

You might also see the following error when you try to scale a node pool:

```output
error with code NodepoolPrecheckFailed occured: AksHci nodepool creation precheck failed. Detailed message: 1 error occurred:\n\t* rpc error: code = Unknown desc = GalleryImage not usable, health state degraded: Degraded
```

When you run `az aksarc get-versions`, you see errors such as:

```output
...
              {
                "errorMessage": "failed cloud-side provisioning image linux-cblmariner-0.4.1.11203 to cloud gallery: {\n  \"code\": \"ImageProvisionError\",\n  \"message\": \"force failed to deprovision existing gallery image: failed to delete gallery image linux-cblmariner-0.4.1.11203: rpc error: code = Unknown desc = sa659p1012: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.202.244.4:45000: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\\\"\",\n  \"additionalInfo\": [\n   {\n    \"type\": \"providerImageProvisionInfo\",\n    \"info\": {\n     \"ProviderDownload\": \"True\"\n    }\n   }\n  ],\n  \"category\": \"\"\n }",
                "osSku": "CBLMariner",
                "osType": "Linux",
                "ready": false
              },
...
```
## Mitigation

- This issue was fixed in [AKS on Azure Local, version 2503](aks-whats-new-23h2.md#release-2503).
- Upgrade your Azure Local deployment to the 2503 build.
- Once updated, confirm that the images downloaded successfully by running the `az aksarc get-versions` command (see the sketch after this list).
- For new AKS clusters: cluster creation should now succeed.
- For scaling existing AKS clusters: scaling can still encounter issues. If it does, file a support case.
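A minimal sketch of that check; the resource group and custom location names are placeholders, and filtering the JSON output with `grep` is just one convenient way to spot images that aren't ready:

```bash
# Placeholder names: substitute your own resource group and custom location
az aksarc get-versions \
  --resource-group myResourceGroup \
  --custom-location myCustomLocation \
  --output json | grep -E '"ready"|"errorMessage"'

# Every image you plan to use should report "ready": true and no errorMessage
```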
## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)

AKS-Arc/kube-apiserver-log-overflow.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
---
title: Disk space exhaustion on the control plane VMs due to accumulation of kube-apiserver audit logs
description: Learn about a known issue with disk space exhaustion on the control plane VMs due to accumulation of kube-apiserver audit logs.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha

---

# Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

## Symptoms

When you run kubectl commands against the cluster, you might see errors such as:

```output
kubectl get ns
Error from server (InternalError): an error on the server ("Internal Server Error: \"/api/v1/namespaces?limit=500\": unknown") has prevented the request from succeeding (get namespaces)
```

When you SSH into the control plane VM, you might find that it has run out of disk space, specifically on the **/dev/sda2** partition. This is caused by the accumulation of kube-apiserver audit logs in the **/var/log/kube-apiserver** directory, which can consume approximately 90 GB of disk space.

```output
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.8G   84K  3.8G   1% /dev/shm
tmpfs           1.6G  179M  1.4G  12% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sda2        99G   99G     0 100% /
tmpfs           3.8G     0  3.8G   0% /tmp
tmpfs           769M     0  769M   0% /run/user/1002
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ sudo ls -l /var/log/kube-apiserver|wc -l
890
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ sudo du -h /var/log/kube-apiserver
87G     /var/log/kube-apiserver
```

The issue occurs because the `--audit-log-maxbackup` value is set to 0. This setting allows the audit logs to accumulate without limit, eventually filling up the disk.

## Mitigation

To resolve the issue temporarily, manually clean up the old audit logs:

- SSH into the control plane virtual machine (VM) of your AKS Arc cluster.
- Remove the old audit logs from the **/var/log/kube-apiserver** folder.
- If you have multiple control plane nodes, repeat this process on each control plane VM.

[SSH into the control plane VM](ssh-connect-to-windows-and-linux-worker-nodes.md) and navigate to the kube-apiserver logs directory:

```bash
cd /var/log/kube-apiserver
```

Remove the old audit log files:

```bash
rm audit-*.log
```
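If you prefer to keep the most recent rotated logs rather than removing them all, a minimal sketch (assuming the `audit-*.log` naming shown above, keeping the three newest files) is:

```bash
# Keep the three newest rotated audit logs and delete the rest
ls -1t audit-*.log | tail -n +4 | xargs -r sudo rm --
```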
Exit the SSH session:

```bash
exit
```

## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)

AKS-Arc/scale-requirements.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ This article describes the maximum and minimum supported scale count for AKS on
 | Standard_D4s_v3 | 4 | 16 |
 | Standard_D8s_v3 | 8 | 32 |
 | Standard_D16s_v3 | 16 | 64 |
-| Standard_D8s_v3 | 32 | 128 |
+| Standard_D32s_v3 | 32 | 128 |

 For more worker node sizes with GPU support, see the next section.

AKS-Arc/telemetry-pod-resources.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
---
title: AKS Arc telemetry pod consumes too much memory and CPU
description: Learn how to troubleshoot when AKS Arc telemetry pod consumes too much memory and CPU.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha

---

# AKS Arc telemetry pod consumes too much memory and CPU

## Symptoms

The **akshci-telemetry** pod in an AKS Arc cluster can, over time, consume a large amount of CPU and memory. If metrics are enabled, you can check the pod's CPU and memory usage with the following `kubectl` command:

```bash
kubectl -n kube-system top pod -l app=akshci-telemetry
```

You might see output similar to the following:

```output
NAME                              CPU(cores)   MEMORY(bytes)
akshci-telemetry-5df56fd5-rjqk4   996m         152Mi
```

## Mitigation

To resolve this issue, set default **resource limits** for the pods in the `kube-system` namespace.

### Important notes

- Verify whether any pods in the **kube-system** namespace might require more memory than the default limit setting. If so, adjust the limits accordingly.
- The **LimitRange** is applied to the **namespace**; in this case, the `kube-system` namespace. The default resource limits also apply to new pods that don't specify their own limits.
- **Existing pods**, including those that already have resource limits, aren't affected.
- **New pods** that don't specify their own resource limits are constrained by the limits set in the next section.
- After you set the resource limits and delete the telemetry pod, the new pod might eventually hit the memory limit and generate **OOM (Out-Of-Memory)** errors. This is a temporary mitigation.
To proceed with setting the resource limits, run the following script. While the script uses `az aksarc get-credentials`, you can also use `az connectedk8s proxy` to get the proxy kubeconfig and access the Kubernetes cluster.

### Define the LimitRange YAML to set default CPU and memory limits

```powershell
# Set the $cluster_name and $resource_group of the aksarc cluster
$cluster_name = ""
$resource_group = ""

# Connect to the aksarc cluster
az aksarc get-credentials -n $cluster_name -g $resource_group --admin -f "./kubeconfig-$cluster_name"

$limitRangeYaml = @'
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-resource-constraint
  namespace: kube-system
spec:
  limits:
  - default: # this section defines default limits for containers that haven't specified any limits
      cpu: 250m
      memory: 250Mi
    defaultRequest: # this section defines default requests for containers that haven't specified any requests
      cpu: 10m
      memory: 20Mi
    type: Container
'@

$limitRangeYaml | kubectl apply --kubeconfig "./kubeconfig-$cluster_name" -f -

kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
kubectl delete pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"

sleep 5
kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
```

### Validate that the resource limits were applied correctly

1. Check the resource limits in the pod's YAML configuration:

   ```powershell
   kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name" -o yaml
   ```

1. In the output, verify that the `resources` section includes the limits:

   ```yaml
   resources:
     limits:
       cpu: 250m
       memory: 250Mi
     requests:
       cpu: 10m
       memory: 20Mi
   ```
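Optionally, you can also confirm that the LimitRange itself exists in the `kube-system` namespace. A minimal check, reusing the kubeconfig from the script above:

```powershell
# Show the LimitRange and the default limits/requests it applies in kube-system
kubectl describe limitrange cpu-mem-resource-constraint -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
```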
## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)
