
Commit 8fb6b7a

Merge pull request #3588 from MicrosoftDocs/main638793147027039486sync_temp
For protected branch, push strategy should use PR and merge to target branch method to work around git push error
2 parents ef8d7e9 + 4093912 commit 8fb6b7a

8 files changed: +253 −13 lines


AKS-Arc/TOC.yml

Lines changed: 13 additions & 6 deletions

```diff
@@ -157,18 +157,25 @@
     href: aks-arc-diagnostic-checker.md
   - name: KubeAPIServer unreachable error
     href: kube-api-server-unreachable.md
-  - name: Deleted AKS Arc cluster still visible on Azure portal
-    href: deleted-cluster-visible.md
+  - name: Can't create/scale AKS cluster due to image issues
+    href: gallery-image-not-usable.md
+  - name: Disk space exhaustion on control plane VMs
+    href: kube-apiserver-log-overflow.md
+  - name: Telemetry pod consumes too much memory and CPU
+    href: telemetry-pod-resources.md
+  - name: Issues after deleting storage volumes
+    href: delete-storage-volume.md
   - name: Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources
     href: delete-cluster-pdb.md
+  - name: Azure Advisor upgrade recommendation
+    href: azure-advisor-upgrade.md
+  - name: Deleted AKS Arc cluster still visible on Azure portal
+    href: deleted-cluster-visible.md
   - name: Can't see VM SKUs on Azure portal
     href: check-vm-sku.md
   - name: Connectivity issues with MetalLB
     href: load-balancer-issues.md
-  - name: Azure Advisor upgrade recommendation
-    href: azure-advisor-upgrade.md
-  - name: Issues after deleting storage volumes
-    href: delete-storage-volume.md
+
 - name: Reference
   items:
   - name: Azure CLI
```

AKS-Arc/aks-troubleshoot.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -6,7 +6,7 @@ author: sethmanheim
 ms.date: 04/01/2025
 ms.author: sethm
 ms.lastreviewed: 04/01/2025
-ms.reviewer: guanghu
+ms.reviewer: abha

 ---

@@ -24,6 +24,9 @@ The following sections describe known issues for AKS enabled by Azure Arc:

 | AKS Arc CRUD operation | Issue | Fix status |
 |------------------------|-------|------------|
+| AKS cluster create | [Can't create AKS cluster or scale node pool because of issues with AKS Arc images](gallery-image-not-usable.md) | Partially fixed in 2503 release |
+| AKS steady state | [AKS Arc telemetry pod consumes too much memory and CPU](telemetry-pod-resources.md) | Active |
+| AKS steady state | [Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs](kube-apiserver-log-overflow.md) | Active |
 | AKS cluster delete | [Deleted AKS Arc cluster still visible on Azure portal](deleted-cluster-visible.md) | Active |
 | AKS cluster delete | [Can't fully delete AKS Arc cluster with PodDisruptionBudget (PDB) resources](delete-cluster-pdb.md) | Fixed in 2503 release |
 | Azure portal | [Can't see VM SKUs on Azure portal](check-vm-sku.md) | Fixed in 2411 release |
@@ -38,7 +41,7 @@ The following sections describe known issues for AKS enabled by Azure Arc:
 | Create validation | [K8sVersionValidation error](cluster-k8s-version.md)
 | Create validation | [KubeAPIServer unreachable error](kube-api-server-unreachable.md)
 | Network configuration issues | [Use diagnostic checker](aks-arc-diagnostic-checker.md)
-| Kubernetes steady state | [Issues after deleting storage volume](delete-storage-volume.md)
+| Kubernetes steady state | [Resolve issues due to out-of-band deletion of storage volumes](delete-storage-volume.md)
 | Release validation | [Azure Advisor upgrade recommendation message](azure-advisor-upgrade.md)

 ## Next steps
```

AKS-Arc/delete-cluster-pdb.md

Lines changed: 4 additions & 2 deletions

```diff
@@ -19,9 +19,11 @@ When you delete an AKS Arc cluster that has [PodDisruptionBudget](https://kubern

 This issue was fixed in [AKS on Azure Local, version 2503](aks-whats-new-23h2.md#release-2503).

-If you're on an older build, please update to Azure Local, version 2503. Once you update to 2503, you can retry deleting the AKS cluster. If the retry doesn't work, follow this workaround. File a support case if the retry does not delete the AKS cluster.
+- **For deleting an AKS cluster** with a PodDisruptionBudget: If you're on an older build, update to Azure Local, version 2503, and then retry deleting the AKS cluster. File a support case if you're on the 2503 release and your AKS cluster isn't deleted after at least one retry.
+- **For deleting a node pool** with a PodDisruptionBudget: By design, the node pool isn't deleted if a PodDisruptionBudget exists, to protect applications. Use the following workaround to delete the PDB resources, and then retry deleting the node pool.

-## Workaround for AKS Edge Essentials and prior versions of AKS on Azure Local.
+## Workaround for AKS Edge Essentials and older versions of AKS on Azure Local

 Before you delete the AKS Arc cluster, access the AKS Arc cluster's **kubeconfig** and delete all PDBs:
```
AKS-Arc/gallery-image-not-usable.md

Lines changed: 54 additions & 0 deletions (new file)

---
title: Kubernetes cluster create or nodepool scale failing due to AKS Arc image issues
description: Learn about a known issue with Kubernetes cluster create or nodepool scale failing due to AKS Arc VHD image download issues.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha
---

# Can't create AKS cluster or scale node pool because of issues with AKS Arc images

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

## Symptoms

You see the following error when you try to create an AKS cluster:

```output
Kubernetes version 1.29.4 is not ready for use on Linux. Please go to https://aka.ms/aksarccheckk8sversions for details of how to check the readiness of Kubernetes versions.
```

You might also see the following error when you try to scale a node pool:

```output
error with code NodepoolPrecheckFailed occured: AksHci nodepool creation precheck failed. Detailed message: 1 error occurred:\n\t* rpc error: code = Unknown desc = GalleryImage not usable, health state degraded: Degraded
```

When you run `az aksarc get-versions`, you see errors such as:

```output
...
              {
                "errorMessage": "failed cloud-side provisioning image linux-cblmariner-0.4.1.11203 to cloud gallery: {\n  \"code\": \"ImageProvisionError\",\n  \"message\": \"force failed to deprovision existing gallery image: failed to delete gallery image linux-cblmariner-0.4.1.11203: rpc error: code = Unknown desc = sa659p1012: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.202.244.4:45000: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\\\"\",\n  \"additionalInfo\": [\n   {\n    \"type\": \"providerImageProvisionInfo\",\n    \"info\": {\n     \"ProviderDownload\": \"True\"\n    }\n   }\n  ],\n  \"category\": \"\"\n }",
                "osSku": "CBLMariner",
                "osType": "Linux",
                "ready": false
              },
...
```

## Mitigation

This issue was fixed in [AKS on Azure Local, version 2503](aks-whats-new-23h2.md#release-2503). If you're on an older build:

- Upgrade your Azure Local deployment to the 2503 build.
- Once updated, confirm that the images downloaded successfully by running the `az aksarc get-versions` command.
- New AKS clusters should now be created successfully.
- Scaling existing AKS clusters might still encounter issues; if so, file a support case.

## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)
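If you want to script the post-upgrade readiness check described above, a small parser over the `az aksarc get-versions -o json` output can flag unready images. This is an editor's sketch, not part of the article: the exact JSON nesting is an assumption based on the snippet shown in the Symptoms section, so the walker scans the document generically instead of relying on a documented schema.

```python
import json

def unready_images(get_versions_json: str):
    """Return (osSku, truncated errorMessage) pairs for every image
    entry whose 'ready' flag is false. The structure is assumed from
    the article's output snippet, so we walk the JSON generically."""
    doc = json.loads(get_versions_json)
    bad = []

    def walk(node):
        if isinstance(node, dict):
            if node.get("ready") is False:
                bad.append((node.get("osSku"), (node.get("errorMessage") or "")[:80]))
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)

    walk(doc)
    return bad

# Hypothetical sample shaped like the snippet in the Symptoms section.
sample = json.dumps({
    "values": [{
        "patchVersions": [
            {"osSku": "CBLMariner", "osType": "Linux", "ready": False,
             "errorMessage": "failed cloud-side provisioning image linux-cblmariner-0.4.1.11203 ..."},
            {"osSku": "Windows2022", "osType": "Windows", "ready": True},
        ]
    }]
})
print(unready_images(sample))
```

An empty list means every image reported ready; anything else lists the SKUs to investigate before retrying cluster create or node pool scale.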
AKS-Arc/kube-apiserver-log-overflow.md

Lines changed: 73 additions & 0 deletions (new file)

---
title: Disk space exhaustion on the control plane VMs due to accumulation of kube-apiserver audit logs
description: Learn about a known issue with disk space exhaustion on the control plane VMs due to accumulation of kube-apiserver audit logs.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha
---

# Disk space exhaustion on control plane VMs due to accumulation of kube-apiserver audit logs

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

## Symptoms

If kubectl commands fail, you might see errors such as:

```output
kubectl get ns
Error from server (InternalError): an error on the server ("Internal Server Error: \"/api/v1/namespaces?limit=500\": unknown") has prevented the request from succeeding (get namespaces)
```

When you SSH into the control plane VM, you might find that it ran out of disk space, specifically on the **/dev/sda2** partition. This is due to the accumulation of kube-apiserver audit logs in the **/var/log/kube-apiserver** directory, which can consume approximately 90 GB of disk space:

```output
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.8G   84K  3.8G   1% /dev/shm
tmpfs           1.6G  179M  1.4G  12% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sda2        99G   99G     0 100% /
tmpfs           3.8G     0  3.8G   0% /tmp
tmpfs           769M     0  769M   0% /run/user/1002
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ sudo ls -l /var/log/kube-apiserver|wc -l
890
clouduser@moc-laiwyj6tly6 [ /var/log/kube-apiserver ]$ sudo du -h /var/log/kube-apiserver
87G     /var/log/kube-apiserver
```

The issue occurs because the `--audit-log-maxbackup` value is set to 0. This setting allows the audit logs to accumulate without any limit, eventually filling up the disk.

## Mitigation

To resolve the issue temporarily, manually clean up the old audit logs:

- SSH into the control plane virtual machine (VM) of your AKS Arc cluster.
- Remove the old audit logs from the **/var/log/kube-apiserver** folder.
- If you have multiple control plane nodes, repeat this process on each control plane VM.

[SSH into the control plane VM](ssh-connect-to-windows-and-linux-worker-nodes.md) and navigate to the kube-apiserver logs directory:

```bash
cd /var/log/kube-apiserver
```

Remove the old audit log files:

```bash
sudo rm audit-*.log
```

Exit the SSH session:

```bash
exit
```

## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)
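As an editor's sketch of what a working `--audit-log-maxbackup` rotation would do, the following illustrative Python prunes all but the newest N audit logs. The `audit-*.log` naming matches the listing above, but the function name and the keep count are hypothetical; the manual `rm` steps above remain the documented mitigation, and you should not run ad-hoc scripts against a real control plane VM without understanding the risk.

```python
from pathlib import Path

def prune_audit_logs(log_dir: str, keep: int = 10) -> list[str]:
    """Delete all but the `keep` most recently modified audit-*.log
    files in `log_dir`, mirroring what --audit-log-maxbackup would
    enforce if it were set to `keep`. Returns deleted file names."""
    logs = sorted(
        Path(log_dir).glob("audit-*.log"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first; everything past index `keep` goes
    )
    deleted = []
    for stale in logs[keep:]:
        stale.unlink()
        deleted.append(stale.name)
    return deleted
```

Keeping a small fixed number of backups bounds disk usage the same way the missing kube-apiserver flag would, while preserving the most recent audit history.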

AKS-Arc/scale-requirements.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -64,7 +64,7 @@ This article describes the maximum and minimum supported scale count for AKS on
 | Standard_D4s_v3 | 4 | 16 |
 | Standard_D8s_v3 | 8 | 32 |
 | Standard_D16s_v3 | 16 | 64 |
-| Standard_D8s_v3 | 32 | 128 |
+| Standard_D32s_v3 | 32 | 128 |

 For more worker node sizes with GPU support, see the next section.
```
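To illustrate reading the corrected SKU table, here is a minimal helper that picks the smallest supported worker node SKU for a requested cores/memory footprint. The SKU data is copied from the table above; the helper itself is illustrative and not part of the article.

```python
# Worker node SKUs from the corrected table: name -> (vCPU cores, memory GB).
SUPPORTED_SKUS = {
    "Standard_D4s_v3": (4, 16),
    "Standard_D8s_v3": (8, 32),
    "Standard_D16s_v3": (16, 64),
    "Standard_D32s_v3": (32, 128),
}

def smallest_sku(min_cores: int, min_memory_gb: int):
    """Return the smallest supported SKU that satisfies the requested
    cores and memory, or None if nothing in the table fits."""
    fits = [
        (cores, mem, name)
        for name, (cores, mem) in SUPPORTED_SKUS.items()
        if cores >= min_cores and mem >= min_memory_gb
    ]
    return min(fits)[2] if fits else None

print(smallest_sku(6, 20))  # -> Standard_D8s_v3
```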

AKS-Arc/telemetry-pod-resources.md

Lines changed: 101 additions & 0 deletions (new file)

---
title: AKS Arc telemetry pod consumes too much memory and CPU
description: Learn how to troubleshoot when AKS Arc telemetry pod consumes too much memory and CPU.
ms.topic: troubleshooting
author: sethmanheim
ms.author: sethm
ms.date: 04/01/2025
ms.reviewer: abha
---

# AKS Arc telemetry pod consumes too much memory and CPU

## Symptoms

The **akshci-telemetry** pod in an AKS Arc cluster can, over time, consume a lot of CPU and memory resources. If metrics are enabled, you can verify the CPU and memory usage using the following `kubectl` command:

```bash
kubectl -n kube-system top pod -l app=akshci-telemetry
```

You might see output similar to this:

```output
NAME                              CPU(cores)   MEMORY(bytes)
akshci-telemetry-5df56fd5-rjqk4   996m         152Mi
```

## Mitigation

To resolve this issue, set default **resource limits** for the pods in the `kube-system` namespace.

### Important notes

- Verify whether any pods in the **kube-system** namespace might require more memory than the default limit setting. If so, adjustments might be needed.
- The **LimitRange** is applied to the **namespace**; in this case, the `kube-system` namespace. The default resource limits also apply to new pods that don't specify their own limits.
- **Existing pods**, including those that already have resource limits, aren't affected.
- **New pods** that don't specify their own resource limits are constrained by the limits set in the next section.
- After you set the resource limits and delete the telemetry pod, the new pod might eventually hit the memory limit and generate **OOM (Out-Of-Memory)** errors. This is a temporary mitigation.

To proceed with setting the resource limits, run the following script. While the script uses `az aksarc get-credentials`, you can also use `az connectedk8s proxy` to get the proxy kubeconfig and access the Kubernetes cluster.

### Define the LimitRange YAML to set default CPU and memory limits

```powershell
# Set the $cluster_name and $resource_group of the aksarc cluster
$cluster_name = ""
$resource_group = ""

# Connect to the aksarc cluster
az aksarc get-credentials -n $cluster_name -g $resource_group --admin -f "./kubeconfig-$cluster_name"

$limitRangeYaml = @'
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-resource-constraint
  namespace: kube-system
spec:
  limits:
  - default: # this section defines default limits for containers that haven't specified any limits
      cpu: 250m
      memory: 250Mi
    defaultRequest: # this section defines default requests for containers that haven't specified any requests
      cpu: 10m
      memory: 20Mi
    type: Container
'@

$limitRangeYaml | kubectl apply --kubeconfig "./kubeconfig-$cluster_name" -f -

kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
kubectl delete pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"

sleep 5
kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
```

### Validate that the resource limits were applied correctly

1. Check the resource limits in the pod's YAML configuration:

   ```powershell
   kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name" -o yaml
   ```

1. In the output, verify that the `resources` section includes the limits:

   ```yaml
   resources:
     limits:
       cpu: 250m
       memory: 250Mi
     requests:
       cpu: 10m
       memory: 20Mi
   ```

## Next steps

[Known issues in AKS enabled by Azure Arc](aks-known-issues.md)
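Because the LimitRange only constrains pods created after it's applied, the validation step above can be worth scripting across many pods. This editor's sketch (not part of the article) assumes the standard `kubectl get pods -o json` shape and lists containers that still lack a CPU or memory limit, meaning they pick up the namespace defaults only after re-creation:

```python
def containers_missing_limits(pod_json: dict):
    """Given parsed `kubectl get pods -o json` output, return
    (pod, container) pairs whose spec lacks a CPU or memory limit."""
    missing = []
    for pod in pod_json.get("items", []):
        pod_name = pod["metadata"]["name"]
        for container in pod["spec"].get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            if "cpu" not in limits or "memory" not in limits:
                missing.append((pod_name, container["name"]))
    return missing

# Hypothetical sample mirroring the article's telemetry pod scenario.
sample = {
    "items": [
        {"metadata": {"name": "akshci-telemetry-5df56fd5-rjqk4"},
         "spec": {"containers": [
             {"name": "telemetry", "resources": {}}]}},
        {"metadata": {"name": "coredns-abc"},
         "spec": {"containers": [
             {"name": "coredns",
              "resources": {"limits": {"cpu": "250m", "memory": "250Mi"}}}]}},
    ]
}
print(containers_missing_limits(sample))
```

Any pod it reports (like the telemetry pod before deletion) still runs without limits; deleting it lets the LimitRange defaults take effect on the replacement.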

azure-local/whats-new.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ ms.topic: overview
55
author: alkohli
66
ms.author: alkohli
77
ms.service: azure-local
8-
ms.date: 03/31/2025
8+
ms.date: 04/03/2025
99
---
1010

1111
# What's new in Azure Local?
@@ -23,7 +23,7 @@ This is a baseline release with the following features and improvements:
2323

2424
- **Registration and deployment changes**
2525
- **Extension installation**: Extensions are no longer installed during the registration of Azure Local machines. Instead, the extensions are installed in the machine validation step during the Azure Local instance deployment. For more information, see [Register with Arc via console](./deploy/deployment-arc-register-server-permissions.md) and [Deploy via Azure portal](./deploy/deploy-via-portal.md).
26-
- **Register via app**: You can bootstrap your Azure Local machines using the Configurator app. The local UI is now deprecated. For more information, see [Register Azure Local machines using Configurator app](./index.yml).
26+
- **Register via app**: You can bootstrap your Azure Local machines using the Configurator app. The local UI is now deprecated. For more information, see [Register Azure Local machines using Configurator app](./deploy/deployment-arc-register-configurator-app.md).
2727
- Composed image is now supported for Other Equipment Manufacturers (OEMs).
2828
- Several security enhancements were done for the Bootstrap service.
2929
- Service Principal Name (SPN) is deprecated for Arc registration.
