Merge pull request #291168 from fhryo-msft/patch-40

prmerger-automator[bot] · web-flow · commit 16a050fc9619 · 2024-11-28T01:52:24.000Z
Update troubleshoot-container-storage.md
diff --git a/articles/storage/container-storage/troubleshoot-container-storage.md b/articles/storage/container-storage/troubleshoot-container-storage.md
@@ -18,7 +18,7 @@ ms.topic: how-to
 
 After running `az aks create`, you might see the message *Azure Container Storage failed to install. AKS cluster is created. Please run `az aks update` along with `--enable-azure-container-storage` to enable Azure Container Storage*.
 
-This message means that Azure Container Storage wasn't installed, but your AKS cluster was created properly.
+This message means that Azure Container Storage wasn't installed, but your AKS (Azure Kubernetes Service) cluster was created properly.
 
 To install Azure Container Storage on the cluster and create a storage pool, run the following command. Replace `<cluster-name>` and `<resource-group>` with your own values. Replace `<storage-pool-type>` with `azureDisk`, `ephemeraldisk`, or `elasticSan`.
 
@@ -28,7 +28,7 @@ az aks update -n <cluster-name> -g <resource-group> --enable-azure-container-sto
 
 ### Azure Container Storage fails to install due to Azure Policy restrictions
 
-Azure Container Storage might fail to install if Azure Policy restrictions are in place. Specifically, Azure Container Storage relies on privileged containers, which can be blocked by Azure Policy. When this happens, the installation of Azure Container Storage might timeout or fail, and you might see errors in the `gatekeeper-controller` logs such as:
+Azure Container Storage might fail to install if Azure Policy restrictions are in place. Specifically, Azure Container Storage relies on privileged containers, which can be blocked by Azure Policy. When they are blocked, the installation of Azure Container Storage might time out or fail, and you might see errors in the `gatekeeper-controller` logs such as:
 
 ```output
 $ kubectl logs -n gatekeeper-system deployment/gatekeeper-controller
@@ -41,7 +41,7 @@ $ kubectl logs -n gatekeeper-system deployment/gatekeeper-controller
 {"level":"info","ts":1722622449.2412128,"logger":"webhook","msg":"denied admission: Privileged container is not allowed: ndm, securityContext: {\"privileged\": true}","hookType":"validation","process":"admission","details":{},"event_type":"violation","constraint_name":"azurepolicy-k8sazurev2noprivilege-686dd8b209a774ba977c","constraint_group":"constraints.gatekeeper.sh","constraint_api_version":"v1beta1","constraint_kind":"K8sAzureV2NoPrivilege","constraint_action":"deny","resource_group":"","resource_api_version":"v1","resource_kind":"Pod","resource_namespace":"acstor","resource_name":"azurecontainerstorage-ndm-b5nfg","request_username":"system:serviceaccount:kube-system:daemon-set-controller"}
 ```
 
-To resolve this, you’ll need to add the `acstor` namespace to the exclusion list of your Azure Policy. Azure Policy is used to create and enforce rules for managing resources within Azure, including AKS clusters. In some cases, policies might block the creation of Azure Container Storage pods and components. You can find more details on working with Azure Policy for Kubernetes by consulting [Azure Policy for Kubernetes](/azure/governance/policy/concepts/policy-for-kubernetes).
+To resolve the blocking, you need to add the `acstor` namespace to the exclusion list of your Azure Policy. Azure Policy is used to create and enforce rules for managing resources within Azure, including AKS clusters. In some cases, policies might block the creation of Azure Container Storage pods and components. You can find more details on working with Azure Policy for Kubernetes by consulting [Azure Policy for Kubernetes](/azure/governance/policy/concepts/policy-for-kubernetes).
 
 To add the `acstor` namespace to the exclusion list, follow these steps:
 
@@ -50,9 +50,47 @@ To add the `acstor` namespace to the exclusion list, follow these steps:
 1. Create a policy that you suspect is blocking the installation of Azure Container Storage.
 1. Attempt to install Azure Container Storage in the AKS cluster.
 1. Check the logs for the gatekeeper-controller pod to confirm any policy violations.
-1. Add the `acstor` namespace to the exclusion list of the policy.
+1. Add the `acstor` namespace and `azure-extensions-usage-system` namespace to the exclusion list of the policy.
 1. Attempt to install Azure Container Storage in the AKS cluster again.
 
+### Can't install and enable Azure Container Storage in node pools with taints
+
+You may have configured [node taints](/azure/aks/use-node-taints) on the node pools to restrict pods from being scheduled on these node pools. When you install and enable Azure Container Storage on these noode pools, it will be blocked because the required pods can't be created in these node pools. The behavior applies to both the system node pool when installing and the user node pools when enabling.
+
+You can check the node taints with the following example:
+
+```bash
+$ az aks nodepool list -g $resourceGroup --cluster-name $clusterName --query "[].{PoolName:name, nodeTaints:nodeTaints}"
+[
+...
+  {
+    "PoolName": "nodepoolx",
+    "nodeTaints": [
+      "sku=gpu:NoSchedule"
+    ]
+  }
+]
+
+```
+
+You can remove these taints temporarily to unblock and configure them back after you install and enable successfully. You can go to Azure Portal > AKS cluster > Node pools, click your node pool, remove the taints in "Taints and labels
+" section. Or you can use the following command to remove taints and confirm the change.
+
+```bash
+$ az aks nodepool update -g $resourceGroup --cluster-name $clusterName --name $nodePoolName --node-taints ""
+$ az aks nodepool list -g $resourceGroup --cluster-name $clusterName --query "[].{PoolName:name, nodeTaints:nodeTaints}"
+[
+...
+  {
+    "PoolName": "nodepoolx",
+    "nodeTaints": null
+  }
+]
+
+```
+
+Retry the installing or enabling after you remove node taints successfully. After it's completed successfully, you can configure node taints back to resume the pod scheduling restaints.
+
 ### Can't set storage pool type to NVMe
 
 If you try to install Azure Container Storage with Ephemeral Disk, specifically with local NVMe on a cluster where the virtual machine (VM) SKU doesn't have NVMe drives, you get the following error message: *Cannot set --storage-pool-option as NVMe as none of the node pools can support ephemeral NVMe disk*.
@@ -65,15 +103,15 @@ To check the status of your storage pools, run `kubectl describe sp <storage-poo
 
 ### Error when trying to expand an Azure Disks storage pool
 
-If your existing storage pool is less than 4 TiB (4,096 GiB), you can only expand it up to 4,095 GiB. If you try to expand beyond that, the internal PVC will get an error message like "Only Disk CachingType 'None' is supported for disk with size greater than 4095 GB" or ""Disk 'xxx' of size 4096 GB (<=4096 GB) cannot be resized to 16384 GB (>4096 GB) while it is attached to a running VM. Please stop your VM or detach the disk and retry the operation."
+If your existing storage pool is less than 4 TiB (4,096 GiB), you can only expand it up to 4,095 GiB. If you try to expand beyond that, the internal PVC will get an error message like "Only Disk CachingType 'None' is supported for disk with size greater than 4095 GB" or "Disk 'xxx' of size 4096 GB (<=4096 GB) cannot be resized to 16384 GB (>4096 GB) while it is attached to a running VM. Please stop your VM or detach the disk and retry the operation."
 
 To avoid errors, don't attempt to expand your current storage pool beyond 4,095 GiB if it is initially smaller than 4 TiB (4,096 GiB). Storage pools larger than 4 TiB can be expanded up to the maximum storage capacity available.
 
 This limitation only applies when using `Premium_LRS`, `Standard_LRS`, `StandardSSD_LRS`, `Premium_ZRS`, and `StandardSSD_ZRS` Disk SKUs.
 
 ### Elastic SAN creation fails
 
-If you're trying to create an Elastic SAN storage pool, you might see the message *Azure Elastic SAN creation failed: Maximum possible number of Elastic SAN for the Subscription created already*. This means that you've reached the limit on the number of Elastic SAN resources that can be deployed in a region per subscription. You can check the limit here: [Elastic SAN scalability and performance targets](../elastic-san/elastic-san-scale-targets.md#elastic-san-scale-targets). Consider deleting any existing Elastic SAN resources on the subscription that are no longer being used, or try creating the storage pool in a different region.
+If you're trying to create an Elastic SAN storage pool, you might see the message *Azure Elastic SAN creation failed: Maximum possible number of Elastic SAN for the Subscription created already*. This means that you reach the limit on the number of Elastic SAN resources that can be deployed in a region per subscription. You can check the limit here: [Elastic SAN scalability and performance targets](../elastic-san/elastic-san-scale-targets.md#elastic-san-scale-targets). Consider deleting any existing Elastic SAN resources on the subscription that are no longer being used, or try creating the storage pool in a different region.
 
 ### No block devices found
 
@@ -93,12 +131,6 @@ When disabling a storage pool type via `az aks update --disable-azure-container-
 
 If you select Y, an automatic validation runs to ensure that there are no persistent volumes created from the storage pool. Selecting n bypasses this validation and disables the storage pool type, deleting any existing storage pools and potentially affecting your application.
 
-### Can't delete resource group containing AKS cluster
-
-If you created an Elastic SAN storage pool, you might not be able to delete the resource group in which your AKS cluster is located.
-
-To resolve this, sign in to the [Azure portal](https://portal.azure.com?azure-portal=true) and select **Resource groups**. Locate the resource group that AKS created (the resource group name starts with **MC_**). Select the SAN resource object within that resource group. Manually remove all volumes and volume groups. Then retry deleting the resource group that includes your AKS cluster.
-
 ## Troubleshoot volume issues
 
 ### Pod pending creation due to ephemeral volume size above available capacity
@@ -150,11 +182,11 @@ ephemeraldisk-temp-diskpool-xbtlj   75660001280   75031990272   628011008   5609
 
 In this example, the available capacity of temp disk for a single node is `75031990272` bytes or 69 GiB.
 
-Adjust the volume storage size below available capacity and re-deploy your pod. See [Deploy a pod with a generic ephemeral volume](use-container-storage-with-temp-ssd.md#3-deploy-a-pod-with-a-generic-ephemeral-volume).
+Adjust the volume storage size below available capacity and redeploy your pod. See [Deploy a pod with a generic ephemeral volume](use-container-storage-with-temp-ssd.md#3-deploy-a-pod-with-a-generic-ephemeral-volume).
 
 ### Volume fails to attach due to metadata store offline
 
-Azure Container Storage uses `etcd`, a distributed, reliable key-value store, to store and manage metadata of volumes to support volume orchestration operations. For high availability and resiliency, `etcd` runs in three pods. When there are less than two `etcd` instances running, Azure Container Storage will halt volume orchestration operations while still allowing data access to the volumes. Azure Container Storage automatically detects when an `etcd` instance is offline and recovers it. However, if you notice volume orchestration errors after restarting an AKS cluster, it's possible that an `etcd` instance failed to auto-recover. Follow the instructions in this section to determine the health status of the `etcd` instances.
+Azure Container Storage uses `etcd`, a distributed, reliable key-value store, to store and manage metadata of volumes to support volume orchestration operations. For high availability and resiliency, `etcd` runs in three pods. When there are less than two `etcd` instances running, Azure Container Storage will halt volume orchestration operations while still allowing data access to the volumes. Azure Container Storage automatically detects when an `etcd` instance is offline and recovers it. However, if you notice volume orchestration errors after restarting an AKS cluster, it's possible that an `etcd` instance failed to autorecover. Follow the instructions in this section to determine the health status of the `etcd` instances.
 
 Run the following command to get a list of pods.
 
@@ -175,7 +207,7 @@ Describe the pod:
 kubectl describe pod fiopod
 ```
 
-Typically, you'll see volume failure messages if the metadata store is offline. In this example, **fiopod** is in **ContainerCreating** status, and the **FailedAttachVolume** warning indicates that the creation is pending due to volume attach failure.
+Typically, you see volume failure messages if the metadata store is offline. In this example, **fiopod** is in **ContainerCreating** status, and the **FailedAttachVolume** warning indicates that the creation is pending due to volume attach failure.
 
 ```output
 Name:             fiopod 
@@ -205,7 +237,7 @@ etcd-azurecontainerstorage-phf92lmqml                            1/1     Running
 etcd-azurecontainerstorage-xznvwcgq4p                            1/1     Running   0               4d19h
 ```
 
-If fewer than two instances are shown in the Running state, you can conclude that the volume is failing to attach due to the metadata store being offline, and the automated recovery wasn't successful. If this is the case, file a support ticket with [Azure Support]( https://azure.microsoft.com/support/).
+If fewer than two instances are shown in the Running state, you can conclude that the volume is failing to attach due to the metadata store being offline, and the automated recovery wasn't successful. If so, file a support ticket with [Azure Support]( https://azure.microsoft.com/support/).
 
 ## See also