
Commit 4b2f882

Merge pull request #291370 from fhryo-msft/patch-41
Update troubleshoot-container-storage.md
2 parents 4ce68c6 + 94fc43d commit 4b2f882

1 file changed: 76 additions, 14 deletions


articles/storage/container-storage/troubleshoot-container-storage.md

@@ -16,7 +16,7 @@ ms.topic: how-to
### Azure Container Storage fails to install due to missing configuration

-After running `az aks create`, you might see the message *Azure Container Storage failed to install. AKS cluster is created. Please run `az aks update` along with `--enable-azure-container-storage` to enable Azure Container Storage*.
+After running `az aks create`, you might see the message *Azure Container Storage failed to install. AKS cluster is created. Run `az aks update` along with `--enable-azure-container-storage` to enable Azure Container Storage*.

This message means that Azure Container Storage wasn't installed, but your AKS (Azure Kubernetes Service) cluster was created properly.
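The remediation can be sketched as follows (cluster and resource-group names are placeholders; the storage pool type `azureDisk` is an assumed example, not taken from this article):

```shell
# Re-run the update to enable Azure Container Storage on the existing cluster.
# <cluster-name> and <resource-group> are placeholders; azureDisk is one
# example storage pool type -- choose the type your workload needs.
az aks update \
  -n <cluster-name> \
  -g <resource-group> \
  --enable-azure-container-storage azureDisk
```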

@@ -28,7 +28,7 @@ az aks update -n <cluster-name> -g <resource-group> --enable-azure-container-sto
### Azure Container Storage fails to install due to Azure Policy restrictions

-Azure Container Storage might fail to install if Azure Policy restrictions are in place. Specifically, Azure Container Storage relies on privileged containers, which can be blocked by Azure Policy. When they are blocked, the installation of Azure Container Storage might time out or fail, and you might see errors in the `gatekeeper-controller` logs such as:
+Azure Container Storage might fail to install if Azure Policy restrictions are in place. Specifically, Azure Container Storage relies on privileged containers. You may configure Azure Policy to block privileged containers. When they're blocked, the installation of Azure Container Storage might time out or fail, and you might see errors in the `gatekeeper-controller` logs such as:

```output
$ kubectl logs -n gatekeeper-system deployment/gatekeeper-controller
@@ -55,7 +55,7 @@ To add the `acstor` namespace to the exclusion list, follow these steps:
### Can't install and enable Azure Container Storage in node pools with taints

-You may have configured [node taints](/azure/aks/use-node-taints) on the node pools to restrict pods from being scheduled on these node pools. When you install and enable Azure Container Storage on these noode pools, it will be blocked because the required pods can't be created in these node pools. The behavior applies to both the system node pool when installing and the user node pools when enabling.
+You might configure [node taints](/azure/aks/use-node-taints) on the node pools to restrict pods from being scheduled on these node pools. Installing and enabling Azure Container Storage on these node pools might be blocked because the required pods can't be created in these node pools. The behavior applies to both the system node pool when installing and the user node pools when enabling.

You can check the node taints with the following example:
@@ -89,7 +89,7 @@ $ az aks nodepool list -g $resourceGroup --cluster-name $clusterName --query "[]
```

-Retry the installing or enabling after you remove node taints successfully. After it's completed successfully, you can configure node taints back to resume the pod scheduling restaints.
+Retry installing or enabling after you remove the node taints. After it completes successfully, you can configure the node taints again to restore the pod scheduling restrictions.
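Clearing and later restoring the taints can be sketched with `az aks nodepool update` (all names are placeholders; the taint value shown is an example, not from this article):

```shell
# Remove all taints from the node pool so the required pods can schedule.
# <resource-group>, <cluster-name>, and <nodepool-name> are placeholders.
az aks nodepool update \
  -g <resource-group> \
  --cluster-name <cluster-name> \
  -n <nodepool-name> \
  --node-taints ""

# ... install or enable Azure Container Storage here ...

# Restore the original taint afterward (example taint value).
az aks nodepool update \
  -g <resource-group> \
  --cluster-name <cluster-name> \
  -n <nodepool-name> \
  --node-taints "sku=gpu:NoSchedule"
```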
### Can't set storage pool type to NVMe

@@ -101,11 +101,73 @@ To remediate, create a node pool with a VM SKU that has NVMe drives and try agai
To check the status of your storage pools, run `kubectl describe sp <storage-pool-name> -n acstor`. Here are some issues you might encounter.

### Ephemeral storage pool doesn’t claim the capacity when the ephemeral disks are used by other daemonsets

Enabling an ephemeral storage pool on a node pool with temp SSD or local NVMe disks might not claim capacity from these disks if other daemonsets are using them.

Follow these steps to enable Azure Container Storage to manage these local disks exclusively:

1. Run the following command to see the capacity claimed by the ephemeral storage pool:

    ```bash
    $ kubectl get sp -A
    NAMESPACE   NAME                 CAPACITY   AVAILABLE   USED   RESERVED   READY   AGE
    acstor      ephemeraldisk-nvme   0          0           0      0          False   82s
    ```

    This example shows zero capacity claimed by the `ephemeraldisk-nvme` storage pool.

1. Run the following command to confirm the unclaimed state of these local block devices and check the existing file system on the disks:

    ```bash
    $ kubectl get bd -A
    NAMESPACE   NAME                                           NODENAME                               SIZE            CLAIMSTATE   STATUS   AGE
    acstor      blockdevice-1f7ad8fa32a448eb9768ad8e261312ff   aks-nodepoolnvme-38618677-vmss000001   1920383410176   Unclaimed    Active   22m
    acstor      blockdevice-9c8096fc47cc2b41a2ed07ec17a83527   aks-nodepoolnvme-38618677-vmss000000   1920383410176   Unclaimed    Active   23m
    $ kubectl describe bd -n acstor blockdevice-1f7ad8fa32a448eb9768ad8e261312ff
    Name:         blockdevice-1f7ad8fa32a448eb9768ad8e261312ff

    Filesystem:
      Fs Type:  ext4

    ```

    This example shows that the block devices are in `Unclaimed` status and there's an existing file system on the disk.

1. Confirm that you want to use Azure Container Storage to manage the local data disks exclusively before proceeding.

1. Stop and remove the daemonsets or components that manage the local data disks.

1. Log in to each node that has local data disks.

1. Remove the existing file systems from all local data disks.

1. Restart the `ndm` daemonset to discover the unused local data disks:

    ```bash
    $ kubectl rollout restart daemonset -l app=ndm -n acstor
    daemonset.apps/azurecontainerstorage-ndm restarted
    $ kubectl rollout status daemonset -l app=ndm -n acstor --watch

    daemon set "azurecontainerstorage-ndm" successfully rolled out
    ```

1. Wait a few seconds and check whether the ephemeral storage pool claims the capacity from the local data disks:

    ```bash
    $ kubectl wait -n acstor sp --all --for condition=ready
    storagepool.containerstorage.azure.com/ephemeraldisk-nvme condition met
    $ kubectl get bd -A
    NAMESPACE   NAME                                           NODENAME                               SIZE            CLAIMSTATE   STATUS   AGE
    acstor      blockdevice-1f7ad8fa32a448eb9768ad8e261312ff   aks-nodepoolnvme-38618677-vmss000001   1920383410176   Claimed      Active   4d16h
    acstor      blockdevice-9c8096fc47cc2b41a2ed07ec17a83527   aks-nodepoolnvme-38618677-vmss000000   1920383410176   Claimed      Active   4d16h
    $ kubectl get sp -A
    NAMESPACE   NAME                 CAPACITY        AVAILABLE       USED          RESERVED      READY   AGE
    acstor      ephemeraldisk-nvme   3840766820352   3812058578944   28708241408   26832871424   True    4d16h
    ```

    This example shows the `ephemeraldisk-nvme` storage pool successfully claiming the capacity from local NVMe disks on the nodes.
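The file-system removal step above can be sketched as follows. This is a destructive operation on each node's local data disk; the device path `/dev/nvme0n1` is an example only, so verify it before running:

```shell
# DANGER: erases all file-system signatures on the device.
# /dev/nvme0n1 is an example path; confirm with lsblk that it is the
# unused local data disk you intend to hand over to Azure Container Storage.
lsblk -f /dev/nvme0n1          # inspect the current file system first
sudo wipefs --all /dev/nvme0n1 # remove the existing file-system signatures
```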
### Error when trying to expand an Azure Disks storage pool

-If your existing storage pool is less than 4 TiB (4,096 GiB), you can only expand it up to 4,095 GiB. If you try to expand beyond that, the internal PVC will get an error message like "Only Disk CachingType 'None' is supported for disk with size greater than 4095 GB" or "Disk 'xxx' of size 4096 GB (<=4096 GB) cannot be resized to 16384 GB (>4096 GB) while it is attached to a running VM. Please stop your VM or detach the disk and retry the operation."
+If your existing storage pool is less than 4 TiB (4,096 GiB), you can only expand it up to 4,095 GiB. If you try to expand beyond the limit, the internal PVC shows an error message about disk size or caching type limitations, such as a prompt to stop your VM or detach the disk and retry the operation.

-To avoid errors, don't attempt to expand your current storage pool beyond 4,095 GiB if it is initially smaller than 4 TiB (4,096 GiB). Storage pools larger than 4 TiB can be expanded up to the maximum storage capacity available.
+To avoid errors, don't attempt to expand your current storage pool beyond 4,095 GiB if it's initially smaller than 4 TiB (4,096 GiB). Storage pools larger than 4 TiB can be expanded up to the maximum storage capacity available.

This limitation only applies when using `Premium_LRS`, `Standard_LRS`, `StandardSSD_LRS`, `Premium_ZRS`, and `StandardSSD_ZRS` Disk SKUs.

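If an expansion gets stuck, one generic way to surface the underlying error is to inspect the internal PVC's events (the PVC name is a placeholder; the `VolumeResizeFailed` event reason comes from the Kubernetes external-resizer and is an assumption about your setup):

```shell
# Inspect events on the internal PVC to see the resize error message.
# <pvc-name> is a placeholder; internal PVCs live in the acstor namespace.
kubectl describe pvc <pvc-name> -n acstor
kubectl get events -n acstor --field-selector reason=VolumeResizeFailed
```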
@@ -121,21 +183,21 @@ To remediate, create a node pool with a VM SKU that has NVMe drives and try agai
### Storage pool type already enabled

-If you try to enable a storage pool type that's already enabled, you get the following message: *Invalid `--enable-azure-container-storage` value. Azure Container Storage is already enabled for storage pool type `<storage-pool-type>` in the cluster*. You can check if you have any existing storage pools created by running `kubectl get sp -n acstor`.
+If you try to enable a storage pool type that already exists, you get the following message: *Invalid `--enable-azure-container-storage` value. Azure Container Storage is already enabled for storage pool type `<storage-pool-type>` in the cluster*. You can check if you have any existing storage pools created by running `kubectl get sp -n acstor`.

### Disabling a storage pool type

When disabling a storage pool type via `az aks update --disable-azure-container-storage <storage-pool-type>` or uninstalling Azure Container Storage via `az aks update --disable-azure-container-storage all`, if there's an existing storage pool of that type, you get the following message:

-*Disabling Azure Container Storage for storage pool type `<storage-pool-type>` will forcefully delete all the storage pools of the same type and affect the applications using these storage pools. Forceful deletion of storage pools can also lead to leaking of storage resources which are being consumed. Do you want to validate whether any of the storage pools of type `<storage-pool-type>` are being used before disabling Azure Container Storage? (Y/n)*
+*Disabling Azure Container Storage for storage pool type `<storage-pool-type>` forcefully deletes all the storage pools of the same type and affects the applications using these storage pools. Forceful deletion of storage pools can also lead to leaking of storage resources which are being consumed. Do you want to validate whether any of the storage pools of type `<storage-pool-type>` are being used before disabling Azure Container Storage? (Y/n)*

If you select Y, an automatic validation runs to ensure that there are no persistent volumes created from the storage pool. Selecting n bypasses this validation and disables the storage pool type, deleting any existing storage pools and potentially affecting your application.

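Before answering the prompt, you can also check for affected volumes yourself. A sketch that lists persistent volumes with their storage classes (the `acstor` name filter is an assumption about how your storage classes are named):

```shell
# List persistent volumes and their storage classes, then filter for ones
# that look like Azure Container Storage classes (name filter is a guess).
kubectl get pv -o wide
kubectl get pv | grep acstor
```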
## Troubleshoot volume issues

-### Pod pending creation due to ephemeral volume size above available capacity
+### Pod pending creation due to ephemeral volume size beyond available capacity

-An ephemeral volume is allocated on a single node. When you configure the size of ephemeral volumes for your pods, the size should be less than the available capacity of a single node's ephemeral disk. Otherwise, the pod creation will be in pending status.
+An ephemeral volume is allocated on a single node. When you configure the size of ephemeral volumes for your pods, the size should be less than the available capacity of a single node's ephemeral disk. Otherwise, the pod creation remains in pending status.

Use the following command to check if your pod creation is in pending status.

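A generic way to list pending pods and inspect why one is stuck (not necessarily the exact command the article shows, which is elided from this diff; the pod name is a placeholder):

```shell
# List pods whose phase is Pending, then check the scheduling events
# of one of them (<pod-name> is a placeholder).
kubectl get pods --field-selector=status.phase=Pending
kubectl describe pod <pod-name>
```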
@@ -182,19 +244,19 @@ ephemeraldisk-temp-diskpool-xbtlj 75660001280 75031990272 628011008 5609
In this example, the available capacity of temp disk for a single node is `75031990272` bytes or 69 GiB.

-Adjust the volume storage size below available capacity and redeploy your pod. See [Deploy a pod with a generic ephemeral volume](use-container-storage-with-temp-ssd.md#3-deploy-a-pod-with-a-generic-ephemeral-volume).
+Adjust the volume storage size to less than the available capacity and redeploy your pod. See [Deploy a pod with a generic ephemeral volume](use-container-storage-with-temp-ssd.md#3-deploy-a-pod-with-a-generic-ephemeral-volume).

### Volume fails to attach due to metadata store offline

-Azure Container Storage uses `etcd`, a distributed, reliable key-value store, to store and manage metadata of volumes to support volume orchestration operations. For high availability and resiliency, `etcd` runs in three pods. When there are less than two `etcd` instances running, Azure Container Storage will halt volume orchestration operations while still allowing data access to the volumes. Azure Container Storage automatically detects when an `etcd` instance is offline and recovers it. However, if you notice volume orchestration errors after restarting an AKS cluster, it's possible that an `etcd` instance failed to autorecover. Follow the instructions in this section to determine the health status of the `etcd` instances.
+Azure Container Storage uses `etcd`, a distributed, reliable key-value store, to store and manage metadata of volumes to support volume orchestration operations. For high availability and resiliency, `etcd` runs in three pods. When fewer than two `etcd` instances are running, Azure Container Storage halts volume orchestration operations while still allowing data access to the volumes. Azure Container Storage automatically detects when an `etcd` instance is offline and recovers it. However, if you notice volume orchestration errors after restarting an AKS cluster, it's possible that an `etcd` instance failed to autorecover. Follow the instructions in this section to determine the health status of the `etcd` instances.

Run the following command to get a list of pods.

```azurecli-interactive
kubectl get pods
```

-You'll see output similar to the following.
+You might see output similar to the following.

```output
NAME READY STATUS RESTARTS AGE
@@ -237,7 +299,7 @@ etcd-azurecontainerstorage-phf92lmqml 1/1 Running
etcd-azurecontainerstorage-xznvwcgq4p 1/1 Running 0 4d19h
```

-If fewer than two instances are shown in the Running state, you can conclude that the volume is failing to attach due to the metadata store being offline, and the automated recovery wasn't successful. If so, file a support ticket with [Azure Support](https://azure.microsoft.com/support/).
+If fewer than two instances are running, the volume isn't attaching because the metadata store is offline, and automated recovery failed. In that case, file a support ticket with [Azure Support](https://azure.microsoft.com/support/).

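The check above can be sketched as a one-liner (the pod name prefix is taken from the output shown; a healthy cluster should report 3):

```shell
# Count the metadata-store pods currently in Running state.
kubectl get pods | grep etcd-azurecontainerstorage | grep -c Running
```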
## See also
