Commit 86ec9f2

Merge pull request #22 from jacobwolfaws/main
Adding troubleshooting doc
2 parents 1be56d7 + 668b166 commit 86ec9f2

2 files changed

+61
-2
lines changed

docs/README.md

Lines changed: 5 additions & 2 deletions
@@ -5,7 +5,10 @@
 
 The [Amazon File Cache](https://docs.aws.amazon.com/fsx/latest/FileCacheGuide/) Container Storage Interface (CSI) Driver provides a [CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md) interface used by container orchestrators to manage the lifecycle of Amazon file cache volumes.
 
-### CSI Specification Compability Matrix
+### Troubleshooting
+For help with troubleshooting, please refer to our [troubleshooting doc](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/docs/troubleshooting.md).
+
+### CSI Specification Compatibility Matrix
 | AWS File Cache CSI Driver \ CSI Version | v1.x.x |
 |-----------------------------------------|--------|
 | v0.1.0 | yes |
@@ -21,7 +24,7 @@ The following CSI interfaces are implemented:
 
 The following sections are Kubernetes-specific. If you are a Kubernetes user, use the following for driver features, installation steps and examples.
 
-### Kubernetes Version Compability Matrix
+### Kubernetes Version Compatibility Matrix
 | AWS File Cache CSI Driver \ Kubernetes Version | v1.22+ |
 |------------------------------------------------|--------|
 | v0.1.0 | yes |

docs/troubleshooting.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
## Troubleshooting CSI Driver (Common Issues)

If you’re experiencing issues with the File Cache CSI Driver, first make sure that you’re using the [latest available Amazon File Cache CSI Driver](https://github.com/kubernetes-sigs/aws-file-cache-csi-driver#csi-specification-compatibility-matrix).
To check which version of the CSI Driver is running, look at the `fsx-plugin` container image version in your cluster’s `file-cache-csi-node` or `file-cache-csi-controller` pods, either in the Amazon EKS console or by running `kubectl describe pod <pod-name>`.

If you’re not using the latest image, upgrade the CSI Driver image used by the pods in your cluster and check whether the issue persists.
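
As a rough sketch of the version check described above, the shell function below lists each File Cache CSI pod together with its container names and images. The default namespace `kube-system` and the function name `list_driver_images` are assumptions for illustration; adjust them to match your installation.

```shell
# List File Cache CSI driver pods with their container names and images.
# Assumption: the driver runs in kube-system unless a namespace is passed.
list_driver_images() {
  ns="${1:-kube-system}"
  # One line per pod: <pod-name><TAB><container>=<image> ...
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}={.image}{" "}{end}{"\n"}{end}' \
    | grep 'file-cache-csi'
}
```

Calling `list_driver_images` (or `list_driver_images <namespace>`) prints one line per matching pod; the `fsx-plugin=` entry on each line shows the driver image and its tag.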

### Troubleshooting Issues

For common File Cache issues, refer to the [Amazon File Cache troubleshooting guide](https://docs.aws.amazon.com/fsx/latest/FileCacheGuide/troubleshooting.html), which includes mitigations for common problems with Amazon File Cache.

#### Issue: Pod is stuck in ContainerCreating when trying to mount a volume.

##### Characteristics:

1. The underlying file system has a large number of files.
2. When calling `kubectl describe pod <pod-name>`, you see an error message similar to this:
```
Warning FailedMount kubelet Unable to attach or mount volumes: unmounted volumes=[fsx-volume-name], unattached volumes=[fsx-volume-name]: timed out waiting for the condition
```

##### Likely Cause:
Volume ownership is being set recursively on every file in the volume, which prevents the pod from mounting the volume for an extended period of time. See https://github.com/kubernetes/kubernetes/issues/69699.

##### Mitigation:
[Per Kubernetes documentation](https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/#allow-users-to-skip-recursive-permission-changes-on-mount): “When configuring a pod’s security context, set fsGroupChangePolicy to "OnRootMismatch" so if the root of the volume already has the correct permissions, the recursive permission change can be skipped.” After setting this policy, terminate the pod stuck in ContainerCreating and drain the node. When the pod is rescheduled onto a new node, the volume should mount without the delay, provided the volume root already has the correct permissions.

For more information on configuring securityContext, see https://kubernetes.io/docs/tasks/configure-pod-container/security-context/.
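
The following is a minimal sketch of a pod manifest with this policy set, wrapped in a shell function so it can be piped to `kubectl apply`. The pod name, image, `fsGroup` value, mount path, and claim name are hypothetical placeholders, not values taken from the driver's documentation.

```shell
# Print a pod manifest that sets fsGroupChangePolicy to OnRootMismatch.
# All names and values below (pod, image, fsGroup, claim) are hypothetical.
print_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: file-cache-app
spec:
  securityContext:
    fsGroup: 1000
    # Skip the recursive ownership change when the volume root already matches.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: cache
          mountPath: /data
  volumes:
    - name: cache
      persistentVolumeClaim:
        claimName: file-cache-claim
EOF
}
```

For example, `print_pod_manifest | kubectl apply -f -` would create the pod. In a Deployment or StatefulSet, the same `securityContext` block goes under the pod template's `spec`.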


#### Issue: Pods fail to mount file system with the following error:

```
mount.lustre: mount cache_dns_name@tcp:/mountname at /fsx failed: Input/output error
Is the MGS running?
```

##### Likely Cause:
Amazon File Cache rejects packets where the source port is neither 988 nor in the range 1018–1023. It may be that kube-proxy is redirecting the packet to a different port.

##### Mitigation:
Run `netstat -tlpna` to check whether any TCP connections are established with a source port outside the range 1018–1023. If there are such connections, enable SNAT to avoid redirecting packets to a different port. For more information, see https://docs.aws.amazon.com/eks/latest/userguide/external-snat.html.
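
To make the port check less error-prone, the sketch below filters `netstat` output for established connections whose source port is outside the accepted set. It assumes the usual Linux `netstat` column layout (local address in column 4, remote address in column 5, state in column 6) and identifies cache traffic by a remote port of 988; the function name is hypothetical.

```shell
# Read `netstat -tlpna` output on stdin and print established TCP
# connections to remote port 988 whose local (source) port is neither
# 988 nor in the range 1018-1023.
# Assumption: columns are Proto Recv-Q Send-Q Local Foreign State ...
find_bad_lustre_source_ports() {
  awk '
    $1 ~ /^tcp/ && $6 == "ESTABLISHED" {
      n = split($4, lparts, ":"); lport = lparts[n] + 0
      m = split($5, rparts, ":"); rport = rparts[m] + 0
      if (rport == 988 && lport != 988 && (lport < 1018 || lport > 1023))
        print
    }'
}
```

On a worker node, `sudo netstat -tlpna | find_bad_lustre_source_ports` prints nothing when every matching connection uses an accepted source port; any line it does print is a connection the cache is likely to reject.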


#### Issue: Pods are stuck in terminating when the [cluster autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is scaling down resources.

##### Likely Cause:
A [July 2021 autoscaler change](https://github.com/kubernetes/autoscaler/pull/4172) introduced a [known issue](https://github.com/kubernetes/autoscaler/issues/5240) in which DaemonSet pods are evicted at the same time as non-DaemonSet pods. This creates a race condition: if the DaemonSet pods are evicted before the non-DaemonSet pods, the non-DaemonSet pods cannot unmount gracefully and get stuck in terminating.

##### Mitigation:
Annotate the DaemonSet pods with `"cluster-autoscaler.kubernetes.io/enable-ds-eviction": "false"`, which prevents them from being evicted on resource scale-down and allows the non-DaemonSet pods to unmount properly. To do this, add the annotation to the [pod spec in the DaemonSet manifest](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-enabledisable-eviction-for-a-specific-daemonset).
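
One possible way to apply the annotation without editing the manifest by hand is a merge patch against the DaemonSet's pod template. The sketch below only prints the patch document; the function name is hypothetical, and the namespace and DaemonSet name in the usage note are assumptions.

```shell
# Print a JSON merge patch that adds the annotation to the pod template
# of a DaemonSet, so that every pod it creates carries the annotation.
ds_eviction_patch() {
  printf '%s\n' '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/enable-ds-eviction":"false"}}}}}'
}
```

Assuming the node DaemonSet is named `file-cache-csi-node` and lives in `kube-system`, it could be applied with `kubectl -n kube-system patch daemonset file-cache-csi-node --type merge -p "$(ds_eviction_patch)"`. Changing the pod template typically triggers a rolling restart of the DaemonSet pods.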