Commit ca780be

Merge pull request #4725 from kwilczynski/feature/update-4191-with-failure-scenario
KEP-4191: Split Image Filesystem add failure scenario
2 parents b493f27 + c5f993d commit ca780be

File tree

1 file changed, +6 -0 lines changed
  • keps/sig-node/4191-split-image-filesystem

keps/sig-node/4191-split-image-filesystem/README.md

Lines changed: 6 additions & 0 deletions
@@ -927,6 +927,12 @@ For each of them, fill in the following information by copying the below templat
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->

+- Pods do not start correctly
+- Detection: The user notices that the desired pods are not starting correctly, and their status indicates an error related to image pull failures, which can then be traced to the Split Image Filesystem feature.
+- Mitigations: The Split Image Filesystem feature can be disabled as a mitigation step. However, this is not without side effects: any container images downloaded previously would have to be downloaded again. Further investigation is therefore recommended before deciding to disable the feature. The user should also ensure that, if the feature is disabled, enough disk space is available at the location to which the ContainerFs filesystem currently points. A restart of kubelet is required to disable the feature.
+- Diagnostics: Kubernetes cluster events and specific pod statuses report image pull failures caused by problems such as filesystem access permissions, storage volume issues, or mount point issues, where none of the reported issues relate to disk space utilisation, which would otherwise trigger pod eviction. Reviewing CRI and kubelet service logs can help to determine the root cause. Operating system logs can also be reviewed and correlated with events and errors found in the service logs.
+- Testing: A set of end-to-end tests aims to cover this scenario.
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?

 The operator should ensure that:

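The Detection and Diagnostics items added in this commit rely on pod statuses that report image pull failures. As a minimal sketch of that first check, the Python snippet below scans a pod object shaped like the output of `kubectl get pod <name> -o json` and lists containers waiting with the reasons `ErrImagePull` or `ImagePullBackOff`. The function name, the choice of reasons, and the sample pod are assumptions made for illustration and are not part of the KEP. A match only shows that an image pull failed. Tracing it to the Split Image Filesystem feature still requires the CRI, kubelet, and operating system logs described above.

```python
"""Sketch: flag containers whose status reports an image pull failure."""

# Waiting reasons commonly reported when an image pull fails.
# Assumption: only these two reasons are of interest for this sketch.
IMAGE_PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def image_pull_failures(pod):
    """Return (container name, reason, message) tuples for every container
    in the pod that is waiting because of an image pull failure.

    `pod` is a dict shaped like the output of `kubectl get pod <name> -o json`.
    """
    status = pod.get("status") or {}
    container_statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    failures = []
    for container_status in container_statuses:
        waiting = (container_status.get("state") or {}).get("waiting")
        if waiting and waiting.get("reason") in IMAGE_PULL_FAILURE_REASONS:
            failures.append(
                (
                    container_status.get("name"),
                    waiting["reason"],
                    waiting.get("message", ""),
                )
            )
    return failures


if __name__ == "__main__":
    # Hypothetical pod, trimmed to the fields this sketch reads.
    example_pod = {
        "status": {
            "containerStatuses": [
                {
                    "name": "app",
                    "state": {
                        "waiting": {
                            "reason": "ImagePullBackOff",
                            "message": "hypothetical message",
                        }
                    },
                },
                {"name": "sidecar", "state": {"running": {}}},
            ]
        }
    }
    for name, reason, message in image_pull_failures(example_pod):
        print(f"{name}: {reason}: {message}")
```

An empty result for a pod that still fails to start suggests the problem lies elsewhere, for example in disk space pressure and eviction rather than in image pulls.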