Commit 1f91ba5

Author: Krzysztof Wilczyński

Add missing troubleshooting steps

Signed-off-by: Krzysztof Wilczyński <[email protected]>
1 parent 220df51 commit 1f91ba5

keps/sig-node/4191-split-image-filesystem/README.md

Lines changed: 40 additions & 0 deletions
@@ -929,6 +929,46 @@ For each of them, fill in the following information by copying the below templat
###### What steps should be taken if SLOs are not being met to determine the problem?

The operator should ensure that:

- The underlying node is not under high load from CPU utilisation, memory pressure, or storage volume latency (with a focus on I/O wait times)
- There is sufficient disk space available on the filesystem or volume that the image filesystem uses to store data
- There are sufficient free inodes on the provisioned filesystem where the image filesystem will store data, especially if the filesystem does not support dynamic inode allocation
- The volume, if backed by a local block device or network-attached storage, has been made available to the image filesystem for storing data
- The CRI, container runtimes, and kubelet have access to the location on the filesystem or the volume (block device) where the image filesystem will store data
- The system user, if the CRI, container runtimes, or kubelet have been configured to run as a system user other than a privileged one such as root, has access to the filesystem location or volume where the image filesystem will store data
- The node components, such as the CRI, container runtimes, and kubelet, are up and running, and their service logs are free from errors that might otherwise impact or degrade any of these components
- The CRI, container runtimes, and kubelet service logs are free from error reports about the configured ContainerFs, ImageFs, or any other configured filesystem locations or storage volumes

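As an illustration only, the disk space, inode, and access checks from the list above could be scripted roughly as follows. This is a minimal sketch using the Python standard library on a Unix-like node; the function name `check_image_filesystem` and both thresholds are assumptions made for the example and are not part of this KEP. It checks access for the user running the script, so it would need to run as the same user as the component being investigated.

```python
import os
import shutil


def check_image_filesystem(path, min_free_bytes=1 << 30, min_free_inodes=10_000):
    """Return a list of human-readable problems found for `path`.

    The default thresholds are illustrative assumptions, not values from the KEP.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []

    # Free disk space on the filesystem that holds `path`.
    usage = shutil.disk_usage(path)
    if usage.free < min_free_bytes:
        problems.append(f"low disk space: {usage.free} bytes free")

    # Free inodes. Filesystems that allocate inodes dynamically may report
    # a total of 0, in which case the inode check is skipped.
    stats = os.statvfs(path)
    if stats.f_files > 0 and stats.f_favail < min_free_inodes:
        problems.append(f"low inode count: {stats.f_favail} inodes available")

    # Whether the current user can read, write, and traverse the location.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"current user lacks read/write/execute access to {path}")

    return problems
```

Calling `check_image_filesystem(path)` with the location configured for the image filesystem returns an empty list when none of these checks fail. It does not replace reviewing the service logs.
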
Additionally, the operator should confirm that the necessary CRI and kubelet configuration has been deployed
correctly and points to the correct path of the filesystem location where the image filesystem will store data.

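One way to sanity-check the configured paths is sketched below, assuming the intent is for the image and container locations to sit on different filesystems. The helper names are hypothetical. Comparing device IDs is only a heuristic: bind mounts of the same filesystem share a device ID, and some filesystems report distinct IDs for subvolumes.

```python
import os


def describe_location(path):
    """Summarise facts about `path` that help confirm a storage configuration."""
    exists = os.path.isdir(path)
    return {
        "path": path,
        "exists": exists,
        # True when `path` is itself a mount point rather than a plain directory.
        "is_mount_point": os.path.ismount(path) if exists else False,
        # Device ID of the filesystem that holds `path`.
        "device": os.stat(path).st_dev if exists else None,
    }


def on_separate_filesystems(path_a, path_b):
    """Return True when both paths exist and report different device IDs."""
    a, b = describe_location(path_a), describe_location(path_b)
    if not (a["exists"] and b["exists"]):
        return False
    return a["device"] != b["device"]
```

The paths to pass in are whichever locations the CRI and kubelet configuration on the node actually reference. The sketch does not read that configuration itself.
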
While troubleshooting issues potentially related to the Split Image Filesystem feature, it's best to focus on
the following areas:

- Current CPU and memory utilisation on the underlying node
- Storage volumes, disk space availability, and inode capacity
- I/O wait times, read and write queue depths, and latency for the storage volumes
- Any expected mount points, whether bind mounts or otherwise
- Access permission issues
- SELinux, AppArmor, or POSIX ACL configuration
- The kernel message buffer (dmesg)
- Operating system logs
- Logs of specific services, such as the CRI, container runtimes, and kubelet
- Kubernetes cluster events, with a focus on evictions of pods from affected nodes
- The status of any relevant pods or workloads
- Kubernetes cluster health, with a focus on the control plane and any affected nodes
- Monitoring and alerting systems or services, with a focus on recent and historic events (past 24 hours or so)

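For the I/O wait item in the list above, a rough node-level figure can be derived from two samples of `/proc/stat` on a Linux node. The sketch below is an assumption-laden illustration rather than a recommended tool: the function names are hypothetical, and it only covers the aggregate `cpu` line, not per-device queue depths or latency.

```python
def parse_cpu_times(proc_stat_text):
    """Return the counters from the aggregate `cpu` line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before_text, after_text):
    """Fraction of CPU time spent in iowait between two /proc/stat samples."""
    before = parse_cpu_times(before_text)
    after = parse_cpu_times(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total
```

In practice the two inputs would be the contents of `/proc/stat` read a few seconds apart. A persistently high fraction suggests looking more closely at the storage volumes.
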
If the Kubernetes cluster has an observability solution, it is useful to review the collected usage
metrics so that any problems found can be correlated with events and usage data from the last 24
hours or so.

For cloud-based deployments, it is prudent to check any available monitoring dashboards for the node
and for each relevant storage volume, and to ensure that enough IOPS capacity is provisioned and available, that
the correct storage type has been provisioned, and that burst capacity for IOPS and throughput has not been
depleted, should the storage volume support such features.

## Implementation History

- Initial Draft (September 12th 2023)
