From 7b66411c594f6148172445e4ea9a9414b263157f Mon Sep 17 00:00:00 2001
From: Yasser Tahiri
Date: Thu, 12 Jun 2025 22:54:30 +0100
Subject: [PATCH 1/2] feat(docs): add runbook for Node Disk IO Saturation alert

---
 .../PrometheusNodeDiskIOSaturation.md         | 104 ++++++++++++++++++
 1 file changed, 104 insertions(+)
 create mode 100644 content/runbooks/prometheus/PrometheusNodeDiskIOSaturation.md

diff --git a/content/runbooks/prometheus/PrometheusNodeDiskIOSaturation.md b/content/runbooks/prometheus/PrometheusNodeDiskIOSaturation.md
new file mode 100644
index 0000000..5493a2d
--- /dev/null
+++ b/content/runbooks/prometheus/PrometheusNodeDiskIOSaturation.md
@@ -0,0 +1,104 @@
+---
+title: Node Disk IO Saturation Alert
+weight: 20
+---
+
+# NodeDiskIOSaturation
+
+## Alert Details
+
+- **Alert Name**: NodeDiskIOSaturation
+- **Severity**: Warning
+- **Component**: Node Exporter
+- **Namespace**: monitoring
+
+## Alert Description
+
+This alert fires when the disk IO queue (aqu-sz) on a specific device is high, indicating potential disk saturation. The alert triggers when the queue length has stayed above 10 for the last 30 minutes.
+
+## Alert Context
+
+The alert is based on metrics from the node-exporter pods running in the monitoring namespace, which report the disk IO queue length for all block devices on each node.
+
+## Investigation Steps
+
+1. **Verify Alert Details**
+   - Check the specific device mentioned in the alert (e.g., sdc)
+   - Note the current queue length value
+   - Identify the affected node(s)
+
+2. **Check Node Resources**
+
+   ```bash
+   # Get node status
+   kubectl describe node <node-name>
+
+   # Check node-exporter logs
+   kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-node-exporter
+   ```
+
+3. **Investigate Disk Performance**
+
+   ```bash
+   # SSH into the affected node
+   ssh <node-name>
+
+   # Check IO statistics
+   iostat -x 1
+
+   # Check disk queue length
+   cat /sys/block/<device>/queue/nr_requests
+
+   # Check IO wait
+   top
+   ```
+
+4. **Identify High IO Processes**
+
+   ```bash
+   # List processes with high IO
+   iotop
+
+   # Check IO statistics per process
+   pidstat -d 1
+   ```
+
+## Common Causes
+
+1. High disk IO from applications
+2. Insufficient disk performance for the workload
+3. Disk hardware issues
+4. Network storage performance issues
+5. Resource contention from other workloads
+
+## Resolution Steps
+
+1. **Short-term Mitigation**
+   - Identify and stop non-critical high IO processes
+   - Consider moving workloads to other nodes
+   - Increase the disk queue length (nr_requests) if appropriate
+
+2. **Long-term Solutions**
+   - Upgrade disk hardware if it consistently hits its limits
+   - Implement IO throttling for problematic workloads
+   - Consider using faster storage solutions
+   - Optimize application IO patterns
+   - Implement proper resource limits and requests
+
+3. **Preventive Measures**
+   - Monitor disk IO patterns
+   - Set up proper resource quotas
+   - Implement IO scheduling policies
+   - Run regular performance tests
+
+## Related Alerts
+
+- NodeDiskSpaceFillingUp
+- NodeDiskSpaceAlmostFull
+- NodeDiskSpaceFull
+
+## References
+
+- [Prometheus Node Exporter Documentation](https://github.com/prometheus/node_exporter)
+- [Kubernetes Node Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
+- [Linux IO Scheduler Documentation](https://www.kernel.org/doc/html/latest/block/iosched.html)
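
The runbook above states the trigger condition (queue length above 10 for 30 minutes) but the patch does not include the alerting rule or the query behind it. As a rough cross-check during investigation, the sketch below asks Prometheus for the per-device weighted IO time rate, which is the usual PromQL analogue of the iostat `aqu-sz` column; the in-cluster Prometheus URL, the metric name `node_disk_io_time_weighted_seconds_total`, and the `job="node-exporter"` label are assumptions about a typical kube-prometheus deployment, not something defined in this PR.

```bash
# Sketch, not the alert rule from this PR: report the current average IO queue
# depth per device as seen by Prometheus. Adjust PROM_URL to your environment.
PROM_URL="http://prometheus-operated.monitoring.svc:9090"

curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=rate(node_disk_io_time_weighted_seconds_total{job="node-exporter"}[5m])' |
  jq -r '.data.result[] | [.metric.instance, .metric.device, .value[1]] | @tsv'
```

Devices whose value stays above 10 line up with the alert condition described in the runbook and are the ones worth inspecting with `iostat`, `iotop`, and `pidstat` on the node itself.
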
From 1f7cfea56edfb57ebd0a57251d76c40f66391118 Mon Sep 17 00:00:00 2001
From: Yasser Tahiri
Date: Thu, 12 Jun 2025 22:57:10 +0100
Subject: [PATCH 2/2] :recycle: fix the directory of the document

---
 .../NodeDiskIOSaturation.md}                  | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename content/runbooks/{prometheus/PrometheusNodeDiskIOSaturation.md => node/NodeDiskIOSaturation.md} (100%)

diff --git a/content/runbooks/prometheus/PrometheusNodeDiskIOSaturation.md b/content/runbooks/node/NodeDiskIOSaturation.md
similarity index 100%
rename from content/runbooks/prometheus/PrometheusNodeDiskIOSaturation.md
rename to content/runbooks/node/NodeDiskIOSaturation.md
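
The short-term mitigation step in the runbook suggests moving workloads to other nodes. A minimal sketch of that with plain `kubectl` follows; the node name, namespace, and pod name are placeholders, and whether a given pod can be safely evicted depends on the workload, so treat this as an outline rather than the prescribed procedure.

```bash
# Placeholder for the node named in the alert.
NODE="<node-name>"

# Keep new pods from being scheduled onto the saturated node.
kubectl cordon "${NODE}"

# List the pods currently running on that node to spot IO-heavy candidates
# (cross-reference with the iotop/pidstat output from the investigation steps).
kubectl get pods --all-namespaces --field-selector "spec.nodeName=${NODE}" -o wide

# Deleting a controller-managed pod reschedules it onto another node while the
# original node is cordoned. Namespace and pod name are placeholders.
kubectl delete pod -n <namespace> <pod-name>

# Once IO pressure is back to normal, allow scheduling again.
kubectl uncordon "${NODE}"
```

Cordoning first and deleting individual pods is gentler than a full `kubectl drain`, which would evict everything on the node at once.
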