---
title: Node Disk IO Saturation Alert
weight: 20
---

# NodeDiskIOSaturation

## Alert Details

- **Alert Name**: NodeDiskIOSaturation
- **Severity**: Warning
- **Component**: Node Exporter
- **Namespace**: monitoring

## Alert Description

This alert fires when the disk IO queue size (`aqu-sz`) is high on a specific device, indicating potential disk saturation. The alert triggers when the average queue length has stayed above 10 for the last 30 minutes.
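
The exact rule expression depends on your monitoring stack. In the upstream node-exporter mixin the alert is based on `node_disk_io_time_weighted_seconds_total`, whose per-second rate approximates the average queue size. As a rough check (assuming Prometheus is reachable on `localhost:9090`, e.g. via `kubectl port-forward`, and that `jq` is installed), you can evaluate the same kind of expression manually:

```bash
# Forward the Prometheus API locally; the service name and namespace are assumptions,
# adjust them to match your monitoring stack.
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &

# Evaluate the queue-size expression the alert is typically derived from.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_disk_io_time_weighted_seconds_total{device!=""}[5m]) > 10' | jq .
```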

## Alert Context

The alert is evaluated by Prometheus from metrics exposed by the node-exporter pods running in the monitoring namespace. The underlying metric tracks the disk IO queue length for every block device on the node.
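
To confirm the exporters themselves are healthy before digging deeper (the label selector below is the same one used in the investigation steps):

```bash
# List the node-exporter pods and the nodes they run on
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-node-exporter -o wide
```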

## Investigation Steps

1. **Verify Alert Details**
- Check the specific device mentioned in the alert (e.g., sdc)
- Note the current queue length value
- Identify the affected node(s)
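
One way to pull these details without opening the UI is to query the Prometheus alerts endpoint (assuming a local port-forward on `localhost:9090` and `jq`, as above):

```bash
# Show currently firing NodeDiskIOSaturation alerts with their device and instance labels
curl -s 'http://localhost:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "NodeDiskIOSaturation") | {labels, state, value}'
```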

2. **Check Node Resources**

```bash
# Get node status
kubectl describe node <node-name>

# Check node-exporter logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-node-exporter
```

3. **Investigate Disk Performance**

```bash
# SSH into the affected node
ssh <node-ip>

# Check IO statistics (watch the aqu-sz and %util columns)
iostat -x 1

# Check the device's maximum request queue depth
cat /sys/block/<device>/queue/nr_requests

# Check CPU iowait (the "wa" value in the CPU line)
top
```

4. **Identify High IO Processes**

```bash
# List processes with high IO
iotop

# Check IO statistics per process
pidstat -d 1
```
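
If a heavy writer turns out to be a container, you can map its PID back to a pod. A minimal sketch, assuming a CRI runtime with `crictl` available on the node and `jq` installed (both assumptions):

```bash
# The cgroup path of a process includes its container ID on kubelet-managed nodes
cat /proc/<pid>/cgroup

# Look the container ID up with the CRI client to find the owning pod
crictl ps | grep <container-id-prefix>
crictl inspect <container-id> | jq '.status.labels'
```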

## Common Causes

1. High disk I/O from applications
2. Insufficient disk performance for the workload
3. Disk hardware issues
4. Network storage performance issues
5. Resource contention from other workloads

## Resolution Steps

1. **Short-term Mitigation**
- Identify and stop non-critical high IO processes
   - Consider moving workloads to other nodes
   - Increase the device's request queue depth (`nr_requests`) if appropriate (see the sketch below)
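
A hedged sketch of the second and third mitigations (node name, device, and the new queue depth are placeholders; confirm the value makes sense for your hardware before applying it):

```bash
# Stop new pods from landing on the saturated node while you investigate
kubectl cordon <node-name>

# Optionally move existing workloads off the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# On the node: raise the block device's request queue depth
# (takes effect immediately, does not persist across reboots)
echo 256 | sudo tee /sys/block/<device>/queue/nr_requests
```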

2. **Long-term Solutions**
- Upgrade disk hardware if consistently hitting limits
   - Implement IO throttling for problematic workloads (see the sketch after this list)
- Consider using faster storage solutions
- Optimize application IO patterns
- Implement proper resource limits and requests
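
Kubernetes has no first-class disk IO limit, so throttling is usually applied at the node level. A minimal sketch using the cgroup v2 `io.max` controller (the cgroup path, device major:minor numbers, and byte limit are all placeholders; cgroup v1 nodes use a different interface):

```bash
# Find the device's major:minor numbers
lsblk -o NAME,MAJ:MIN /dev/<device>

# Limit writes for a specific cgroup to roughly 100 MB/s (cgroup v2 only)
echo "<major>:<minor> wbps=104857600" | sudo tee /sys/fs/cgroup/<pod-cgroup-path>/io.max
```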

3. **Preventive Measures**
- Monitor disk IO patterns
- Set up proper resource quotas
   - Implement IO scheduling policies (see the sketch after this list)
- Regular performance testing
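
Checking and switching the IO scheduler is a quick way to experiment with scheduling policies (available schedulers depend on the kernel and device type; `mq-deadline` below is only an example):

```bash
# Show the active scheduler (in brackets) and the alternatives
cat /sys/block/<device>/queue/scheduler

# Switch schedulers at runtime; persist via a udev rule if the change helps
echo mq-deadline | sudo tee /sys/block/<device>/queue/scheduler
```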

## Related Alerts

- NodeDiskSpaceFillingUp
- NodeDiskSpaceAlmostFull
- NodeDiskSpaceFull

## References

- [Prometheus Node Exporter Documentation](https://github.com/prometheus/node_exporter)
- [Kubernetes Node Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
- [Linux IO Scheduler Documentation](https://www.kernel.org/doc/html/latest/block/iosched.html)