---
title: Node Disk IO Saturation Alert
weight: 20
---

# NodeDiskIOSaturation

## Alert Details

- **Alert Name**: NodeDiskIOSaturation
- **Severity**: Warning
- **Component**: Node Exporter
- **Namespace**: monitoring

## Alert Description

This alert fires when the disk IO queue size (`aqu-sz`) is high on a specific device, indicating potential disk saturation. The alert triggers when the average queue length has stayed above 10 for the last 30 minutes.
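
The exact rule expression depends on your monitoring stack. In the upstream node-exporter mixin the alert is based on `node_disk_io_time_weighted_seconds_total`, whose per-second rate approximates the average queue size. As a rough check (assuming Prometheus is reachable on `localhost:9090`, e.g. via `kubectl port-forward`, and that `jq` is installed), you can evaluate the same kind of expression manually:

```bash
# Forward the Prometheus API locally; the service name and namespace are assumptions,
# adjust them to match your monitoring stack.
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &

# Evaluate the queue-size expression the alert is typically derived from.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_disk_io_time_weighted_seconds_total{device!=""}[5m]) > 10' | jq .
```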

## Alert Context

The alert is evaluated by Prometheus from metrics exposed by the node-exporter pods running in the monitoring namespace. The underlying metric tracks the disk IO queue length for every block device on the node.
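
To confirm the exporters themselves are healthy before digging deeper (the label selector below is the same one used in the investigation steps):

```bash
# List the node-exporter pods and the nodes they run on
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-node-exporter -o wide
```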

## Investigation Steps

1. **Verify Alert Details**
- Check the specific device mentioned in the alert (e.g., sdc)
- Note the current queue length value
- Identify the affected node(s)
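
One way to pull these details without opening the UI is to query the Prometheus alerts endpoint (assuming a local port-forward on `localhost:9090` and `jq`, as above):

```bash
# Show currently firing NodeDiskIOSaturation alerts with their device and instance labels
curl -s 'http://localhost:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "NodeDiskIOSaturation") | {labels, state, value}'
```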

2. **Check Node Resources**

```bash
# Get node status
kubectl describe node <node-name>

# Check node-exporter logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-node-exporter
```

3. **Investigate Disk Performance**

```bash
# SSH into the affected node
ssh <node-ip>

# Check IO statistics (watch the aqu-sz and %util columns)
iostat -x 1

# Check the device's maximum request queue depth
cat /sys/block/<device>/queue/nr_requests

# Check CPU iowait (the "wa" value in the CPU line)
top
```

4. **Identify High IO Processes**

```bash
# List processes with high IO
iotop

# Check IO statistics per process
pidstat -d 1
```
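
If a heavy writer turns out to be a container, you can map its PID back to a pod. A minimal sketch, assuming a CRI runtime with `crictl` available on the node and `jq` installed (both assumptions):

```bash
# The cgroup path of a process includes its container ID on kubelet-managed nodes
cat /proc/<pid>/cgroup

# Look the container ID up with the CRI client to find the owning pod
crictl ps | grep <container-id-prefix>
crictl inspect <container-id> | jq '.status.labels'
```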

## Common Causes

1. High disk I/O from applications
2. Insufficient disk performance for the workload
3. Disk hardware issues
4. Network storage performance issues
5. Resource contention from other workloads

## Resolution Steps

1. **Short-term Mitigation**
- Identify and stop non-critical high IO processes
   - Consider moving workloads to other nodes
   - Increase the device's request queue depth (`nr_requests`) if appropriate (see the sketch below)
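
A hedged sketch of the second and third mitigations (node name, device, and the new queue depth are placeholders; confirm the value makes sense for your hardware before applying it):

```bash
# Stop new pods from landing on the saturated node while you investigate
kubectl cordon <node-name>

# Optionally move existing workloads off the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# On the node: raise the block device's request queue depth
# (takes effect immediately, does not persist across reboots)
echo 256 | sudo tee /sys/block/<device>/queue/nr_requests
```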

2. **Long-term Solutions**
- Upgrade disk hardware if consistently hitting limits
   - Implement IO throttling for problematic workloads (see the sketch after this list)
- Consider using faster storage solutions
- Optimize application IO patterns
- Implement proper resource limits and requests
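
Kubernetes has no first-class disk IO limit, so throttling is usually applied at the node level. A minimal sketch using the cgroup v2 `io.max` controller (the cgroup path, device major:minor numbers, and byte limit are all placeholders; cgroup v1 nodes use a different interface):

```bash
# Find the device's major:minor numbers
lsblk -o NAME,MAJ:MIN /dev/<device>

# Limit writes for a specific cgroup to roughly 100 MB/s (cgroup v2 only)
echo "<major>:<minor> wbps=104857600" | sudo tee /sys/fs/cgroup/<pod-cgroup-path>/io.max
```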

3. **Preventive Measures**
- Monitor disk IO patterns
- Set up proper resource quotas
   - Implement IO scheduling policies (see the sketch after this list)
- Regular performance testing
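
Checking and switching the IO scheduler is a quick way to experiment with scheduling policies (available schedulers depend on the kernel and device type; `mq-deadline` below is only an example):

```bash
# Show the active scheduler (in brackets) and the alternatives
cat /sys/block/<device>/queue/scheduler

# Switch schedulers at runtime; persist via a udev rule if the change helps
echo mq-deadline | sudo tee /sys/block/<device>/queue/scheduler
```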

## Related Alerts

- NodeDiskSpaceFillingUp
- NodeDiskSpaceAlmostFull
- NodeDiskSpaceFull

## References

- [Prometheus Node Exporter Documentation](https://github.com/prometheus/node_exporter)
- [Kubernetes Node Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
- [Linux IO Scheduler Documentation](https://www.kernel.org/doc/html/latest/block/iosched.html)