
Commit 05a233a

Added Jose's suggestions
Signed-off-by: Burak Ok <[email protected]>
1 parent 598e12f commit 05a233a


support/azure/azure-kubernetes/availability-performance/identify-high-disk-io-latency-containers-aks.md

Lines changed: 13 additions & 18 deletions
@@ -3,13 +3,13 @@ title: Identify containers causing high disk I/O latency in AKS clusters
 description: Learn how to identify which containers and pods are causing high disk I/O latency in your Azure Kubernetes Service clusters to easily troubleshoot issues using the open source project Inspektor Gadget.
 ms.date: 07/16/2025
 ms.author: burakok
-ms.reviewer: burakok, mayasingh
+ms.reviewer: burakok, mayasingh, blanquicet
 ms.service: azure-kubernetes-service
 ms.custom: sap:Node/node pool availability and performance
 ---
 # Troubleshoot high disk I/O latency in AKS clusters
 
-Disk I/O latency can severely impact the performance and reliability of workloads running in AKS clusters.This article shows how to use the open source project [Inspektor Gadget](https://inspektor-gadget.io/) to identify which containers and pods are causing high disk I/O latency in Azure Kubernetes Service (AKS).
+Disk I/O latency can severely impact the performance and reliability of workloads running in Azure Kubernetes Service (AKS) clusters. This article shows how to use the open source project [Inspektor Gadget](https://aka.ms/ig-website) to identify which containers and pods are causing high disk I/O latency in AKS.
 
 Inspektor Gadget provides eBPF-based gadgets that help you observe and troubleshoot disk I/O issues in Kubernetes environments.
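
The steps in this article assume Inspektor Gadget is already deployed to the cluster. If it isn't, here is a minimal setup sketch, assuming the `kubectl gadget` krew plugin (other install methods exist):

```console
# Install the kubectl gadget plugin via krew, then deploy Inspektor Gadget
# to the cluster (requires permissions to create the gadget DaemonSet).
kubectl krew install gadget
kubectl gadget deploy
```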

@@ -35,7 +35,7 @@ You may suspect disk I/O latency issues when you observe the following behaviors
 
 ### Step 1: Profile disk I/O latency with `profile_blockio`
 
-The `profile_blockio` gadget gathers information about block device I/O usage and generates a histogram distribution of I/O latency when the gadget is stopped. This helps you visualize disk I/O performance and identify latency patterns. We can use this information to gather evidence to support or refute the hypothesis that the symptoms we are seeing are due to disk I/O issues.
+The [`profile_blockio`](https://aka.ms/ig-profile-blockio) gadget gathers information about block device I/O usage and generates a histogram distribution of I/O latency when the gadget is stopped. This helps you visualize disk I/O performance and identify latency patterns. We can use this information to gather evidence to support or refute the hypothesis that the symptoms we are seeing are due to disk I/O issues.
 
 ```console
 kubectl gadget run profile_blockio --node <node-name>
@@ -78,7 +78,7 @@ latency
 33554432 -> 67108864 : 0 | |
 ```
 
-**High disk I/O stress example** (with stress-ng --hdd 10 --io 10 running):
+**High disk I/O stress example** (with `stress-ng --hdd 10 --io 10` running to simulate I/O load):
 
 ```
 latency
@@ -112,18 +112,18 @@ latency
 33554432 -> 67108864 : 0 | |
 ```
 
-**Interpreting the results**: Compare the baseline vs. stress scenarios:
+**Interpreting the results**: To identify which node has I/O pressure, you can compare the baseline vs. stress scenarios:
 - **Baseline**: Most operations (4,211 count) in the 16-32ms range, typical for normal system activity
 - **Under stress**: Significantly more operations in higher latency ranges (9,552 operations in 131-262ms, 6,778 in 262-524ms)
 - **Performance degradation**: The stress test shows operations extending into the 500ms-2s range, indicating disk saturation
 - **Concerning signs**: Look for high counts above 100ms (100,000µs) which may indicate disk performance issues
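
To reproduce a latency pattern similar to the stress example above in a test cluster, one option is to run `stress-ng` in a throwaway pod while profiling. A sketch, where `<stress-ng-image>` is a placeholder for any image that bundles `stress-ng`:

```console
# Run stress-ng with the same flags as the example above, time-boxed to 2 minutes.
kubectl run stress-hdd --namespace <namespace> --image=<stress-ng-image> \
  --restart=Never -- stress-ng --hdd 10 --io 10 --timeout 120s

# Profile the node hosting the pod while the stress run is active, then clean up.
kubectl gadget run profile_blockio --node <node-name>
kubectl delete pod stress-hdd --namespace <namespace>
```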

 ### Step 2: Find top disk I/O consumers with `top_blockio`
 
-The `top_blockio` gadget provides a periodic list of pods and containers with the highest disk I/O operations. This gadget requires kernel version 6.5 or higher (available on Azure Linux 3).
+The [`top_blockio`](https://aka.ms/ig-top-blockio) gadget provides a periodic list of containers with the highest disk I/O operations. Optionally we can limit the tracing to the node we identified in Step 1. This gadget requires kernel version 6.5 or higher (available on [Azure Linux Container Host clusters](/azure/aks/use-azure-linux)).
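
Because of the kernel requirement, it can help to confirm the nodes' kernel versions before running the gadget; `kubectl get nodes -o wide` prints them in the `KERNEL-VERSION` column:

```console
# The KERNEL-VERSION column must show 6.5 or newer for top_blockio to work.
kubectl get nodes -o wide
```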

 ```console
-kubectl gadget run top_blockio --namespace <namespace>
+kubectl gadget run top_blockio --namespace <namespace> [--node <node-name>]
 ```
 
 Sample output:
@@ -134,14 +134,14 @@ aks-nodepool1-…99-vmss000000
 aks-nodepool1-…99-vmss000000 0 0 8 0 24576 1549 6 read
 ```
 
-Identify containers with unusually high BYTES, US (time spent in microseconds), or IO counts which may indicate high disk activity. In this example, we can see significant write activity (173MB) with considerable time spent (~154 seconds total).
+From the output, we can identify containers with an unusually high number of bytes read/written to the disk (`BYTES` column), time spent on read/write operations (`US` column), or number of I/O operations (`IO` column), which may indicate high disk activity. In this example, we can see significant write activity (173MB) with considerable time spent (~154 seconds total).
 
 > [!NOTE]
 > Empty K8S.NAMESPACE, K8S.PODNAME, and K8S.CONTAINERNAME fields can occur during kernel space initiated operations or high-volume I/O. You can still use the `top_file` gadget for detailed process information when these fields are empty.
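
If you want to post-process or archive these results rather than read the column view, a sketch that assumes your Inspektor Gadget version supports the `-o json` output mode for `run`:

```console
# Emit JSON instead of the column view so results can be piped to jq or saved
# for later comparison (assumes -o json is supported by your ig version).
kubectl gadget run top_blockio --namespace <namespace> -o json
```
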
 ### Step 3: Identify files causing high disk activity with `top_file`
 
-The `top_file` gadget reports periodically the read/write activity by file, helping you identify specific files that are causing high disk activity.
+The [`top_file`](https://aka.ms/ig-top-file) gadget periodically reports the read/write activity by file, helping you identify which processes, in which containers, are causing high disk activity.
 
 ```console
 kubectl gadget run top_file --namespace <namespace>
@@ -158,14 +158,14 @@ aks-nodepool1-…99-vmss000000 default stress-hdd stress-hdd
 ...
 ```
 
-This output shows which files are being accessed most frequently, helping you pinpoint specific files contributing to disk latency. In this example, the stress-hdd pod is creating multiple temporary files with significant write activity (18-23MB each).
+This output shows which files are being accessed most frequently, helping you pinpoint which specific file a given process is reading or writing the most. In this example, the stress-hdd pod is creating multiple temporary files with significant write activity (18-23MB each).
 
 ### Root cause analysis workflow
 
 By combining all three gadgets, you can trace disk latency issues from symptoms to root cause:
 
-1. **`profile_blockio`** identifies that disk latency exists (high counts in 100ms+ ranges)
-2. **`top_blockio`** shows which processes are consuming the most disk I/O (173MB writes with 154 seconds total time spent)
+1. **`profile_blockio`** identifies that disk latency exists on a given node (high counts in 100ms+ ranges)
+2. **`top_blockio`** shows which processes are generating the most disk I/O (173MB writes with 154 seconds total time spent)
 3. **`top_file`** reveals the specific files and commands causing the issue (stress command creating /stress.* files)
 
 This complete visibility allows you to:
@@ -180,15 +180,10 @@ With this information, you can take targeted action rather than making broad sys
 
 Based on the results from these gadgets, you can take the following actions:
 
-- **High latency in `profile_blockio`**: Investigate the underlying disk performance, consider using premium SSD or Ultra disk storage
+- **High latency in `profile_blockio`**: Investigate the underlying disk performance and, if the workload needs better disk performance, consider using [storage optimized nodes](/azure/virtual-machines/sizes/overview#storage-optimized)
 - **High I/O operations in `top_blockio`**: Review application logic to optimize disk access patterns or implement caching
 - **Specific files in `top_file`**: Analyze if files can be moved to faster storage, cached, or if application logic can be optimized
 
-For further troubleshooting:
-- Move disk-intensive workloads to dedicated node pools with faster storage
-- Implement application-level caching to reduce disk I/O
-- Consider using Azure managed services (like Azure Database) for data-intensive operations
-
 ## Related content
 
 - [Inspektor Gadget documentation](https://inspektor-gadget.io/docs/latest/gadgets/)
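
Tying back to the storage optimized nodes suggestion above: one way to act on it is to add a dedicated node pool with a storage optimized VM size and steer disk-heavy workloads there. A sketch with the Azure CLI; the resource group, cluster, and pool names are placeholders, and `Standard_L8s_v3` is just one example of a storage optimized size:

```console
# Add a storage optimized (Lsv3-series) node pool and label it so disk-heavy
# workloads can be scheduled onto it with a matching nodeSelector.
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name storagepool \
  --node-vm-size Standard_L8s_v3 \
  --node-count 1 \
  --labels workload=disk-intensive
```

Pods that need the faster disks can then target the new pool with a matching `nodeSelector` (for example, `workload: disk-intensive`) in their spec.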
