Commit 38f9f9d

tech review updates
1 parent f81c150 commit 38f9f9d

2 files changed (+25, -27 lines)


support/azure/azure-kubernetes/availability-performance/identify-memory-saturation-aks.md

Lines changed: 7 additions & 7 deletions
@@ -1,9 +1,9 @@
 ---
 title: Troubleshoot Memory Saturation in AKS Clusters
 description: Troubleshoot memory saturation in Azure Kubernetes Service (AKS) clusters across namespaces and containers. Learn how to identify the hosting node.
-ms.date: 06/27/2025
+ms.date: 08/18/2025
 editor: v-jsitser
-ms.reviewer: chiragpa, aritraghosh, v-leedennis
+ms.reviewer: chiragpa, aritraghosh, v-leedennis, v-liuamson
 ms.service: azure-kubernetes-service
 ms.custom: sap:Node/node pool availability and performance
 ---
@@ -61,9 +61,9 @@ Container Insights is a feature within AKS that monitors container workload perf
 1. Because the first node has the highest memory usage, select that node to investigate the memory usage of the pods that are running on the node.

 :::image type="complex" source="./media/identify-memory-saturation-aks/containers-containerinsights-memorypressure.png" alt-text="Azure portal screenshot of a node's containers under the Nodes view in Container Insights within an Azure Kubernetes Service (AKS) cluster." lightbox="./media/identify-memory-saturation-aks/containers-containerinsights-memorypressure.png":::
-
+
 The Azure portal screenshot shows a table of nodes. The first node is expanded to display an **Other processes** heading and a sublist of processes that are running within the first node. As for the nodes themselves, the table column values for the processes include **Name**, **Status**, **Max %** (the percentage of memory capacity that's used), **Max** (memory usage), **Containers**, **UpTime**, **Controller**, and **Trend Max % (1 bar = 15m)**. The processes also have an expand/collapse arrow icon next to their names.
-
+
 Nine processes are listed under the node. The statuses are all **Ok**, the maximum percentage of memory used for the processes ranges from 16 to 0.3 percent, the maximum memory used is from 0.7 mc to 22 mc, the number of containers used is 1 to 3, and the uptime is 3 to 4 days. Unlike for the node, the processes all have a corresponding controller listed. In this screenshot, the controller names are prefixes of the process names, and they're hyperlinked.
 :::image-end:::

@@ -91,7 +91,7 @@ This procedure uses the kubectl commands in a console. It displays only the curr
 aks-testmemory-30616462-vmss000002 74m 3% 1715Mi 31%
 ```

-2. Get the list of pods that are running on the node and their memory usage by running the [kubectl get pods](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#get) and [kubectl top pods](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-pod-em-) commands:
+2. Get the list of pods that are running on the node and their memory usage by running the [kubectl get pods](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#get) and [kubectl top pods](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-pod-em-) commands:

 ```bash
 kubectl get pods --all-namespaces --output wide \
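
As a companion to the step in this hunk, here's a minimal sketch of one way to complete a node-scoped pod listing; the `--field-selector` continuation and the sorted `kubectl top` call are illustrative assumptions rather than the article's exact text, and the node name is copied from the sample output above:

```bash
# List the pods scheduled on one node (node name taken from the sample output above)
kubectl get pods --all-namespaces --output wide \
  --field-selector spec.nodeName=aks-testmemory-30616462-vmss000002

# Show per-pod memory usage, highest consumers first (requires the Metrics Server add-on)
kubectl top pods --all-namespaces --sort-by=memory
```
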
@@ -205,7 +205,7 @@ For advanced process level memory analysis, use [Inspektor Gadget](https://go.mi
 aks-agentpool-3…901-vmss000001 default memory-stress 21677 stress 944 MB 872 MB 5.2
 aks-agentpool-3…901-vmss000001 default memory-stress 21679 stress 944 MB 796 MB 4.8

-```
+```

 You can use this output to identify the processes that are consuming the most memory on the node. The output can include the node name, namespace, pod name, container name, process ID (PID), command name (COMM), CPU, and memory usage. For more details, see [the documentation](https://aka.ms/igtopprocess).
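
If Inspektor Gadget isn't installed, one alternative cross-check (not from the commit or the article) is a node debug session with standard Linux tooling; the image is a placeholder and `<node-name>` follows the article's placeholder style:

```bash
# Open an interactive debug pod on the node (any image with a shell works; busybox is a placeholder)
kubectl debug node/<node-name> -it --image=busybox

# Inside the debug pod, enter the node's filesystem and list processes by resident memory
chroot /host
ps aux --sort=-rss | head -n 15
```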

@@ -215,7 +215,7 @@ Review the following table to learn how to implement best practices for avoiding

 | Best practice | Description |
 |---|---|
-| Use memory [requests and limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits) | Kubernetes provides options to specify the minimum memory size (*request*) and the maximum memory size (*limit*) for a container. By configuring limits on pods, you can avoid memory pressure on the node. Make sure that the aggregate limits for all pods that are running doesn't exceed the node's available memory. This situation is called *overcommitting*. The Kubernetes scheduler allocates resources based on set requests and limits through [Quality of Service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) (QoS). Without appropriate limits, the scheduler might schedule too many pods on a single node. This situation might eventually bring down the node. Additionally, while the kubelet is evicting pods, it prioritizes pods in which the memory usage exceeds their defined requests. We recommend that you set the memory request close to the actual usage. |
+| Use memory [requests and limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits) | Kubernetes provides options to specify the minimum memory size (_request_) and the maximum memory size (_limit_) for a container. By configuring limits on pods, you can avoid memory pressure on the node. Make sure that the aggregate limits for all pods that are running doesn't exceed the node's available memory. This situation is called _overcommitting_. The Kubernetes scheduler allocates resources based on set requests and limits through [Quality of Service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) (QoS). Without appropriate limits, the scheduler might schedule too many pods on a single node. This situation might eventually bring down the node. Additionally, while the kubelet is evicting pods, it prioritizes pods in which the memory usage exceeds their defined requests. We recommend that you set the memory request close to the actual usage. |
 | Enable the [horizontal pod autoscaler](/azure/aks/tutorial-kubernetes-scale?tabs=azure-cli#autoscale-pods) | By scaling the cluster, you can balance the requests across many pods to prevent memory saturation. This technique can reduce the memory footprint on the specific node. |
 | Use [anti-affinity tags](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) | For scenarios in which memory is unbounded by design, you can use node selectors and affinity or anti-affinity tags, which can isolate the workload to specific nodes. By using anti-affinity tags, you can prevent other workloads from scheduling pods on these nodes and reduce the memory saturation problem. |
 | Choose [higher SKU VMs](https://azure.microsoft.com/pricing/details/virtual-machines/linux/) | VMs that have more random-access memory (RAM) are better suited to handle high memory usage. To use this option, you must create a new node pool, cordon the nodes (make them unschedulable), and drain the existing node pool. |
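
To make the requests-and-limits row concrete, here's a minimal sketch (not part of the commit) that sets a memory request and limit on a deployment and checks the resulting QoS class; `my-app`, the sizes, and the `app=my-app` pod label are assumed example values:

```bash
# Set a memory request close to observed usage and a limit that caps the container
kubectl set resources deployment my-app --requests=memory=256Mi --limits=memory=512Mi

# Verify the applied values and the resulting Quality of Service class
kubectl get deployment my-app -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
kubectl get pods -l app=my-app -o jsonpath='{.items[0].status.qosClass}{"\n"}'
```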

support/azure/azure-kubernetes/availability-performance/troubleshoot-oomkilled-aks-clusters.md

Lines changed: 18 additions & 20 deletions
@@ -1,13 +1,13 @@
 ---
 title: Troubleshoot OOMkilled in AKS clusters
 description: Troubleshoot and resolve out-of-memory (OOMkilled) issues in Azure Kubernetes Service (AKS) clusters.
-ms.date: 08/13/2025
+ms.date: 08/18/2025
 editor: v-jsitser
 ms.reviewer: v-liuamson
 ms.service: azure-kubernetes-service
 ms.custom: sap:Node/node pool availability and performance
 ---
-# Troubleshooting OOMKilled in AKS clusters
+# Troubleshooting OOMKilled in AKS clusters

 ## Understanding OOM Kills

@@ -96,47 +96,47 @@ You can use one of these various methods to identify the POD which is killed due

 Use the following command to check the status of all pods in a namespace:

-- `kubectl get pods -n \<namespace\>`
+- `kubectl get pods -n <namespace>`

 Look for pods with statuses of OOMKilled.

 ### Describe the Pod

-Use `kubectl describe pod \<pod-name\>` to get detailed information about the pod.
+Use `kubectl describe pod <pod-name>` to get detailed information about the pod.

-- `kubectl describe pod \<pod-name\> -n \<namespace\>`
+- `kubectl describe pod <pod-name> -n <namespace>`

 In the output, check the Container Statuses section for indications of OOM kills.

 ### Pod Logs

-Review pod logs using `kubectl logs \<pod-name\>` to identify memory-related issues.
+Review pod logs using `kubectl logs <pod-name>` to identify memory-related issues.

 To view the logs of the pod, use:

-- `kubectl logs \<pod-name\> -n \<namespace\>`
+- `kubectl logs <pod-name> -n <namespace>`

 If the pod has restarted, check the previous logs:

-- `kubectl logs \<pod-name\> -n \<namespace\> \--previous`
+- `kubectl logs <pod-name> -n <namespace> --previous`

 ### Node Logs

 You can [review the kubelet logs](/azure/aks/kubelet-logs) on the node to see if there are messages indicating that the OOM killer was triggered at the time of the issue and that pod's memory usage reached its limit.

 Alternatively, you can [SSH into the node](/azure/aks/node-access) where the pod was running and check the kernel logs for any OOM messages. This command will display which processes the OOM killer terminated:

-`chroot /host \# access the node session`
+`chroot /host # access the node session`

-`grep -i \"Memory cgroup out of memory\" /var/log/syslog`
+`grep -i "Memory cgroup out of memory" /var/log/syslog`

 ### Events

-- Use `kubectl get events \--sort-by=.lastTimestamp -n \<namespace\>` to find OOMKilled pods.
+- Use `kubectl get events --sort-by=.lastTimestamp -n <namespace>` to find OOMKilled pods.

 - Use the events section from the pod description to look for OOM-related messages:

-- `kubectl describe pod \<pod-name\> -n \<namespace\>`
+- `kubectl describe pod <pod-name> -n <namespace>`

 ## Handling OOMKilled for system pods
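
As a quick companion to the identification steps in the hunk above (not part of the commit), you can surface recent OOM kills without inspecting each pod by hand; the jsonpath expression is an illustrative assumption:

```bash
# Print each pod's name with the last termination reason of its containers; look for "OOMKilled"
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Or filter recent events for OOM-related messages
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i oom
```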

@@ -186,8 +186,8 @@ restart.
 To solve, review request and limits documentation to understand how to modify
 your deployment accordingly. For more information, see [Resource Management for Pods and Containers](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits).

-`kubectl set resources deployment \<deployment-name\>
-\--limits=memory=\<LIMITS\>Mi ---requests=memory=\<MEMORY\>Mi`
+`kubectl set resources deployment <deployment-name>
+--limits=memory=<LIMITS>Mi ---requests=memory=<MEMORY>Mi`

 Setting resource requests and limits to the recommended amount for the
 application pod.
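
For reference, a runnable form of the command in this hunk (kubectl expects `--requests` with exactly two leading dashes); the placeholders follow the article's style:

```bash
# Apply a memory request and limit to the deployment; this triggers a rolling restart of its pods
kubectl set resources deployment <deployment-name> \
  --limits=memory=<LIMITS>Mi --requests=memory=<MEMORY>Mi

# Confirm what was applied
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
```
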
@@ -201,24 +201,22 @@ Confirm the Memory Pressure at the pod level:

 Use kubectl top to check memory usage:

-`kubectl top pod \<pod-name\> -n \<namespace\>`
+`kubectl top pod <pod-name> -n <namespace>`

 If metrics are unavailable, you can inspect cgroup stats directly:

-`kubectl exec -it \<pod-name\> -n \<namespace\> \-- cat
-/sys/fs/cgroup/memory.current`
+`kubectl exec -it <pod-name> -n <namespace> -- cat /sys/fs/cgroup/memory.current`

 Or you can use this to see the value in MB:

-`kubectl exec -it \<pod-name\> -n \<namespace\> \-- cat
-/sys/fs/cgroup/memory.current \| awk \'{print \$1/1024/1024 \" MB\"}\'`
+`kubectl exec -it <pod-name> -n <namespace> -- cat /sys/fs/cgroup/memory.current | awk '{print $1/1024/1024 " MB"}'`

 This helps to confirm whether the pod is approaching or exceeding its
 memory limits.

 - Check for OOMKilled Events

-`kubectl get events \--sort-by=\'.lastTimestamp\' -n \<namespace\>`
+`kubectl get events --sort-by='.lastTimestamp' -n <namespace>`

 To resolve, engage the application vendor. If the app is from a third party, check
 if they have known issues or memory tuning guides. Also, depending on the application framework, ask the vendor to verify whether they are using the latest version of Java or .Net as recommended in [Memory saturation occurs in pods after cluster upgrade to Kubernetes 1.25](../create-upgrade-delete/aks-memory-saturation-after-upgrade.md).
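
To complement the cgroup check in this hunk, here's a sketch (not from the commit) that compares current usage with the cgroup limit in one call; the paths assume cgroup v2 (on cgroup v1 nodes the files are `memory/memory.usage_in_bytes` and `memory/memory.limit_in_bytes`):

```bash
# Compare the container's current memory usage with its cgroup limit (values in bytes; "max" means unlimited)
kubectl exec -it <pod-name> -n <namespace> -- sh -c \
  'echo "usage: $(cat /sys/fs/cgroup/memory.current)"; echo "limit: $(cat /sys/fs/cgroup/memory.max)"'
```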
