
Commit 528103b

amsliu committed
AB#7058: Troubleshoot OOMkilled in AKS clusters
1 parent aaab486 commit 528103b

3 files changed: +246 −2 lines changed


support/azure/azure-kubernetes/availability-performance/identify-memory-saturation-aks.md

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ This article discusses methods for troubleshooting memory saturation issues. Mem

## Prerequisites

- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line tool. To install kubectl by using [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
-- The open source project [Inspektor Gadget](../logs/capture-system-insights-from-aks.md#what-is-inspektor-gadget) for advanced process level memory analysis. For more information, see [How to install Inspektor Gadget in an AKS cluster](../logs/capture-system-insights-from-aks.md#how-to-install-inspektor-gadget-in-an-aks-cluster).
+- The open source project [Inspektor Gadget](../logs/capture-system-insights-from-aks.md#what-is-inspektor-gadget) for advanced process level memory analysis. For more information, see [How to install Inspektor Gadget in an AKS cluster](../logs/capture-system-insights-from-aks.md#how-to-install-inspektor-gadget-in-an-aks-cluster).

## Symptoms

@@ -25,7 +25,7 @@ The following table outlines the common symptoms of memory saturation.
| Unschedulable pods | Additional pods can't be scheduled if the node is close to its set memory limit. |
| Pod eviction | If a node is running out of memory, the kubelet can evict pods. Although the control plane tries to reschedule the evicted pods on other nodes that have resources, there's no guarantee that other nodes have sufficient memory to run these pods. |
| Node not ready | Memory saturation can cause `kubelet` and `containerd` to become unresponsive, eventually causing node readiness issues. |
-| Out-of-memory (OOM) kill | An OOM problem occurs if the pod eviction can't prevent a node issue. |
+| Out-of-memory (OOM) kill | An OOM problem occurs if the pod eviction can't prevent a node issue. For more information, see [Troubleshoot OOMkilled in AKS clusters](./troubleshoot-oomkilled-in-aks-clusters.md).|

## Troubleshooting checklist
support/azure/azure-kubernetes/availability-performance/troubleshoot-oomkilled-in-aks-clusters.md

Lines changed: 242 additions & 0 deletions

@@ -0,0 +1,242 @@
---
title: Troubleshoot OOMkilled in AKS clusters
description: Troubleshoot and resolve out-of-memory (OOMkilled) issues in Azure Kubernetes Service (AKS) clusters.
ms.date: 08/13/2025
editor: v-jsitser
ms.reviewer: v-liuamson
ms.service: azure-kubernetes-service
ms.custom: sap:Node/node pool availability and performance
---

# Troubleshooting OOMKilled in AKS clusters

## Understanding OOM Kills

When you run workloads in Azure Kubernetes Service (AKS), you might encounter out-of-memory errors that cause restarts of your system or application pods. This guide helps you identify and resolve out-of-memory (OOMKilled) issues on AKS cluster nodes.

> [!IMPORTANT]
> Use the [AKS Diagnose and Solve Problems](/azure/aks/aks-diagnostics#open-aks-diagnose-and-solve-problems) section in the Azure portal to help address memory issues in your cluster.

### OOMKilled and Evictions Explained

In Kubernetes environments such as AKS, memory-related issues can result in two distinct types of events: **OOMKilled** and **Evictions**. Although both are triggered by resource pressure, they differ in cause, scope, and behavior.

OOMKilled is reported only for containers that are terminated by the kernel OOM killer. It's the container that exceeds its memory limit that gets terminated (and, by default, restarted), not the whole pod. Evictions, on the other hand, happen at the pod level and are triggered by Kubernetes, specifically by the **kubelet** that runs on every node, when the node is running low on memory. Evicted pods report a status of **Failed** and a reason of **Evicted**.

### OOMKilled: Container-Level Termination

**OOMKilled** occurs when a **container** exceeds its memory limit and is terminated by the **Linux kernel's Out-Of-Memory (OOM) killer**.

This is a container-specific event. Only the container that breaches its memory limit is affected. The pod may continue running if it contains other healthy containers. The terminated container is typically restarted automatically.

Common indicators include **exit code 137** and the reason **OOMKilled** in `kubectl describe pod`.
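For a quick check of a specific pod, a sketch such as the following prints the last termination reason and exit code for each container. The `<pod-name>` and `<namespace>` values are placeholders.

```bash
# Print each container's last termination reason and exit code for a pod.
# A reason of OOMKilled with exit code 137 indicates that the kernel OOM killer
# terminated the container. Containers that never terminated show empty values.
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": reason="}{.lastState.terminated.reason}{", exitCode="}{.lastState.terminated.exitCode}{"\n"}{end}'
```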

### Evictions: Pod-Level Removal by Kubelet

While this guide focuses on **OOMKilled**, it's useful to understand that **Evictions** are a separate mechanism in Kubernetes. They occur at the **pod level**, for instance when the [node is under memory pressure](./identify-memory-saturation-aks.md).

> [!NOTE]
> This guide doesn't cover all probable causes of pod eviction, because its scope is limited to memory-related OOMKilled events.

- The **kubelet** may evict pods to free up memory and maintain node stability.

- Evicted pods show a status of **Failed** and a reason of **Evicted** (see the command after this list).

- Unlike OOMKilled, which targets individual containers, evictions affect the entire pod.

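To see whether pods were evicted rather than OOMKilled, a quick check such as the following can help. This is a sketch; the field selector lists pods in the Failed phase, which includes evicted pods.

```bash
# List pods in the Failed phase across all namespaces; evicted pods show the reason Evicted.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```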
## Possible Causes of OOM Kills

OOMKilled events can occur for several reasons. The following are the most common causes:

- **Resource overcommitment**: Pod resource requests and limits aren't set appropriately, which leads to excessive memory usage (see the node-level check after this list).

- **Memory leaks**: Applications might have memory leaks that cause them to consume more memory over time.

- **High workload**: Sudden spikes in application load can push memory usage beyond the allocated limits.

- **Insufficient node resources**: The node might not have enough memory to support the running pods, which leads to OOM kills.

- **Inefficient resource management**: A lack of resource quotas and limits can lead to uncontrolled resource consumption.

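To check for resource overcommitment and insufficient node resources, a node-level check might look like the following sketch. The `<node-name>` value is a placeholder, and `kubectl top` requires the Metrics Server.

```bash
# Compare the memory requests and limits scheduled onto a node against its allocatable capacity.
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# Show current memory usage per node (requires the Metrics Server).
kubectl top nodes
```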
## Identifying OOM Killed Pods

You can use any of the following methods to identify the pod that was killed because of memory pressure:

> [!NOTE]
> OOMKilled can happen to both **system pods (pods in the kube-system namespace created by AKS)** and **user pods (pods in other namespaces)**. It's essential to first identify which pods are affected before you take further action.

> [!IMPORTANT]
> Use the [AKS Diagnose and Solve Problems](/azure/aks/aks-diagnostics#open-aks-diagnose-and-solve-problems) section in the Azure portal to help address memory issues in your cluster.

### Check Pod Status

Use the following command to check the status of all pods in a namespace:

- `kubectl get pods -n <namespace>`

Look for pods that have a status of OOMKilled.
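Because the OOMKilled status disappears from the `kubectl get pods` output after the container restarts, it can also help to scan the last termination reason of every container. The following is a sketch:

```bash
# List containers whose last termination reason was OOMKilled, across all namespaces.
kubectl get pods --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,POD:.metadata.name,LAST_TERMINATION_REASON:.status.containerStatuses[*].lastState.terminated.reason \
  | grep OOMKilled
```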

### Describe the Pod

Use `kubectl describe pod <pod-name>` to get detailed information about the pod:

- `kubectl describe pod <pod-name> -n <namespace>`

In the output, check the Container Statuses section for indications of OOM kills.
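To narrow the `kubectl describe pod` output to the container state details, a filter such as the following can help (a sketch):

```bash
# Show the last termination state, reason, and exit code sections from the pod description.
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State:"
```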

### Pod Logs

Review pod logs by using `kubectl logs <pod-name>` to identify memory-related issues.

To view the logs of the pod, use:

- `kubectl logs <pod-name> -n <namespace>`

If the pod has restarted, check the previous logs:

- `kubectl logs <pod-name> -n <namespace> --previous`

### Node Logs

You can [review the kubelet logs](/azure/aks/kubelet-logs) on the node to see whether there are messages indicating that the OOM killer was triggered at the time of the issue and that the pod's memory usage reached its limit.

Alternatively, you can [SSH into the node](/azure/aks/node-access) where the pod was running and check the kernel logs for any OOM messages. The following commands display which processes the OOM killer terminated:

`chroot /host # access the node session`

`grep -i "Memory cgroup out of memory" /var/log/syslog`
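If SSH isn't set up, `kubectl debug` can open a node session for the same checks. The following is a sketch; the container image is only an example, and `dmesg` is an alternative source for kernel OOM messages.

```bash
# Open an interactive debugging pod on the node (the image is an example; any image with a shell works).
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0

# Inside the debug session:
chroot /host            # access the node's filesystem
grep -i "Memory cgroup out of memory" /var/log/syslog
dmesg | grep -i "oom"   # alternative source of kernel OOM killer messages
```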

### Events

- Use `kubectl get events --sort-by=.lastTimestamp -n <namespace>` to find OOMKilled pods.

- Use the events section from the pod description to look for OOM-related messages:

  - `kubectl describe pod <pod-name> -n <namespace>`

## Handling OOMKilled for system pods

> [!NOTE]
> System pods are the pods that are located in the kube-system namespace and created by AKS.

### metrics-server

**Issue:**

- OOMKilled because of insufficient resources.

**Solution:**

- [Configure Metrics Server VPA](/azure/aks/use-metrics-server-vertical-pod-autoscaler) to allocate additional resources to the Metrics Server.

### CoreDNS

**Issue:**

- OOMKilled because of traffic spikes.

**Solution:**

- [Customize CoreDNS scaling](/azure/aks/coredns-custom) to configure its autoscaling based on the workload requirements.

### Other pods in the kube-system namespace

Make sure that the cluster includes a [system node pool](/azure/aks/use-system-pools?tabs=azure-cli) and a user node pool to isolate memory-heavy user workloads from system pods. You should also confirm that the system node pool has at least three nodes.

## Handling OOMKilled for User pods

User pods may be OOMKilled because of insufficient memory limits or excessive memory consumption. Solutions include setting appropriate resource requests and limits and engaging application vendors to investigate memory usage.

### Cause 1: User workloads may be running in a system node pool

We recommend that you create user node pools for user workloads. For more information, see [Manage system node pools in Azure Kubernetes Service (AKS)](/azure/aks/use-system-pools).
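If the cluster has only a system node pool, you can add a dedicated user node pool so that application workloads don't compete with system pods for memory. The following Azure CLI sketch uses placeholder names and an illustrative node count:

```bash
# Add a user-mode node pool for application workloads (names and sizes are illustrative).
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name userpool \
  --mode User \
  --node-count 3
```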

### Cause 2: Application pod keeps restarting due to OOMKilled

This behavior might occur because the pod doesn't have enough memory assigned to it and requires more, which causes the pod to restart constantly.

To resolve this issue, review the requests and limits documentation to understand how to modify your deployment accordingly. For more information, see [Resource Management for Pods and Containers](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits).

`kubectl set resources deployment <deployment-name> --limits=memory=<LIMITS>Mi --requests=memory=<MEMORY>Mi`

Set resource requests and limits to the recommended amount for the application pod.
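For example, a sketch with illustrative values (adjust them to the application's actual usage, and keep requests close to typical usage with limits as a ceiling):

```bash
# Illustrative values only; set requests near typical usage and limits as an upper bound.
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
```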

To diagnose the issue, see [Azure Kubernetes Service (AKS) Diagnose and Solve Problems overview](/azure/aks/aks-diagnostics).

### Cause 3: Application running in pod is consuming excessive memory

Confirm the memory pressure at the pod level.

Use `kubectl top` to check memory usage:

`kubectl top pod <pod-name> -n <namespace>`
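If the pod runs multiple containers, the `--containers` flag shows which one is consuming the memory (a sketch; requires the Metrics Server):

```bash
# Show memory usage broken down per container in the pod.
kubectl top pod <pod-name> -n <namespace> --containers
```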

If metrics are unavailable, you can inspect cgroup stats directly:

`kubectl exec -it <pod-name> -n <namespace> -- cat /sys/fs/cgroup/memory.current`

Or you can use the following command to see the value in MB:

`kubectl exec -it <pod-name> -n <namespace> -- cat /sys/fs/cgroup/memory.current | awk '{print $1/1024/1024 " MB"}'`

This helps you confirm whether the pod is approaching or exceeding its memory limits.

- Check for OOMKilled events:

  `kubectl get events --sort-by='.lastTimestamp' -n <namespace>`

To resolve the issue, engage the application vendor. If the app is from a third party, check whether the vendor has known issues or memory tuning guides. Also, depending on the application framework, ask the vendor to verify whether they're using the latest version of Java or .NET as recommended in [Memory saturation occurs in pods after cluster upgrade to Kubernetes 1.25](../create-upgrade-delete/aks-memory-saturation-after-upgrade.md).

The application vendor or team can investigate the application to determine why it's using so much memory, such as checking for a memory leak or assessing whether the app needs higher memory resource limits in the pod configuration. In the meantime, to temporarily mitigate the problem, increase the memory limit for the containers that experience OOMKill events.

## Avoiding OOMKill in the future

### Validating and Setting resource limits

Review and set appropriate [resource requests and limits](/azure/aks/developer-best-practices-resource-management#define-pod-resource-requests-and-limits) for all application pods. Use `kubectl top` and `cgroup` stats to validate memory usage.
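The following is a minimal sketch of a deployment that sets explicit memory requests and limits. The name, image, and values are illustrative only.

```bash
# Illustrative deployment manifest with explicit memory requests and limits.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: sample-app
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1   # example image only
        resources:
          requests:
            memory: "256Mi"   # what the scheduler reserves for the container
          limits:
            memory: "512Mi"   # exceeding this triggers the OOM killer for the container
EOF
```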

### Setting up system and user node pools

Make sure that the cluster includes a [system node pool](/azure/aks/use-system-pools?tabs=azure-cli) and a user node pool to isolate memory-heavy user workloads from system pods. You should also confirm that the system node pool has at least three nodes.

### Application assessment

To ensure optimal performance and avoid memory starvation within your AKS cluster, we recommend reviewing the resource usage patterns of your application. Specifically, assess whether the application requests appropriate memory limits and whether its behavior under load aligns with the allocated resources. This evaluation helps identify whether adjustments are needed to prevent pod evictions or OOMKilled events.

support/azure/azure-kubernetes/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -176,6 +176,8 @@
     href: availability-performance/cluster-service-health-probe-mode-issues.md
   - name: Troubleshoot pod scheduler errors
     href: availability-performance/troubleshoot-pod-scheduler-errors.md
+  - name: Troubleshoot OOMkilled in AKS clusters
+    href: availability-performance/troubleshoot-oomkilled-in-aks-clusters.md
   - name: Troubleshoot node not ready
     items:
     - name: Basic troubleshooting
