Commit e3ccef8

Provides documentation to handle out of memory pods
1 parent c1b9ec6 commit e3ccef8

File tree

2 files changed: +65 -0 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -284,6 +284,8 @@
     href: troubleshoot-control-plane-quorum.md
   - name: Troubleshoot Accepted Cluster Resource
     href: troubleshoot-accepted-cluster-hydration.md
+  - name: Troubleshoot Out of Memory Pods
+    href: troubleshoot-memory-limits.md
   - name: BareMetal Actions
     expanded: false
     items:
```
articles/operator-nexus/troubleshoot-memory-limits.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
---
title: Troubleshoot container memory limits
description: Learn how to troubleshoot Kubernetes container memory limits on Azure Operator Nexus.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 11/01/2024
ms.author: matthewernst
author: matternst7258
---

# Troubleshoot container memory limits

## Alerting for memory limits

It's recommended to set up alerts for the Operator Nexus cluster that watch for Kubernetes pods restarting because of OOMKill errors. These alerts let customers know whether a component on a server is working properly.
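
For example, assuming the cluster's container logs flow into a Log Analytics workspace with Container insights tables such as `KubePodInventory` (the workspace, rule name, and query here are placeholders, not prescribed by this article), a log-search alert rule could be sketched like this:

```azcli
az monitor scheduled-query create \
    --name "nc-system-oomkill" \
    --resource-group "<alertResourceGroup>" \
    --scopes "<logAnalyticsWorkspaceResourceId>" \
    --condition "count 'oom' > 0" \
    --condition-query oom="KubePodInventory | where ContainerLastStatus has 'OOMKilled'" \
    --evaluation-frequency 5m \
    --window-size 15m \
    --description "Pod restarting due to OOMKill"
```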

## Identifying Out of Memory (OOM) pods

Start by identifying any components that are restarting or show OOMKill errors.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
    --limit-time-seconds 60 \
    --commands "[{command:'kubectl get',arguments:[pods,-n,nc-system]}]" \
    --resource-group "<cluster_MRG>" \
    --subscription "<subscription>"
```
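
In the output, look for pods with a high `RESTARTS` count or a `CrashLoopBackOff` status. For illustration only (the pod name and counts are hypothetical), an affected pod might appear as:

```output
NAME                           READY   STATUS             RESTARTS   AGE
cdi-operator-7d5c8b9f6-x2x3j   0/1     CrashLoopBackOff   12         3d4h
```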

Once a pod is identified, a `kubectl describe pod` command shows its status and restart count.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
    --limit-time-seconds 60 \
    --commands "[{command:'kubectl describe',arguments:[pod,<podName>,-n,nc-system]}]" \
    --resource-group "<cluster_MRG>" \
    --subscription "<subscription>"
```
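
When a container was OOM-killed, the `Last State` section of the describe output records the termination reason and exit code 137. A trimmed, illustrative excerpt:

```output
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  7
```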

At the same time, a `kubectl get events` command provides history showing how frequently the pod restarts.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
    --limit-time-seconds 60 \
    --commands "[{command:'kubectl get',arguments:[events,-n,nc-system,|,grep,<podName>]}]" \
    --resource-group "<cluster_MRG>" \
    --subscription "<subscription>"
```
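
As an alternative to piping through `grep`, `kubectl get events` supports filtering by the involved object's name with `--field-selector`; a sketch of the same read command with that filter (whether the run-read-command argument parser passes the `=` expression through unchanged is an assumption):

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
    --limit-time-seconds 60 \
    --commands "[{command:'kubectl get',arguments:[events,-n,nc-system,--field-selector,involvedObject.name=<podName>]}]" \
    --resource-group "<cluster_MRG>" \
    --subscription "<subscription>"
```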

The data from these commands identifies whether a pod is restarting because of `OOMKill`.

## Patching memory limits

It's recommended that all memory limit changes be reported to Microsoft Support for further investigation or adjustments.

> [!WARNING]
> Memory limit patches applied to a pod aren't permanent and can be overwritten if the pod restarts.
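
Before reporting, it can help to capture the pod's current requests and limits for the support case. A sketch using the same read-only command (the `-o yaml` output includes each container's `resources` stanza):

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
    --limit-time-seconds 60 \
    --commands "[{command:'kubectl get',arguments:[pod,<podName>,-n,nc-system,-o,yaml]}]" \
    --resource-group "<cluster_MRG>" \
    --subscription "<subscription>"
```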

## Known services susceptible to OOM issues

* cdi-operator
* vulnerability-operator
* cluster-metadata-operator
