
Commit 6ef24b5

Merge pull request #289671 from matternst7258/matternst7258/memory-limits
[operator-nexus] Identify memory limits for container pods
2 parents 4f010f2 + c7c65ff commit 6ef24b5


2 files changed: +97 -0 lines changed


articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -284,6 +284,8 @@
   href: troubleshoot-control-plane-quorum.md
 - name: Troubleshoot Accepted Cluster Resource
   href: troubleshoot-accepted-cluster-hydration.md
+- name: Troubleshoot Out of Memory Pods
+  href: troubleshoot-memory-limits.md
 - name: BareMetal Actions
   expanded: false
   items:
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
---
title: Troubleshoot container memory limits
description: Troubleshooting Kubernetes container limits
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 11/01/2024
ms.author: matthewernst
author: matternst7258
---

# Troubleshoot container memory limits

## Alerting for memory limits

Set up alerts on the Operator Nexus cluster for Kubernetes pods that restart because of `OOMKilled` errors. These alerts let customers know whether a component on a server is working as expected.

The following metrics help identify memory limit problems:

| Metric display name                  | Metric name                                      |
| ------------------------------------ | ------------------------------------------------ |
| Container Restarts                   | `kube_pod_container_status_restarts_total`       |
| Container Status Terminated Reason   | `kube_pod_container_status_terminated_reason`    |
| Container Resource Limits            | `kube_pod_container_resource_limits`             |

`Container Status Terminated Reason` reports `OOMKilled` as the termination reason for affected pods.

## Identifying Out of Memory (OOM) pods

Start by identifying any components that are restarting or that show a status of `OOMKilled`.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[pods,-n,nc-system]}]" \
  --resource-group "<cluster_MRG>" \
  --subscription "<subscription>"
```
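As a rough aid for reading the pod list, the following sketch filters `kubectl get pods` text for pods that show `OOMKilled` or `CrashLoopBackOff`, or that have a nonzero restart count. It assumes the default column layout (`NAME READY STATUS RESTARTS AGE`) and that you saved the command output to a local file. The function name `flag_restarting_pods` and the file name `pods.txt` are illustrative and aren't part of the Operator Nexus tooling.

```shell
# Hypothetical helper: reads `kubectl get pods` text on stdin and prints
# NAME, STATUS, and RESTARTS for pods that look unhealthy.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
flag_restarting_pods() {
  awk '$1 != "NAME" && ($3 == "OOMKilled" || $3 == "CrashLoopBackOff" || $4 + 0 > 0) {
    print $1, $3, $4 + 0
  }'
}

# Example (pods.txt is an assumed file that holds the saved command output):
# flag_restarting_pods < pods.txt
```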

After you identify an affected pod, run a `describe pod` command to check its status and restart count.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl describe',arguments:[pod,<podName>,-n,nc-system]}]" \
  --resource-group "<cluster_MRG>" \
  --subscription "<subscription>"
```
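The `describe pod` output is long, so a small filter can help surface the lines that matter here. The sketch below assumes the usual `kubectl describe pod` text layout, where each container lists `State`, `Last State`, `Reason`, `Exit Code`, and `Restart Count` fields, and that the output was saved locally. The function name `summarize_pod_restarts` and the file name `describe.txt` are hypothetical.

```shell
# Hypothetical helper: reads `kubectl describe pod` text on stdin and prints
# only the container state lines that are useful for OOM triage.
summarize_pod_restarts() {
  awk '/^ *(State|Last State|Reason|Exit Code|Restart Count):/ {
    $1 = $1   # rebuild the record so leading and repeated spaces collapse
    print
  }'
}

# Example (describe.txt is an assumed file that holds the saved command output):
# summarize_pod_restarts < describe.txt
```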

A `get events` command also provides a history that shows how often the pod restarts.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[events,-n,nc-system,|,grep,<podName>]}]" \
  --resource-group "<cluster_MRG>" \
  --subscription "<subscription>"
```

The data from these commands identifies whether a pod is restarting because it was `OOMKilled`.

## Patching memory limits

Raise a Microsoft support request for all memory limit adjustments.

> [!WARNING]
> Memory limit patches to a pod aren't permanent and can be overwritten if the pod restarts.

## Confirm memory limit changes

After the memory limits change, the pods should return to a `Ready` state and stop restarting.

Use the following commands to confirm the behavior.

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[pods,-n,nc-system]}]" \
  --resource-group "<cluster_MRG>" \
  --subscription "<subscription>"
```

```azcli
az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl describe',arguments:[pod,<podName>,-n,nc-system]}]" \
  --resource-group "<cluster_MRG>" \
  --subscription "<subscription>"
```
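One way to check that restarts have stopped is to compare two `kubectl get pods` snapshots taken some time apart. The sketch below assumes both snapshots were saved as local text files with the default column layout, and it prints each pod whose restart count grew between them, so empty output suggests the pods are stable. The function name `restarts_increased` and the file names are hypothetical.

```shell
# Hypothetical helper: compares two saved `kubectl get pods` outputs.
# Usage: restarts_increased <earlier-snapshot> <later-snapshot>
# Prints "NAME <earlier count> <later count>" for pods that restarted again.
restarts_increased() {
  awk '
    $1 == "NAME" { next }                     # skip header rows
    NR == FNR { before[$1] = $4 + 0; next }   # first file: remember counts
    ($1 in before) && ($4 + 0 > before[$1]) { print $1, before[$1], $4 + 0 }
  ' "$1" "$2"
}

# Example (both file names are assumptions):
# restarts_increased pods-before.txt pods-after.txt
```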

## Known services susceptible to OOM issues

* cdi-operator
* vulnerability-operator
* cluster-metadata-operator
