|
| 1 | +--- |
| 2 | +title: Troubleshoot container memory limits |
| 3 | +description: Troubleshooting Kubernetes container limits |
| 4 | +ms.service: azure-operator-nexus |
| 5 | +ms.custom: troubleshooting |
| 6 | +ms.topic: troubleshooting |
| 7 | +ms.date: 11/01/2024 |
| 8 | +ms.author: matthewernst |
| 9 | +author: matternst7258 |
| 10 | +--- |
| 11 | + |
| 12 | +# Troubleshoot container memory limits |
| 13 | + |
| 14 | +## Alerting for memory limits |
| 15 | + |
| 16 | +Set up alerts on the Operator Nexus cluster for Kubernetes pods that restart because of OOMKill errors. These alerts let customers know when a component on a server isn't working properly. |
| 17 | + |
| 18 | +Metrics exposed to identify memory limits: |
| 19 | + |
| 20 | +| Metric | Metric identifier | |
| 21 | +| ------------------------------------ | ------------------------------------------------ | |
| 22 | +| Container Restarts | `kube_pod_container_status_restarts_total` | |
| 23 | +| Container Status Terminated Reason | `kube_pod_container_status_terminated_reason` | |
| 24 | +| Container Resource Limits | `kube_pod_container_resource_limits` | |
| 25 | + |
| 26 | +`Container Status Terminated Reason` reports the termination reason, such as `OOMKilled`, for impacted pods. |
| 27 | + |
| 28 | +## Identifying Out of Memory (OOM) pods |
| 29 | + |
| 30 | +Start by identifying any components that are restarting or show OOMKill. |
| 31 | + |
| 32 | +```azcli |
| 33 | +az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \ |
| 34 | + --limit-time-seconds 60 \ |
| 35 | + --commands "[{command:'kubectl get',arguments:[pods,-n,nc-system]}]" \ |
| 36 | + --resource-group "<cluster_MRG>" \ |
| 37 | + --subscription "<subscription>" |
| 38 | +``` |
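
As a rough sketch, and assuming the pod listing returned by the command above is available as plain text, a small filter can narrow the list to pods worth investigating. The function name `list_restarting_pods` is hypothetical, and the filter assumes the default `kubectl get pods` column order (`NAME READY STATUS RESTARTS AGE`).

```shell
# Hypothetical helper (not part of the az or kubectl CLIs).
# Reads `kubectl get pods` text on stdin and prints name, status, and restart
# count for pods that restarted at least once or show an OOMKilled status.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
list_restarting_pods() {
  awk 'NF >= 4 && ($4 + 0 > 0 || $3 == "OOMKilled") { print $1, $3, $4 }'
}
```

For example, `list_restarting_pods < pods.txt`, where `pods.txt` is an assumed file that holds the saved pod listing.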
| 39 | + |
| 40 | +Once a pod is identified, a `describe pod` command shows its status and restart count. |
| 41 | + |
| 42 | +```azcli |
| 43 | +az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \ |
| 44 | + --limit-time-seconds 60 \ |
| 45 | + --commands "[{command:'kubectl describe',arguments:[pod,<podName>,-n,nc-system]}]" \ |
| 46 | + --resource-group "<cluster_MRG>" \ |
| 47 | + --subscription "<subscription>" |
| 48 | +``` |
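
The `describe pod` output is long, so it can help to keep only the lines that relate to terminations. The following is a minimal sketch, assuming the output has been saved as text. The function name `show_termination_details` is hypothetical, and the field labels it matches (`Reason:`, `Exit Code:`, `Restart Count:`) are the ones `kubectl describe pod` normally prints for each container.

```shell
# Hypothetical helper (not part of the az or kubectl CLIs).
# Reads `kubectl describe pod` text on stdin and keeps only the lines that
# show why a container last terminated and how often it has restarted.
show_termination_details() {
  grep -E 'Reason:|Exit Code:|Restart Count:'
}
```

A `Reason:` line that reads `OOMKilled`, together with a growing `Restart Count:`, suggests the container is hitting its memory limit.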
| 49 | + |
| 50 | +Additionally, a `get events` command provides history that shows how frequently the pod restarts. |
| 51 | + |
| 52 | +```azcli |
| 53 | +az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \ |
| 54 | + --limit-time-seconds 60 \ |
| 55 | + --commands "[{command:'kubectl get',arguments:[events,-n,nc-system,|,grep,<podName>]}]" \ |
| 56 | + --resource-group "<cluster_MRG>" \ |
| 57 | + --subscription "<subscription>" |
| 58 | +``` |
| 59 | + |
| 60 | +The data from these commands identifies whether a pod is restarting because of `OOMKill`. |
| 61 | + |
| 62 | +## Patching memory limits |
| 63 | + |
| 64 | +To change or adjust memory limits, raise a Microsoft support request. |
| 65 | + |
| 66 | +> [!WARNING] |
| 67 | +> Memory limit patches applied to a pod aren't permanent and can be overwritten if the pod restarts. |
| 68 | + |
| 69 | +## Confirm memory limit changes |
| 70 | + |
| 71 | +After the memory limits change, the pods should return to the `Ready` state and stop restarting. |
| 72 | + |
| 73 | +The following commands can be used to confirm the behavior. |
| 74 | + |
| 75 | +```azcli |
| 76 | +az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \ |
| 77 | + --limit-time-seconds 60 \ |
| 78 | + --commands "[{command:'kubectl get',arguments:[pods,-n,nc-system]}]" \ |
| 79 | + --resource-group "<cluster_MRG>" \ |
| 80 | + --subscription "<subscription>" |
| 81 | +``` |
| 82 | + |
| 83 | +```azcli |
| 84 | +az networkcloud baremetalmachine run-read-command --name "<bareMetalMachineName>" \ |
| 85 | + --limit-time-seconds 60 \ |
| 86 | + --commands "[{command:'kubectl describe',arguments:[pod,<podName>,-n,nc-system]}]" \ |
| 87 | + --resource-group "<cluster_MRG>" \ |
| 88 | + --subscription "<subscription>" |
| 89 | +``` |
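
To confirm the result at a glance, the pod listing can be filtered for pods that aren't fully ready. This is only a sketch under the same assumptions as before: the listing is available as plain text in the default `kubectl get pods` column order, and `list_not_ready_pods` is a hypothetical name.

```shell
# Hypothetical helper (not part of the az or kubectl CLIs).
# Reads `kubectl get pods` text on stdin and prints pods whose READY column
# (for example 0/1) shows fewer ready containers than expected.
# Note: pods from completed jobs (0/1 Completed) are also listed.
list_not_ready_pods() {
  awk '$1 != "NAME" && split($2, ready, "/") == 2 && ready[1] != ready[2] { print $1, $2, $3 }'
}
```

Empty output indicates that every listed pod reports all of its containers as ready.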
| 90 | + |
| 91 | +## Known services susceptible to OOM issues |
| 92 | + |
| 93 | +* cdi-operator |
| 94 | +* vulnerability-operator |
| 95 | +* cluster-metadata-operator |