articles/azure-monitor/containers/kubernetes-metric-alerts.md
---
title: Recommended alert rules for Kubernetes clusters
description: Describes how to enable recommended metric alert rules for a Kubernetes cluster in Azure Monitor.
ms.topic: conceptual
ms.date: 06/17/2024
ms.reviewer: vdiec
---

# Recommended alert rules for Kubernetes clusters

There are two types of metric alert rules used with Kubernetes clusters.

## Enable recommended alert rules

Use one of the following methods to enable the recommended alert rules for your cluster. You can enable both Prometheus and platform metric alert rules for the same cluster.

> [!NOTE]
> To enable recommended alerts on Arc-enabled Kubernetes clusters, ARM templates are the only supported method.

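For a sense of what an ARM-based deployment looks like, the sketch below shows the general shape of a `Microsoft.AlertsManagement/prometheusRuleGroups` resource. The resource name, region, placeholder IDs, and the single illustrative rule are assumptions for illustration only, not the exact rule definitions that Azure Monitor ships:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.AlertsManagement/prometheusRuleGroups",
      "apiVersion": "2023-03-01",
      "name": "RecommendedAlerts-contoso-cluster",
      "location": "eastus",
      "properties": {
        "scopes": [
          "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Monitor/accounts/<azure-monitor-workspace>"
        ],
        "clusterName": "contoso-cluster",
        "interval": "PT5M",
        "rules": [
          {
            "alert": "KubeContainerOOMKilledCount",
            "expression": "sum by (cluster, container, namespace) (kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"}) > 0",
            "for": "PT5M",
            "severity": 4
          }
        ]
      }
    }
  ]
}
```

The `expression` is a hypothetical PromQL query built on the kube-state-metrics metric `kube_pod_container_status_last_terminated_reason`; consult the published templates for the exact expressions used by the recommended rules.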
### [Azure portal](#tab/portal)
When you use the Azure portal, the Prometheus rule group is created in the same region as the cluster.

The following tables list the details of each recommended alert rule.

| Alert name | Description | Threshold | Period (minutes) |
|:---|:---|:---:|:---:|
| KubeCPUQuotaOvercommit | The CPU resource quota allocated to namespaces exceeds the available CPU resources on the cluster's nodes by more than 50% for the last 5 minutes. | >1.5 | 5 |
| KubeMemoryQuotaOvercommit | The memory resource quota allocated to namespaces exceeds the available memory resources on the cluster's nodes by more than 50% for the last 5 minutes. | >1.5 | 5 |
| KubeContainerOOMKilledCount | One or more containers within pods have been killed due to out-of-memory (OOM) events for the last 5 minutes. | >0 | 5 |
| KubeClientErrors | The rate of client errors (HTTP status codes starting with 5xx) in Kubernetes API requests exceeds 1% of the total API request rate for the last 15 minutes. | >0.01 | 15 |
| KubePersistentVolumeFillingUp | The persistent volume is filling up and is expected to run out of available space, evaluated on the available space ratio, used space, and the predicted linear trend of available space over the last 6 hours. These conditions are evaluated over the last 60 minutes. | N/A | 60 |
| KubePersistentVolumeInodesFillingUp | Less than 3% of the inodes within a persistent volume are available for the last 15 minutes. | <0.03 | 15 |
| KubePVUsageHigh | The average usage of Persistent Volumes (PVs) on pod exceeds 80% for the last 15 minutes. | >0.8 | 15 |
| KubeDeploymentReplicasMismatch | There is a mismatch between the desired number of replicas and the number of available replicas for the last 10 minutes. | N/A | 10 |
| KubeStatefulSetReplicasMismatch | The number of ready replicas in the StatefulSet does not match the total number of replicas in the StatefulSet for the last 15 minutes. | N/A | 15 |
| KubeHpaReplicasMismatch | The Horizontal Pod Autoscaler in the cluster has not matched the desired number of replicas for the last 15 minutes. | N/A | 15 |
| KubeHpaMaxedOut | The Horizontal Pod Autoscaler (HPA) in the cluster has been running at the maximum replicas for the last 15 minutes. | N/A | 15 |
| KubePodCrashLooping | One or more pods is in a CrashLoopBackOff condition, where the pod continuously crashes after startup and fails to recover successfully for the last 15 minutes. | >=1 | 15 |
| KubeJobStale | At least one Job instance did not complete successfully for the last 6 hours. | >0 | 360 |
| KubePodContainerRestart | One or more containers within pods in the Kubernetes cluster have been restarted at least once within the last hour. | >0 | 15 |
| KubePodReadyStateLow | The percentage of pods in a ready state falls below 80% for any deployment or daemonset in the Kubernetes cluster for the last 5 minutes. | <0.8 | 5 |
| KubePodFailedState | One or more pods is in a failed state for the last 5 minutes. | >0 | 5 |
| KubePodNotReadyByController | One or more pods are not in a ready state (i.e., in the "Pending" or "Unknown" phase) for the last 15 minutes. | >0 | 15 |
| KubeStatefulSetGenerationMismatch | The observed generation of a Kubernetes StatefulSet does not match its metadata generation for the last 15 minutes. | N/A | 15 |
| KubeJobFailed | One or more Kubernetes jobs have failed within the last 15 minutes. | >0 | 15 |
| KubeContainerAverageCPUHigh | The average CPU usage per container exceeds 95% for the last 5 minutes. | >0.95 | 5 |
| KubeContainerAverageMemoryHigh | The average memory usage per container exceeds 95% for the last 5 minutes. | >0.95 | 10 |
| KubeletPodStartUpLatencyHigh | The 99th percentile of the pod startup latency exceeds 60 seconds for the last 10 minutes. | >60 | 10 |
| Node CPU percentage is greater than 95% | The node CPU percentage is greater than 95% for the last 5 minutes. | 95 | 5 |
| Node memory working set percentage is greater than 100% | The node memory working set percentage is greater than 100% for the last 5 minutes. | 100 | 5 |
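To connect a table row back to an underlying Prometheus rule: a row such as KubeCPUQuotaOvercommit (threshold >1.5, period 5 minutes) corresponds to a PromQL expression evaluated over the stated period. The sketch below is modeled on the upstream kubernetes-mixin rule of the same name and uses real kube-state-metrics metric names, but it is an assumption for illustration and may differ from the exact expression Azure Monitor deploys:

```json
{
  "alert": "KubeCPUQuotaOvercommit",
  "expression": "sum(min without(resource) (kube_resourcequota{type=\"hard\", resource=~\"(cpu|requests.cpu)\"})) / sum(kube_node_status_allocatable{resource=\"cpu\"}) > 1.5",
  "for": "PT5M",
  "severity": 3
}
```

Here the `for` duration expresses the 5-minute period from the table, and the `> 1.5` comparison expresses the threshold (quota allocated exceeding allocatable CPU by more than 50%).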