Commit 25781f2

TheRealNoob and skl authored
feat(alerts): KubeNodePressure and KubeNodeEviction (#1014)
* feat: create alert "KubeletEvictingPods"
* fix syntax
* move to resources.libsonnet
* add selector filter
* move {{cluster}} injection
* redo alerts
* update runbook
* add tests
* fix tests
* fix "smelly selector" syntax preference. This turned out to be a good change because it made me realize there was an additional label value here that wasn't being handled.
* Update alerts/kubelet.libsonnet
* Update alerts/kubelet.libsonnet
* add test KubeNodePressure
* chore: make --always-make markdownfmt
* rename KubeNodeEviction, fix test case
* remove in-progress change
* Update kubelet.libsonnet: update KubeNodeEviction query
* Update tests.yaml: update KubeNodeEviction test case
* update test KubeNodeEviction

Signed-off-by: TheRealNoob <[email protected]>
Co-authored-by: Stephen Lang <[email protected]>
1 parent 4ff562d commit 25781f2

File tree

3 files changed: +113 -1 lines changed

alerts/kubelet.libsonnet

Lines changed: 42 additions & 0 deletions

```diff
@@ -12,6 +12,9 @@ local utils = import '../lib/utils.libsonnet';
 
     kubeletCertExpirationWarningSeconds: 7 * 24 * 3600,
     kubeletCertExpirationCriticalSeconds: 1 * 24 * 3600,
+
+    // Evictions per second that will trigger an alert. The default value will trigger on any evictions.
+    KubeNodeEvictionRateThreshold: 0.0,
   },
 
   prometheusAlerts+:: {
@@ -37,6 +40,24 @@ local utils = import '../lib/utils.libsonnet';
           'for': '15m',
           alert: 'KubeNodeNotReady',
         },
+        {
+          alert: 'KubeNodePressure',
+          expr: |||
+            kube_node_status_condition{%(kubeStateMetricsSelector)s,condition=~"(MemoryPressure|DiskPressure|PIDPressure)",status="true"} == 1
+            and on (%(clusterLabel)s, node)
+            kube_node_spec_unschedulable{%(kubeStateMetricsSelector)s} == 0
+          ||| % $._config,
+          labels: {
+            severity: 'info',
+          },
+          'for': '10m',
+          annotations: {
+            description: '{{ $labels.node }}%s has active Condition {{ $labels.condition }}. This is caused by resource usage exceeding eviction thresholds.' % [
+              utils.ifShowMultiCluster($._config, ' on cluster {{ $labels.%(clusterLabel)s }}' % $._config),
+            ],
+            summary: 'Node has as active Condition.',
+          },
+        },
         {
           expr: |||
             (kube_node_spec_taint{%(kubeStateMetricsSelector)s,key="node.kubernetes.io/unreachable",effect="NoSchedule"} unless ignoring(key,value) kube_node_spec_taint{%(kubeStateMetricsSelector)s,key=~"%(kubeNodeUnreachableIgnoreKeys)s"}) == 1
@@ -103,6 +124,27 @@ local utils = import '../lib/utils.libsonnet';
           summary: 'Node readiness status is flapping.',
         },
       },
+      {
+        alert: 'KubeNodeEviction',
+        expr: |||
+          sum(rate(kubelet_evictions{%(kubeletSelector)s}[15m])) by(%(clusterLabel)s, eviction_signal, instance)
+          * on (%(clusterLabel)s, instance) group_left(node)
+          max by (%(clusterLabel)s, instance, node) (
+            kubelet_node_name{%(kubeletSelector)s}
+          )
+          > %(KubeNodeEvictionRateThreshold)s
+        ||| % $._config,
+        labels: {
+          severity: 'info',
+        },
+        'for': '0s',
+        annotations: {
+          description: 'Node {{ $labels.node }}%s is evicting Pods due to {{ $labels.eviction_signal }}. Eviction occurs when eviction thresholds are crossed, typically caused by Pods exceeding RAM/ephemeral-storage limits.' % [
+            utils.ifShowMultiCluster($._config, ' on {{ $labels.%(clusterLabel)s }}' % $._config),
+          ],
+          summary: 'Node is evicting pods.',
+        },
+      },
       {
         alert: 'KubeletPlegDurationHigh',
         expr: |||
```
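The new `KubeNodeEvictionRateThreshold` knob defaults to `0.0`, so any eviction fires the alert. Consumers who only want to be alerted on sustained eviction activity can raise it through `_config`; a minimal sketch, assuming a vendored import path that will vary by setup:

```jsonnet
// Sketch only: overriding the new eviction threshold from a consuming project.
// The import path below is an assumption; adjust it to match your vendoring.
local kubernetesMixin = import 'github.com/kubernetes-monitoring/kubernetes-mixin/mixin.libsonnet';

kubernetesMixin {
  _config+:: {
    // Fire only above roughly one eviction per five minutes,
    // instead of on any eviction at all.
    KubeNodeEvictionRateThreshold: 1 / 300,
  },
}
```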

runbook.md

Lines changed: 11 additions & 1 deletion

```diff
@@ -182,10 +182,15 @@ This page collects this repositories alerts and begins the process of describing
 ### Group Name: "kubernetes-system"
 
 ##### Alert Name: "KubeNodeNotReady"
-+ *Message*: `{{ $labels.node }} has been unready for more than 15 minutes."`
++ *Message*: `{{ $labels.node }} has been unready for more than 15 minutes.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready/)
 
+##### Alert Name: "KubeNodePressure"
++ *Message*: `{{ $labels.node }} has active Condition {{ $labels.condition }}. This is caused by resource usage exceeding eviction thresholds.`
++ *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodepressure/)
+
 ##### Alert Name: "KubeNodeUnreachable"
 + *Message*: `{{ $labels.node }} is unreachable and some workloads may be rescheduled.`
 + *Severity*: warning
@@ -201,6 +206,11 @@ This page collects this repositories alerts and begins the process of describing
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping/)
 
+##### Alert Name: "KubeNodeEviction"
++ *Message*: `Node {{ $labels.node }} is evicting Pods due to {{ $labels.eviction_signal }}. Eviction occurs when eviction thresholds are crossed, typically caused by Pods exceeding RAM/ephemeral-storage limits.`
++ *Severity*: info
++ *Runbook*: [Link](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeeviction)
+
 ##### Alert Name: "KubeletPlegDurationHigh"
 + *Message*: `The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.`
 + *Severity*: warning
```

tests/tests.yaml

Lines changed: 60 additions & 0 deletions

```diff
@@ -626,6 +626,43 @@ tests:
               description: 'minikube has been unready for more than 15 minutes.'
               runbook_url: 'https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready'
 
+  - interval: 1m
+    input_series:
+      # node=minikube is uncordoned so we expect the alert to fire
+      - series: 'kube_node_status_condition{condition="MemoryPressure",status="true",endpoint="https-main",cluster="kubernetes",instance="10.0.2.15:10250",job="kube-state-metrics",namespace="monitoring",node="minikube",pod="kube-state-metrics-b894d84cc-d6htw",service="kube-state-metrics"}'
+        values: '0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1'
+      - series: 'kube_node_spec_unschedulable{endpoint="https-main",cluster="kubernetes",instance="10.0.2.15:10250",job="kube-state-metrics",namespace="monitoring",node="minikube",pod="kube-state-metrics-b894d84cc-d6htw",service="kube-state-metrics"}'
+        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
+      # node=minikube2 is cordoned so we expect the alert to not fire
+      - series: 'kube_node_status_condition{condition="MemoryPressure",status="true",endpoint="https-main",cluster="kubernetes",instance="10.0.2.15:10250",job="kube-state-metrics",namespace="monitoring",node="minikube2",pod="kube-state-metrics-b894d84cc-d6htw",service="kube-state-metrics"}'
+        values: '0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1'
+      - series: 'kube_node_spec_unschedulable{endpoint="https-main",cluster="kubernetes",instance="10.0.2.15:10250",job="kube-state-metrics",namespace="monitoring",node="minikube2",pod="kube-state-metrics-b894d84cc-d6htw",service="kube-state-metrics"}'
+        values: '1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1'
+    alert_rule_test:
+      # does not meet 'for: 10m' threshold
+      - eval_time: 12m
+        alertname: KubeNodePressure
+        exp_alerts: []
+      - eval_time: 15m
+        alertname: KubeNodePressure
+        exp_alerts:
+          - exp_labels:
+              condition: MemoryPressure
+              status: "true"
+              cluster: kubernetes
+              node: minikube
+              severity: info
+              endpoint: https-main
+              instance: 10.0.2.15:10250
+              job: kube-state-metrics
+              namespace: monitoring
+              pod: kube-state-metrics-b894d84cc-d6htw
+              service: kube-state-metrics
+            exp_annotations:
+              summary: "Node has as active Condition."
+              description: 'minikube has active Condition MemoryPressure. This is caused by resource usage exceeding eviction thresholds.'
+              runbook_url: 'https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodepressure'
+
   - interval: 1m
     input_series:
       # node=minikube is uncordoned so we expect the alert to fire
@@ -651,6 +688,29 @@ tests:
               description: 'The readiness status of node minikube has changed 9 times in the last 15 minutes.'
               runbook_url: 'https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping'
 
+  - interval: 1m
+    input_series:
+      # This metric only appears with non-zero values, meaning the transition from 0 --> 1 doesn't trigger an alert.
+      # However, since that's undesired behavior it'd be kinda pointless to test for it, even though it's expected.
+      - series: 'kubelet_evictions{eviction_signal="memory.available",endpoint="https-main",cluster="kubernetes",instance="10.0.2.15:10250",job="kubelet",namespace="monitoring",node="minikube",pod="kube-state-metrics-b894d84cc-d6htw",service="kube-state-metrics"}'
+        values: '1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3'
+      - series: 'kubelet_node_name{cluster="kubernetes", instance="10.0.2.15:10250", node="minikube", job="kubelet"}'
+        values: '1x20'
+    alert_rule_test:
+      - eval_time: 18m
+        alertname: KubeNodeEviction
+        exp_alerts:
+          - exp_labels:
+              eviction_signal: memory.available
+              cluster: kubernetes
+              node: minikube
+              instance: 10.0.2.15:10250
+              severity: info
+            exp_annotations:
+              summary: "Node is evicting pods."
+              description: 'Node minikube is evicting Pods due to memory.available. Eviction occurs when eviction thresholds are crossed, typically caused by Pods exceeding RAM/ephemeral-storage limits.'
+              runbook_url: 'https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeeviction'
+
   # Verify that node:node_num_cpu:sum triggers no many-to-many errors.
   - interval: 1m
     input_series:
```
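To see why the KubeNodeEviction case fires at `eval_time: 18m`, note that `rate(...[15m])` looks at the counter samples from the preceding 15 minutes, over which `kubelet_evictions` climbs from 1 to 3. A rough sketch of that arithmetic (simplified: real Prometheus `rate()` also extrapolates to the edges of the lookback window):

```python
# Rough sketch of the rate() arithmetic behind the KubeNodeEviction test case.
# 'samples' is the kubelet_evictions counter from the test, one sample per minute.
samples = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]

def simple_rate(values, eval_minute, window_minutes, step_seconds=60):
    """Per-second counter increase over [eval - window, eval],
    ignoring Prometheus' boundary extrapolation for simplicity."""
    window = values[eval_minute - window_minutes : eval_minute + 1]
    span_seconds = (len(window) - 1) * step_seconds
    return (window[-1] - window[0]) / span_seconds

# At eval_time=18m the 15m window saw the counter rise from 1 to 3,
# giving 2 / 900 evictions per second, above the default 0.0 threshold.
print(simple_rate(samples, 18, 15))
```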
