bugfix: refactor alerts to accommodate single-node clusters
For the sake of brevity, let:

Q:  kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocatable), and
QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested).

Both quota alerts relevant here have the form:

sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0
and
(sum(Q) by (cluster) - max(Q) by (cluster)) > 0

On a single-node cluster, sum(Q) by (cluster) = max(Q) by (cluster), so the
left-hand operand reduces to sum(QQ) by (cluster) > 0, which is satisfied
whenever *any* resource requests exist, while the guard on the right reduces
to 0 > 0, which never holds. Either way, the expression is meaningless for
such clusters.

To address this, drop the "max(Q) by (cluster)" buffer assumed for non-SNO
clusters, reducing the SNO expression to:

sum(QQ) by (cluster) - sum(Q) by (cluster) > 0

i.e., alert only when total requested exceeds total allocatable. Since there
is only a single node, a buffer of that sort does not make sense.
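As a worked illustration (hypothetical numbers, not taken from any real
cluster): on a single-node cluster whose node has 4 allocatable CPUs,
sum(Q) = max(Q) = 4, so the buffer term is zero:

    # Old form on SNO: the guard filters out every sample.
    sum(QQ) by (cluster) - (4 - 4) > 0    # reduces to sum(QQ) > 0
    and
    (4 - 4) > 0                           # 0 > 0 never holds, so no sample survives

    # New SNO branch: compare requests directly against capacity.
    sum(QQ) by (cluster) - 4 > 0          # 2 CPUs requested: no alert; 5 requested: alert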
Signed-off-by: Pranshu Srivastava <[email protected]>
alerts/resource_alerts.libsonnet (+47 -16)
@@ -36,18 +36,34 @@ local utils = import '../lib/utils.libsonnet';
         } +
         if $._config.showMultiCluster then {
           expr: |||
-            sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+            (count(kube_node_info) == 1
             and
-            (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+            sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+            sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
+            or
+            (sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+            (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) -
+            max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+            and
+            (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) -
+            max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0)
           ||| % $._config,
           annotations+: {
             description: 'Cluster {{ $labels.%(clusterLabel)s }} has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.' % $._config,
             ...
             description: 'Cluster has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.' % $._config,
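For readers unpicking the Jsonnet templating, the new CPU expression renders
roughly as below once the %-substitution is applied. This is a sketch assuming
the mixin defaults of clusterLabel: 'cluster', kubeStateMetricsSelector:
'job="kube-state-metrics"', and an empty ignoringOverprovisionedWorkloadSelector;
verify against your config.libsonnet:

    (count(kube_node_info) == 1
    and
    sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster) -
    sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) > 0)
    or
    (sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster) -
    (sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) -
    max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0
    and
    (sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) -
    max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0)

The first alternation handles the single-node case; the second is the original
buffered check for multi-node clusters.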
@@ -65,24 +81,39 @@ local utils = import '../lib/utils.libsonnet';
         } +
         if $._config.showMultiCluster then {
           expr: |||
-            sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0
+            (count(kube_node_info) == 1
             and
-            (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0
+            sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+            sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) > 0)
+            or
+            (sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+            (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) -
+            max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0
+            and
+            (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) -
+            max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0)
           ||| % $._config,
           annotations+: {
             description: 'Cluster {{ $labels.%(clusterLabel)s }} has overcommitted memory resource requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node failure.' % $._config,
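To sanity-check the new single-node branch on a live cluster, the constituent
quantities can be evaluated by hand in the Prometheus expression browser
(metric names as used in the diff above; selectors may differ per setup):

    count(kube_node_info)                               # 1 on a single-node cluster
    sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) by (cluster)
    sum(namespace_memory:kube_pod_container_resource_requests:sum{}) by (cluster)

The memory alert's SNO branch should fire only when the last quantity (total
requested) exceeds the second (total allocatable).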