Commit ab4cb2b

rexagod and skl authored
bugfix: refactor alerts to accommodate for non-HA clusters (#1010)
* bugfix: refactor alerts to accommodate for single-node clusters

  For the sake of brevity, let:

    Q:  kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocatable), and
    QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested).

  Both quota alerts relevant here exist in the form:

    sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0
    and
    (sum(Q) by (cluster) - max(Q) by (cluster)) > 0

  which, in the case of a single-node cluster (where sum(Q) by (cluster) = max(Q) by (cluster)), reduces to:

    sum(QQ) by (cluster) > 0

  i.e., the alert will fire if *any* resource requests exist. To address this, drop the "max(Q) by (cluster)" buffer assumed for non-SNO clusters from SNO, reducing the expression to:

    sum(QQ) by (cluster) - sum(Q) by (cluster) > 0

  (total requested - total allocatable > 0 triggers the alert), since there is only a single node, so a buffer of that sort does not make sense.

  Signed-off-by: Pranshu Srivastava <[email protected]>

* fixup! bugfix: refactor alerts to accommodate for single-node clusters

* test: Add tests for KubeCPUOvercommit and KubeMemoryOvercommit

* `s/kube_node_info/kube_node_role`

  Use kube_node_role{role="control-plane"} (see [1] and [2]) to estimate whether the cluster is HA or not.

  [1]: https://github.com/search?q=repo%3Akubernetes%2Fkube-state-metrics%20kube_node_role&type=code
  [2]: https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane

  Also drop any thresholds, as they would lead to false positives.

* follow-up: add test cases for non-HA (two-node)

  Signed-off-by: Pranshu Srivastava <[email protected]>

---------

Signed-off-by: Pranshu Srivastava <[email protected]>
Co-authored-by: Stephen Lang <[email protected]>
1 parent eb4aea6 commit ab4cb2b
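The algebra in the commit message is easy to sanity-check outside PromQL. A minimal Python sketch, using the commit message's `Q`/`QQ` naming with purely illustrative numbers:

```python
# Q:  per-node allocatable CPU (kube_node_status_allocatable)
# QQ: cluster-wide requested CPU (namespace_cpu:kube_pod_container_resource_requests:sum)
# The numbers below are hypothetical, not taken from any real cluster.

def buffer(q_per_node):
    # sum(Q) - max(Q): total capacity minus the largest node,
    # i.e. the capacity that survives a single-node failure.
    return sum(q_per_node) - max(q_per_node)

def old_expr(q_per_node, qq_total):
    # Old alert body: sum(QQ) - (sum(Q) - max(Q)) > 0
    return qq_total - buffer(q_per_node) > 0

def new_sno_expr(q_per_node, qq_total):
    # New single-node form: sum(QQ) - sum(Q) > 0 (no failure buffer).
    return qq_total - sum(q_per_node) > 0

single = [1.9]  # one node, ~1.9 allocatable CPU
assert buffer(single) == 0            # sum(Q) == max(Q) on a single node
assert old_expr(single, 0.1)          # reduces to sum(QQ) > 0: any request fires
assert not new_sno_expr(single, 0.1)  # 0.1 requested vs 1.9 allocatable: quiet
assert new_sno_expr(single, 2.0)      # genuinely overcommitted: fires
```

The sketch only restates the commit message's reduction; the real alerts evaluate these comparisons per `cluster` label in PromQL.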

File tree

2 files changed: +214 −16 lines changed

alerts/resource_alerts.libsonnet

Lines changed: 47 additions & 16 deletions
@@ -36,18 +36,34 @@ local utils = import '../lib/utils.libsonnet';
       } +
       if $._config.showMultiCluster then {
         expr: |||
-          sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+          (sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+          sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0
           and
-          (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+          count by (%(clusterLabel)s) (max by (%(clusterLabel)s, node) (kube_node_role{%(kubeStateMetricsSelector)s, role="control-plane"})) < 3)
+          or
+          (sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+          (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) -
+          max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+          and
+          (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) -
+          max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0)
         ||| % $._config,
         annotations+: {
           description: 'Cluster {{ $labels.%(clusterLabel)s }} has overcommitted CPU resource requests for Pods by {{ printf "%%.2f" $value }} CPU shares and cannot tolerate node failure.' % $._config,
         },
       } else {
         expr: |||
-          sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) - (sum(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s}) - max(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s})) > 0
+          (sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) -
+          sum(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s}) > 0
+          and
+          count(max by (node) (kube_node_role{%(kubeStateMetricsSelector)s, role="control-plane"})) < 3)
+          or
+          (sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) -
+          (sum(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s}) -
+          max(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s})) > 0
           and
-          (sum(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s}) - max(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s})) > 0
+          (sum(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s}) -
+          max(kube_node_status_allocatable{resource="cpu", %(kubeStateMetricsSelector)s})) > 0)
         ||| % $._config,
         annotations+: {
           description: 'Cluster has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.' % $._config,
@@ -65,24 +81,39 @@ local utils = import '../lib/utils.libsonnet';
       } +
       if $._config.showMultiCluster then {
         expr: |||
-          sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0
+          (sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+          sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) > 0
           and
-          (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0
+          count by (%(clusterLabel)s) (max by (%(clusterLabel)s, node) (kube_node_role{%(kubeStateMetricsSelector)s, role="control-plane"})) < 3)
+          or
+          (sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+          (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) -
+          max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0
+          and
+          (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s) -
+          max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) by (%(clusterLabel)s)) > 0)
         ||| % $._config,
         annotations+: {
           description: 'Cluster {{ $labels.%(clusterLabel)s }} has overcommitted memory resource requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node failure.' % $._config,
         },
-      } else
-      {
-        expr: |||
-          sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) - (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) - max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s})) > 0
-          and
-          (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) - max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s})) > 0
-        ||| % $._config,
-        annotations+: {
-          description: 'Cluster has overcommitted memory resource requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node failure.',
-        },
+      } else {
+        expr: |||
+          (sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) -
+          sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) > 0
+          and
+          count(max by (node) (kube_node_role{%(kubeStateMetricsSelector)s, role="control-plane"})) < 3)
+          or
+          (sum(namespace_memory:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) -
+          (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) -
+          max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s})) > 0
+          and
+          (sum(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s}) -
+          max(kube_node_status_allocatable{resource="memory", %(kubeStateMetricsSelector)s})) > 0)
+        ||| % $._config,
+        annotations+: {
+          description: 'Cluster has overcommitted memory resource requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node failure.',
         },
+      },
       {
         alert: 'KubeCPUQuotaOvercommit',
         labels: {
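Taken together, the refactored expressions pick one of two branches based on the estimated control-plane count. A hedged Python restatement of that selection logic (my own sketch, not the mixin's code; numbers chosen to mirror the unit tests in tests.yaml):

```python
def cpu_overcommitted(requested, allocatable, control_plane_nodes):
    # requested: cluster-wide CPU request sum; allocatable: per-node allocatable CPU.
    total = sum(allocatable)
    if control_plane_nodes < 3:
        # Non-HA branch: no node-failure buffer, compare against full capacity.
        return requested - total > 0
    # HA branch: keep headroom equal to the largest node, so the cluster
    # can still schedule everything after losing one node.
    headroom = total - max(allocatable)
    return requested - headroom > 0 and headroom > 0

# Two 1.9-CPU nodes, 4 CPU requested, non-HA -> fires.
assert cpu_overcommitted(4.0, [1.9, 1.9], control_plane_nodes=2)
# Three-node HA cluster keeps the buffer: 4 > 5.7 - 1.9 -> still fires.
assert cpu_overcommitted(4.0, [1.9, 1.9, 1.9], control_plane_nodes=3)
# Light load on the same HA cluster stays quiet.
assert not cpu_overcommitted(2.0, [1.9, 1.9, 1.9], control_plane_nodes=3)
```

The `< 3` cutoff mirrors the `count(max by (node) (kube_node_role{role="control-plane"})) < 3` term in the diff above; clusters with fewer than three control-plane nodes are treated as unable to tolerate node failure anyway, so the buffer is dropped.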

tests/tests.yaml

Lines changed: 167 additions & 0 deletions
@@ -1424,3 +1424,170 @@ tests:
               runbook_url: "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch"
               summary: "StatefulSet has not matched the expected number of replicas."
 
+  - name: KubeCPUOvercommit alert (single-node)
+    interval: 1m
+    input_series:
+      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="default"}'
+        values: '1x10'
+      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="kube-system"}'
+        values: '1x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n1", resource="cpu", job="kube-state-metrics"}'
+        values: '1.9x10' # This value was seen on a 2x vCPU node
+      - series: 'kube_node_role{cluster="kubernetes", node="n1", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+    alert_rule_test:
+      - eval_time: 9m
+        alertname: KubeCPUOvercommit
+      - eval_time: 10m
+        alertname: KubeCPUOvercommit
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+            exp_annotations:
+              description: Cluster has overcommitted CPU resource requests for Pods by 0.10000000000000009 CPU shares and cannot tolerate node failure.
+              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
+              summary: Cluster has overcommitted CPU resource requests.
+
+  - name: KubeCPUOvercommit alert (multi-node; non-HA)
+    interval: 1m
+    input_series:
+      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="default"}'
+        values: '2x10'
+      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="kube-system"}'
+        values: '2x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n1", resource="cpu", job="kube-state-metrics"}'
+        values: '1.9x10' # This value was seen on a 2x vCPU node
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n2", resource="cpu", job="kube-state-metrics"}'
+        values: '1.9x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n1", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n2", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+    alert_rule_test:
+      - eval_time: 9m
+        alertname: KubeCPUOvercommit
+      - eval_time: 10m
+        alertname: KubeCPUOvercommit
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+            exp_annotations:
+              description: Cluster has overcommitted CPU resource requests for Pods by 0.20000000000000018 CPU shares and cannot tolerate node failure.
+              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
+              summary: Cluster has overcommitted CPU resource requests.
+
+  - name: KubeCPUOvercommit alert (multi-node; HA)
+    interval: 1m
+    input_series:
+      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="default"}'
+        values: '2x10'
+      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="kube-system"}'
+        values: '2x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n1", resource="cpu", job="kube-state-metrics"}'
+        values: '1.9x10' # This value was seen on a 2x vCPU node
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n2", resource="cpu", job="kube-state-metrics"}'
+        values: '1.9x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n3", resource="cpu", job="kube-state-metrics"}'
+        values: '1.9x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n1", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n2", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n3", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+    alert_rule_test:
+      - eval_time: 9m
+        alertname: KubeCPUOvercommit
+      - eval_time: 10m
+        alertname: KubeCPUOvercommit
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+            exp_annotations:
+              description: Cluster has overcommitted CPU resource requests for Pods by 0.20000000000000062 CPU shares and cannot tolerate node failure.
+              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
+              summary: Cluster has overcommitted CPU resource requests.
+
+  - name: KubeMemoryOvercommit alert (single-node)
+    interval: 1m
+    input_series:
+      - series: 'namespace_memory:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="default"}'
+        values: '1000000000x10' # 1 GB
+      - series: 'namespace_memory:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="kube-system"}'
+        values: '1000000000x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n1", resource="memory", job="kube-state-metrics"}'
+        values: '1000000000x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n1", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+    alert_rule_test:
+      - eval_time: 9m
+        alertname: KubeMemoryOvercommit
+      - eval_time: 10m
+        alertname: KubeMemoryOvercommit
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+            exp_annotations:
+              description: Cluster has overcommitted memory resource requests for Pods by 1G bytes and cannot tolerate node failure.
+              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit
+              summary: Cluster has overcommitted memory resource requests.
+
+  - name: KubeMemoryOvercommit alert (multi-node; non-HA)
+    interval: 1m
+    input_series:
+      - series: 'namespace_memory:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="default"}'
+        values: '2000000000x10' # 2 GB
+      - series: 'namespace_memory:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="kube-system"}'
+        values: '2000000000x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n1", resource="memory", job="kube-state-metrics"}'
+        values: '1000000000x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n2", resource="memory", job="kube-state-metrics"}'
+        values: '1000000000x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n1", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n2", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+    alert_rule_test:
+      - eval_time: 9m
+        alertname: KubeMemoryOvercommit
+      - eval_time: 10m
+        alertname: KubeMemoryOvercommit
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+            exp_annotations:
+              description: Cluster has overcommitted memory resource requests for Pods by 2G bytes and cannot tolerate node failure.
+              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit
+              summary: Cluster has overcommitted memory resource requests.
+
+  - name: KubeMemoryOvercommit alert (multi-node; HA)
+    interval: 1m
+    input_series:
+      - series: 'namespace_memory:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="default"}'
+        values: '2000000000x10' # 2 GB
+      - series: 'namespace_memory:kube_pod_container_resource_requests:sum{cluster="kubernetes", namespace="kube-system"}'
+        values: '2000000000x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n1", resource="memory", job="kube-state-metrics"}'
+        values: '1000000000x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n2", resource="memory", job="kube-state-metrics"}'
+        values: '1000000000x10'
+      - series: 'kube_node_status_allocatable{cluster="kubernetes", node="n3", resource="memory", job="kube-state-metrics"}'
+        values: '1000000000x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n1", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n2", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+      - series: 'kube_node_role{cluster="kubernetes", node="n3", role="control-plane", job="kube-state-metrics"}'
+        values: '1x10'
+    alert_rule_test:
+      - eval_time: 9m
+        alertname: KubeMemoryOvercommit
+      - eval_time: 10m
+        alertname: KubeMemoryOvercommit
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+            exp_annotations:
+              description: Cluster has overcommitted memory resource requests for Pods by 2G bytes and cannot tolerate node failure.
+              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit
+              summary: Cluster has overcommitted memory resource requests.
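The single-node CPU case can be replayed by hand. A small sketch of the arithmetic behind its expected alert (promtool evaluates the real rule; this only restates the test's inputs):

```python
# Inputs from the "KubeCPUOvercommit alert (single-node)" test case above.
requests = {"default": 1.0, "kube-system": 1.0}  # CPU requests per namespace
allocatable = {"n1": 1.9}                         # allocatable CPU per node
control_plane_nodes = 1                           # one kube_node_role series

# Non-HA branch of the new expression: total requested minus total allocatable.
non_ha = control_plane_nodes < 3
overcommit = sum(requests.values()) - sum(allocatable.values())
fires = non_ha and overcommit > 0
assert fires  # 2.0 requested vs 1.9 allocatable: the alert is expected at 10m
```

Note the expected description values (e.g. 0.10000000000000009) are ordinary IEEE-754 float artifacts of this subtraction as Prometheus performs it, which is why the tests match them verbatim rather than rounding.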
