Skip to content

Commit 3830dfd

Browse files
authored
Fix KubeClientCertificateExpiration alerts (#941)
1) Change aggregation `by (le)` to `without(service,endpoint...)`, dropping only useless labels, but keeping external labels (like environment etc) intact. Otherwise they get dropped. 2) Change order of metrics in expression: `apiserver_client_certificate_expiration_seconds_bucket` metric comes first so actual expiration date is shown as result in Grafana->Explore queries, not `apiserver_client_certificate_expiration_seconds_count` value (which is quite useless). This make it easier to troubleshoot. 3) Finally, fix aggregation for `on(job)` to become `(job, cluster, instance)`. Otherwise, It would be enough to have just single instance with certificate expiration problem, and it would set all apiservers to 'firing' (false positive!).
1 parent c70f03d commit 3830dfd

File tree

1 file changed

+6
-2
lines changed

1 file changed

+6
-2
lines changed

alerts/kube_apiserver.libsonnet

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,9 @@ local utils = import '../lib/utils.libsonnet';
5050
{
5151
alert: 'KubeClientCertificateExpiration',
5252
expr: |||
53-
apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0 and on(%(clusterLabel)s, job) histogram_quantile(0.01, sum by (%(clusterLabel)s, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationWarningSeconds)s
53+
histogram_quantile(0.01, sum without (%(namespaceLabel)s, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationWarningSeconds)s
54+
and
55+
on(job, %(clusterLabel)s, instance) apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0
5456
||| % $._config,
5557
'for': '5m',
5658
labels: {
@@ -64,7 +66,9 @@ local utils = import '../lib/utils.libsonnet';
6466
{
6567
alert: 'KubeClientCertificateExpiration',
6668
expr: |||
67-
apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0 and on(%(clusterLabel)s, job) histogram_quantile(0.01, sum by (%(clusterLabel)s, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationCriticalSeconds)s
69+
histogram_quantile(0.01, sum without (%(namespaceLabel)s, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationCriticalSeconds)s
70+
and
71+
on(job, %(clusterLabel)s, instance) apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0
6872
||| % $._config,
6973
'for': '5m',
7074
labels: {

0 commit comments

Comments
 (0)