
Conversation

wking (Member) commented Apr 11, 2025:

Prometheus alerts support Go templating, and this pull uses that to provide more context like "which namespace?", "which PodDisruptionBudget?", "where can I find that PDB in the in-cluster web console?", and "what oc command would I run to see garbage-collection sync logs?". This should make understanding the context of the alert more straightforward, without the responder having to dip into labels and guess.

@openshift-ci bot requested review from deads2k and ingvagabund on April 11, 2025 16:59
openshift-ci bot (Contributor) commented Apr 11, 2025:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign atiratree for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking force-pushed the template-alert-descriptions branch from 4a9d9e4 to daae216 on April 11, 2025 19:54
summary: The pod disruption budget is preventing further disruption to pods.
description: The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.
description: |-
The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace}} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
wking (Member Author) commented:

Hmm, not sure what to do with the unit failure:

TestYamlCorrectness (0s)

the test-case's stdout includes:

=== RUN   TestYamlCorrectness
    assets_test.go:20: Unexpected error reading manifests from ../../manifests/: failed to render "0000_90_kube-controller-manager-operator_05_alerts.yaml": template: 0000_90_kube-controller-manager-operator_05_alerts.yaml:29: undefined variable "$labels"

I guess that's this assets.New call, through assetFromTemplate and renderFile, to this template.New. I'm not clear on why this operator feels like these manifests should be Go templates. Maybe we can pivot to using ManifestsFromFiles.
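For illustration, a minimal standalone sketch (not the operator's actual render path) of why that parse fails: Go's text/template resolves variables at parse time, and nothing in a bare template parse declares the Prometheus-provided $labels variable.

package main

import (
	"fmt"
	"text/template"
)

func main() {
	// The alert annotation relies on $labels, which Prometheus declares
	// before expanding the template. A bare text/template parse has no
	// such declaration, so Parse rejects the template up front.
	const desc = `The {{ $labels.poddisruptionbudget }} pod disruption budget ...`
	_, err := template.New("0000_90_kube-controller-manager-operator_05_alerts.yaml").Parse(desc)
	fmt.Println(err)
	// prints something like:
	// template: 0000_90_kube-controller-manager-operator_05_alerts.yaml:1: undefined variable "$labels"
}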

Reply from a project member:

That is one option.

But we could also define the variables and actually try to render it in a test, similar to what we do with other templates.

It might be useful to have a test specifically for rendering the alerts to see that it resolves correctly.
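As a sketch of that idea (the query/first/label function names come from the Prometheus template reference, but the stub types, helper, and test below are purely illustrative and not existing code in this repo), a test could pre-declare $labels the way Prometheus does and stub out the console_url pipeline so the annotation renders end to end:

package alerts_test

import (
	"strings"
	"testing"
	"text/template"
)

// renderAlertDescription approximates what Prometheus does at alert time:
// declare $labels up front and supply the template functions the annotation
// uses. The stubs are illustrative only; "query" pretends the console_url
// metric is absent, so the console-link branch would simply be skipped.
func renderAlertDescription(t *testing.T, text string, labels map[string]string) string {
	t.Helper()
	funcs := template.FuncMap{
		"query": func(q string) []map[string]string { return nil },
		"first": func(v []map[string]string) map[string]string {
			if len(v) == 0 {
				return map[string]string{}
			}
			return v[0]
		},
		"label": func(name string, m map[string]string) string { return m[name] },
	}
	// Prometheus effectively prepends variable declarations before parsing.
	tmpl, err := template.New("description").Funcs(funcs).Parse(`{{ $labels := .Labels }}` + text)
	if err != nil {
		t.Fatalf("parse: %v", err)
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, struct{ Labels map[string]string }{labels}); err != nil {
		t.Fatalf("execute: %v", err)
	}
	return out.String()
}

func TestPodDisruptionBudgetAtLimitDescription(t *testing.T) {
	desc := `The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace }} namespace is at the maximum allowed disruption.`
	got := renderAlertDescription(t, desc, map[string]string{
		"namespace":           "openshift-kube-controller-manager",
		"poddisruptionbudget": "example",
	})
	if !strings.Contains(got, "example pod disruption budget in the openshift-kube-controller-manager namespace") {
		t.Fatalf("unexpected rendering: %q", got)
	}
}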

labels:
severity: critical
namespace: openshift-kube-controller-manager
- alert: PodDisruptionBudgetAtLimit
wking (Member Author) commented:

New thread for Cluster Bot testing. As of daae216, on a 'launch 4.19,openshift/cluster-kube-controller-manager-operator#837 aws' cluster, make a PDB mad:

$ oc adm cordon -l node-role.kubernetes.io/worker=
$ oc -n openshift-monitoring delete pod prometheus-k8s-0
$ oc -n openshift-monitoring get poddisruptionbudget prometheus-k8s
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
prometheus-k8s   1               N/A               0                     42m

I didn't wait for the alert to kick over into firing, but checking on pending, this looks... almost good to me:

[screenshot: pending alert description as rendered in the console]

The issue is the <span class="co-resource-item monitoring__resource-item--monitoring-alert co-resource-item--inline"> markup injected for the namespace into the middle of my attempt at constructing a console link.

To trip PodDisruptionBudgetLimit I'll look to a different workload, since I don't want to completely break Prometheus (it would make it hard to test alert behavior):

$ oc adm cordon -l node-role.kubernetes.io/master=
$ oc -n openshift-console delete -l component=downloads pods
$ oc -n openshift-console get poddisruptionbudget downloads
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
downloads   N/A             1                 0                     53m

In that case, the rendering looks great, although I'm not clear on why it's not seeing the NS rendering issue:

[screenshot: alert description as rendered in the console]

I'm also not clear on how to trigger GarbageCollectorSyncFailed to test its rendering.

wking (Member Author) commented:

daae216 -> 9331433 added some whitespace before a }} to try to get closer to what the working PodDisruptionBudgetLimit description is doing:

$ git diff --word-diff daae2166273f0a..9331433c28f8eb3 -U0
diff --git a/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml b/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
index 8135439..5f299b9 100644
--- a/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
+++ b/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
@@ -29 +29 @@ spec:
              The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ [-$labels.namespace}}-]{+$labels.namespace }}+} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}

But sadly the NS markup injected into the middle of the console PDB link is still there:

[screenshot: console link still broken by the injected namespace markup]

Reply from a project member:

Yeah, this is not really optimal. As far as I can see, it should be pretty simple to fix by adding the poddisruptionbudget resource here: https://github.com/openshift/monitoring-plugin/blob/6f948e4323bdf7c68e6b625ce3020116b5b4571a/web/src/components/alerting/AlertsDetailPage.tsx#L450

…e console links in alert descriptions

Prometheus alerts support Go templating [1], and this commit uses that
to provide more context like "which namespace?", "which
PodDisruptionBudget?", "where can I find that PDB in the in-cluster
web console?", and "what 'oc' command would I run to see
garbage-collection sync logs?".  This should make understanding the
context of the alert more straightforward, without the responder
having to dip into labels and guess.

Using |- for trimmed, block-style strings avoids YAML parsers choking
on the "for more details: ..." colon with "mapping values are not
allowed in this context" and similar.

[1]: https://prometheus.io/docs/prometheus/latest/configuration/template_reference/
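For reference, a small sketch of the failure mode the |- block scalar avoids (using gopkg.in/yaml.v3 here purely for illustration; any YAML parser reports the same class of error):

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

func main() {
	var v map[string]interface{}

	// A plain scalar with an embedded ": " trips up the parser, which
	// reports "mapping values are not allowed in this context".
	bad := []byte(`description: Please see the logs for more details: 'oc logs ...'`)
	fmt.Println(yaml.Unmarshal(bad, &v))

	// A |- block scalar keeps the whole line, colon and all, as one string.
	good := []byte("description: |-\n  Please see the logs for more details: 'oc logs ...'")
	fmt.Println(yaml.Unmarshal(good, &v)) // <nil>
	fmt.Println(v["description"])
}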
@wking force-pushed the template-alert-descriptions branch from daae216 to 9331433 on April 15, 2025 22:19
@petr-muller (Member) commented:

/cc

@openshift-ci bot requested a review from petr-muller on April 16, 2025 11:38
@petr-muller (Member) commented:

LGTM from a consumer perspective 👍

@atiratree (Member) left a review:

There are a few rough edges, but the general idea is very nice!

summary: The pod disruption budget is preventing further disruption to pods.
description: The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.
description: |-
The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace }} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
@atiratree (Member) commented Apr 25, 2025:

Then we would not need the console URL and could let the console handle all the link rendering for the poddisruptionbudget. It would also make the description nicer when looking at the alert detail in /monitoring/alertrules/1234.

summary: There was a problem with syncing the resources for garbage collection.
description: Garbage Collector had a problem with syncing and monitoring the available resources. Please see KubeControllerManager logs for more details.
description: |-
Garbage Collector had a problem with syncing and monitoring the available resources. Please see KubeControllerManager logs for more details: 'oc -n {{ $labels.namespace }} logs -c {{ $labels.container }} {{ $labels.pod }}'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/pods/{{ $labels.pod }}/logs?container={{ $labels.container }} {{ end }}{{ end }}.
Reply from a project member:

This seems too verbose to me. How to invoke the logs should be a responsibility of the runbook IMO.

But similar to the PDB case, we could link directly to the pod in the console: https://github.com/openshift/monitoring-plugin/blob/6f948e4323bdf7c68e6b625ce3020116b5b4571a/web/src/components/alerting/AlertsDetailPage.tsx#L450 without too much extra markup.

@openshift-bot (Contributor) commented:

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label ("Denotes an issue or PR has remained open with no activity and has become stale.") on Jul 25, 2025
@openshift-bot (Contributor) commented:

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci bot added the lifecycle/rotten label ("Denotes an issue or PR that has aged beyond stale and will be auto-closed.") and removed the lifecycle/stale label on Aug 24, 2025
openshift-ci bot (Contributor) commented Sep 23, 2025:

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-tests-extension | 9331433 | link | true | /test e2e-tests-extension
ci/prow/e2e-aws-ovn-upgrade | 9331433 | link | true | /test e2e-aws-ovn-upgrade
ci/prow/unit | 9331433 | link | true | /test unit
ci/prow/e2e-aws-operator | 9331433 | link | true | /test e2e-aws-operator
ci/prow/e2e-aws-ovn | 9331433 | link | true | /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
