Skip to content

KFLUXINFRA-2067: Push Kueue metrics to RHOBS #7563

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 12, 2025

Conversation

gbenhaim
Copy link
Member

Those metrics will be used for creating SLO alerts.

Copy link
Contributor

Code Review by Gemini

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -147,7 +147,7 @@
         # Kueue
         - '{__name__="tekton_kueue_cel_evaluations_total"}'
         - '{__name__="kube_pod_container_status_restarts_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
-        - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator}'
+        - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
         - '{__name__="kueue_cluster_queue_status"}'
         - '{__name__="kueue_admission_wait_time_seconds_bucket"}'
         - '{__name__="apiserver_admission_webhook_request_total", name="pipelinerun-kueue-defaulter.tekton-kueue.io"}'

Issue:
The line for kube_pod_container_status_ready is missing a closing double quote " for the namespace regex and a closing curly brace } for the entire match expression. This will lead to a parsing error in the Prometheus configuration.

Suggested change:

         - '{__name__="kube_pod_container_status_restarts_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
-        - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator}'
+        - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
         - '{__name__="kueue_cluster_queue_status"}'

Copy link
Contributor

Code Review by Gemini

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -148,9 +148,9 @@ spec:
         - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'
 
         # Kueue
-        - '{__name__="tekton_kueue_cel_evaluations_total"}'
+        - '{__name__="tekton_kueue_cel_evaluations_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
         - '{__name__="kube_pod_container_status_restarts_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
         - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
-        - '{__name__="kueue_cluster_queue_status"}'
-        - '{__name__="kueue_admission_wait_time_seconds_bucket"}'
+        - '{__name__="kueue_cluster_queue_status", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
+        - '{__name__="kueue_admission_wait_time_seconds_bucket", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
         - '{__name__="apiserver_admission_webhook_request_total", name="pipelinerun-kueue-defaulter.tekton-kueue.io"}'
 
         # Namespace (expression):  "build-service"

Copy link
Contributor

Code Review by Gemini

### Suggested Improvements

For consistency with other `kube_pod_container_status` entries in the file (e.g., for `release-service`), it is recommended to list `kube_pod_container_status_ready` before `kube_pod_container_status_restarts_total`.

**File: `components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml`**

```diff
--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -148,8 +148,8 @@
 
         # Kueue
         - '{__name__="tekton_kueue_cel_evaluations_total"}'
-        - '{__name__="kube_pod_container_status_restarts_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
-        - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
+        - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
+        - '{__name__="kube_pod_container_status_restarts_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
         - '{__name__="kueue_cluster_queue_status"}'
         - '{__name__="kueue_admission_wait_time_seconds_bucket"}'
         - '{__name__="apiserver_admission_webhook_request_total", name="pipelinerun-kueue-defaulter.tekton-kueue.io"}'

@gbenhaim
Copy link
Member Author

@FaisalAl-Rayes @mike-kingsbury can you ptal?

@gbenhaim
Copy link
Member Author

/test appstudio-e2e-tests

Copy link
Contributor

@mike-kingsbury mike-kingsbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM....

Copy link

openshift-ci bot commented Aug 11, 2025

@gbenhaim: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/appstudio-e2e-tests ad63d80 link true /test appstudio-e2e-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Those metrics will be used for creating SLO alerts.

Signed-off-by: Gal Ben Haim <[email protected]>
Copy link
Contributor

Code Review by Gemini

No bugs or issues were found in the provided code changes. The additions are appropriate for pushing Kueue metrics and correctly ensure that necessary labels are preserved.

components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml

components/monitoring/prometheus/staging/base/monitoringstack/endpoints-params.yaml

Improvements:
The added metrics and their filtering are correctly configured to capture relevant Kueue data.

Exact lines to be changed (already implemented in the diff):
No changes needed, the current additions are correct.

--- a/components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml
+++ b/components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml
@@ -53,6 +53,13 @@
     - '{__name__="kube_deployment_status_replicas_ready", namespace="gitops-service-argocd"}'
     - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'
 
+    # Kueue
+    - '{__name__="tekton_kueue_cel_evaluations_total"}'
+    - '{__name__="kube_pod_container_status_restarts_total", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
+    - '{__name__="kube_pod_container_status_ready", namespace=~"tekton-kueue|kueue-external-status_ready", namespace=~"tekton-kueue|kueue-external-admission|openshift-kueue-operator"}'
+    - '{__name__="kueue_cluster_queue_status"}'
+    - '{__name__="kueue_admission_wait_time_seconds_bucket"}'
+    - '{__name__="apiserver_admission_webhook_request_total", name="pipelinerun-kueue-defaulter.tekton-kueue.io"}'
+    - '{__name__="up", job=~".*kueue.*"}'
+
     # Namespace (expression):  "build-service"
     - '{__name__="kube_deployment_status_replicas_ready", namespace="build-service"}'
     - '{__name__="kube_deployment_status_replicas_available", namespace="build-service"}'

components/monitoring/prometheus/production/base/monitoringstack/writeRelabelConfigs.yaml

components/monitoring/prometheus/staging/base/monitoringstack/writeRelabelConfigs.yaml

Improvements:
Adding name and cluster_queue to the regex for labels to keep is essential for the new Kueue metrics to be meaningful, as these labels provide critical context.

Exact lines to be changed (already implemented in the diff):
No changes needed, the current additions are correct.

--- a/components/monitoring/prometheus/production/base/monitoringstack/writeRelabelConfigs.yaml
+++ b/components/monitoring/prometheus/production/base/monitoringstack/writeRelabelConfigs.yaml
@@ -15,4 +15,5 @@
       http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
       policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
-      resource_request_operation|resource_kind|policy_change_type|event_type"
+      resource_request_operation|resource_kind|policy_change_type|event_type|\
+      name|cluster_queue"

@gbenhaim
Copy link
Member Author

@mike-kingsbury I rebased the PR. I also modified the LabelKeep list to include the labels I need. Can you please take a look again? thanks!

@mike-kingsbury
Copy link
Contributor

/lgtm
/approve

Copy link

openshift-ci bot commented Aug 12, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gbenhaim, mike-kingsbury

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 88cea78 into redhat-appstudio:main Aug 12, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants