o11y: Split MonitoringStack by environment #7509


Merged
merged 5 commits into main from split-monitoringstack-env on Aug 11, 2025

Conversation

pacho-rh
Contributor

@pacho-rh pacho-rh commented Aug 6, 2025

This is an attempt to speed up the pipeline whenever changes are made to this MonitoringStack definition, which is especially important if we need to revert a breaking change to it. Additionally, this allows us to test changes to the definition in dev or stage before applying them to production.

@openshift-ci openshift-ci bot requested review from eisraeli and kubasikus August 6, 2025 18:35
@openshift-ci openshift-ci bot added the approved label Aug 6, 2025
@pacho-rh
Contributor Author

pacho-rh commented Aug 6, 2025

cc @gcpsoares @mftb

@TominoFTW
Contributor

/lgtm
/approve

@TominoFTW
Contributor

TominoFTW commented Aug 7, 2025

Approved. I am just wondering what the development directory is for. Is that for local cluster development?

With that, the only remaining question is whether the pipeline for adding something should simply be: update staging => check that it is how you desired (change/revert) => update production?

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

Approved. I am just wondering what the development directory is for. Is that for local cluster development?

Yup, that directory is for deployment to development clusters (e.g. a local dev cluster).

With that, the only remaining question is whether the pipeline for adding something should simply be: update staging => check that it is how you desired (change/revert) => update production?

Yup. Changes should be made to development in addition to staging. But that makes me realize we perhaps ought to have the development layer refer to the staging MonitoringStack definition to remove that disconnect.
I'll see if I can get the changes in here. If not, then in a follow-up PR.

EDIT: nvm, looks like we need to do that another way. I get this error if I try to reference the staging MonitoringStack definition directly:

$ kustomize build components/monitoring/prometheus/development/monitoringstack/
Error: accumulating resources: accumulation err='accumulating resources from '../../staging/base/monitoringstack/monitoringstack.yaml': security; file '.../infra-deployments/components/monitoring/prometheus/staging/base/monitoringstack/monitoringstack.yaml' is not in or below '.../infra-deployments/components/monitoring/prometheus/development/monitoringstack'': must build at directory: '.../infra-deployments/components/monitoring/prometheus/staging/base/monitoringstack/monitoringstack.yaml': file is not directory

@openshift-ci openshift-ci bot removed the lgtm label Aug 7, 2025

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

But that makes me realize we perhaps ought to have the development layer refer to the staging MonitoringStack definition to remove that disconnect.

I kept the old base MonitoringStack for staging and development to refer to, and instead gave production its own MonitoringStack definition.


@ci-operator
Contributor

Having only staging/development point to the base while production points to a definition outside the base effectively inverts the expectation that the base is production plus anything that shares its config. Instead of a follow-up PR, could we include the change in this PR as well (so both development and staging point to the same config in their own directory, and not to the base)?

@TominoFTW
Contributor

TominoFTW commented Aug 7, 2025

I am wondering if we could just create a "symlink" to the file in staging, perhaps with the path ../../../staging/base/monitoringstack/monitoringstack.yaml for development? I think something similar worked for the playbooks.

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

I am wondering if we could just create a "symlink" to the file in staging, perhaps with the path ../../../staging/base/monitoringstack/monitoringstack.yaml for development? I think something similar worked for the playbooks.

Using symlinks, I get the same error I posted previously.

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

could we include the change in this PR as well (so both development and staging point to the same config in their own directory, and not to the base)?

@ci-operator @TominoFTW - I'm revising this, but any ideas on how to go about it?

I cannot have development's kustomization.yaml refer directly to resources outside of its directory (e.g. ../../staging/base/monitoringstack/monitoringstack.yaml). I am, however, allowed to refer to the directory holding the staging base definitions (../../staging/base/monitoringstack).
My issue with that is that the directory includes more than just monitoringstack.yaml: it also contains some patching that would get applied to development, so further changes made to the staging base may inadvertently affect development.
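
To illustrate the constraint, here is a minimal sketch of what development's kustomization.yaml can and cannot reference (the exact contents are assumptions; only the two paths come from the discussion above):

# components/monitoring/prometheus/development/monitoringstack/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Rejected by Kustomize's default load restrictions: a plain file outside
  # this kustomization's directory.
  # - ../../staging/base/monitoringstack/monitoringstack.yaml
  # Accepted: a directory with its own kustomization.yaml, but this pulls in
  # everything that kustomization emits, including its patches.
  - ../../staging/base/monitoringstack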

An idea I have is to use the following structure, but what do you think?

.
...
├── development
    ...
│   └── monitoringstack
│       ├── cluster-type-patch.yaml
│       ├── kustomization.yaml (points to ../../staging/base/monitoringstack)
│       └── remote-write-env-details.yaml
└── staging
    ├── base
    ...
    │   ├── monitoringstack
    │   │   ├── kustomization.yaml
    │   │   └── monitoringstack.yaml
    ├── monitoringstack
    │   ├── cluster-type-patch.yaml
    │   ├── kustomization.yaml (points to ../base/monitoringstack)
    │   └── remote-write-env-details.yaml
    ├── stone-stage-p01
    ...

@TominoFTW
Contributor

Hmm, I am worried that it gets more and more confusing with that change 🤔

But I don't really see any other way around it, so if that is working I am okay with using it. 👍

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

Or better yet:

.
...
├── development
    ...
│   └── monitoringstack
│       ├── cluster-type-patch.yaml
│       ├── kustomization.yaml (points to ../../staging/base/monitoringstackbase)
│       └── remote-write-env-details.yaml
└── staging
    ├── base
    │   ├── kustomization.yaml
    │   ├── monitoringstack
    │   │   ├── cluster-type-patch.yaml
    │   │   ├── kustomization.yaml (points to ../monitoringstackbase)
    │   │   └── remote-write-env-details.yaml
    │   ├── monitoringstackbase
    │   │   ├── kustomization.yaml
    │   │   └── monitoringstack.yaml
    │   └── rhobs-secret-path.yaml
    ├── kflux-stg-es01
    ...

@TominoFTW
Contributor

TominoFTW commented Aug 7, 2025

Or better yet:

One more thought for this:

.
├── base/  # Production only
├── shared/  # Shared config for dev/staging
│   └── monitoringstack/
│       ├── monitoringstack.yaml
│       └── kustomization.yaml
├── development/ # Ref to shared
└── staging/ # Ref to shared

Could something like this be done?

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

Could something like this be done?

I think this runs into the same problem: it could be assumed that shared also includes resources for production.

@TominoFTW
Contributor

Could something like this be done?

I think this runs into the same problem: it could be assumed that shared also includes resources for production.

then stg-dev-(shared)-monitoringstack 😆

I just want to completely drop the idea of nested folders when symlinks don't work 🤔


@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

then stg-dev-(shared)-monitoringstack 😆

I just want to completely drop the idea of nested folders when symlinks don't work 🤔

I agree with avoiding nested folders. I made a new folder to hold the resources common to staging and development and moved the MonitoringStack definition under there.
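
As a rough sketch of that arrangement (the stg-dev-common name comes from the diff reviewed below; the exact kustomization contents are assumptions), the staging and development overlays would each pull the shared definition in as a resource:

# components/monitoring/prometheus/development/monitoringstack/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # MonitoringStack shared by staging and development only; production keeps
  # its own copy under production/base/monitoringstack.
  - ../../stg-dev-common/monitoringstack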

@pacho-rh pacho-rh force-pushed the split-monitoringstack-env branch from 03412ad to b8046b8 Compare August 7, 2025 17:10
Contributor

github-actions bot commented Aug 7, 2025

Code Review by Gemini

The changes aim to split the MonitoringStack definition by environment to improve pipeline speed and enable environment-specific testing.

However, there are a couple of points to consider:

Duplication of MonitoringStack Definition

The MonitoringStack and ServiceMonitor definitions in components/monitoring/prometheus/production/base/monitoringstack/monitoringstack.yaml (the renamed original file) and components/monitoring/prometheus/stg-dev-common/monitoringstack/monitoringstack.yaml (the new file) are almost entirely identical. The only functional difference observed in this diff is the resources block within the MonitoringStack kind.

This duplication means that any future changes to the core MonitoringStack configuration (e.g., retention, logLevel, remoteWrite relabelings, or the ServiceMonitor scrape configurations) will need to be applied to both files. This increases maintenance overhead and the risk of inconsistencies between environments.

Suggestion:
Consider creating a truly common base/monitoringstack/monitoringstack.yaml that contains the shared definition of the MonitoringStack, ClusterRoleBinding, and ServiceMonitor resources. Then, use Kustomize overlays in components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml and components/monitoring/prometheus/stg-dev-common/monitoringstack/kustomization.yaml to apply environment-specific patches (e.g., resource limits, remoteWrite URLs/audiences if they differ, or any other future divergences).

This approach would reduce duplication and make it easier to manage common configurations while still allowing for environment-specific overrides.
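
A minimal sketch of such an overlay, assuming the shared definition lives under a common base/monitoringstack and resource limits are the only divergence (the file name resources-patch.yaml is illustrative):

# components/monitoring/prometheus/stg-dev-common/monitoringstack/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/monitoringstack   # shared MonitoringStack/ServiceMonitor definition
patches:
  - path: resources-patch.yaml   # environment-specific resource requests/limits
    target:
      kind: MonitoringStack
      name: appstudio-federate-ms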

Resource Allocation for Staging/Development

In components/monitoring/prometheus/stg-dev-common/monitoringstack/monitoringstack.yaml, the MonitoringStack is configured with memory: 16Gi for both requests and limits:

# components/monitoring/prometheus/stg-dev-common/monitoringstack/monitoringstack.yaml
    requests:
      cpu: 500m
      memory: 16Gi
    limits:
      memory: 16Gi

While this might be appropriate for production, 16Gi of memory seems quite high for a development or staging environment. If the goal is to speed up pipelines and enable testing, having lower resource requirements for non-production environments could be beneficial, especially if resources are constrained.

Suggestion:
Review if 16Gi memory is truly necessary for the stg-dev-common environment. Consider reducing this value to a more appropriate level for non-production usage, or making it configurable via an overlay if different staging/dev clusters have varying resource capacities.
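
If lower values are adopted, a patch along these lines could be layered onto the staging/development overlay (this assumes the resources block sits at spec.resources in the MonitoringStack CR; the 4Gi figure is a placeholder, not a recommendation):

# resources-patch.yaml (sketch)
- op: replace
  path: /spec/resources/requests/memory
  value: 4Gi
- op: replace
  path: /spec/resources/limits/memory
  value: 4Gi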


@pacho-rh pacho-rh force-pushed the split-monitoringstack-env branch from 6abeb36 to 9bf4a12 Compare August 7, 2025 18:40
Contributor

github-actions bot commented Aug 7, 2025

Code Review by Gemini

The changes aim to split the MonitoringStack definition by environment, which is a good approach for improving pipeline speed and enabling environment-specific testing. However, there are several critical issues in the implementation that need to be addressed.

Identified Issues and Suggested Changes:

1. File: components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

Issue 1: Dangerous default for writeRelabelConfigs regex.
The regex for LabelKeep is changed to an empty string (""). In Prometheus, LabelKeep with an empty regex means "keep no labels", effectively stripping all labels from metrics before remote writing. This will make the metrics largely unusable for analysis and alerting. If the intent is to make this configuration environment-specific, this block should be removed from the base entirely, and then added by environment-specific overlays.

Suggested Change:
Remove the writeRelabelConfigs block entirely from the base MonitoringStack definition.

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -34,19 +34,11 @@
           audience: # added by overlays
         tokenUrl: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
       url: # added by overlays
-      writeRelabelConfigs:
-      - action: LabelKeep
-        regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
-          label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
-          controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
-          commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
-          storageclass|volumename|release_reason|instance|result|deployment_reason|\
-          validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
-          unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
-          pipeline|pipelinerun|schedule|check|grpc_service|grpc_code|\
-          grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
-          verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
-          http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
-          policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
-          resource_request_operation|resource_kind|policy_change_type|event_type"
-          
-
+      # writeRelabelConfigs: This block should be added by environment-specific overlays
 ---
 # Grant permission to Federate In-Cluster Prometheus
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
   name: appstudio-federate-ms-view
   labels:

Issue 2: Dangerous default for ServiceMonitor match[] parameters.
The match[] array under endpoints/0/params is changed to an empty array ([]). In Prometheus federation, an empty match[] parameter means "match all series". This will cause the ServiceMonitor to scrape and remote write all metrics from the in-cluster Prometheus, leading to excessive data volume, high costs, and potential performance issues for both the local Prometheus and the remote endpoint. Similar to writeRelabelConfigs, this should be removed from the base and added by environment-specific overlays.

Suggested Change:
Remove the match[] array entirely from the base ServiceMonitor definition.

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -85,108 +85,10 @@
       app.kubernetes.io/managed-by: observability-operator
       app.kubernetes.io/name: appstudio-federate-ms-prometheus
   endpoints:
   - params:
-      'match[]': []  # scrape only required metrics from in-cluster prometheus
     relabelings:
     # override the target's address by the prometheus-k8s service name.
     - action: replace
       targetLabel: __address__
       replacement: prometheus-k8s.openshift-monitoring.svc:9091
     # remove the default target labels as they aren't relevant in case of federation.
     - action: labeldrop
       regex: pod|namespace|service|endpoint|container
     # 30s interval creates 4 scrapes per minute
     # prometheus-k8s.svc x 2 ms-prometheus x (60s/ 30s) = 4

2. File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

Issue: Incorrect patch operation (op: replace).
The patch operation for cluster-type-patch.yaml was changed from op: add to op: replace at path: /spec/endpoints/0/relabelings/0. The ServiceMonitor in the base already has a relabeling at index 0 (for __address__). Changing to op: replace will overwrite this essential relabeling instead of adding a new one. The original op: add at index 0 would insert the new relabeling and shift existing ones, which is a valid way to add.

Suggested Change:
Revert op: replace back to op: add.

--- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
@@ -1,6 +1,6 @@
 ---
-- op: replace
+- op: add
   path: /spec/endpoints/0/relabelings/0
   value:
     targetLabel: source_environment
     replacement: development-cluster

3. File: components/monitoring/prometheus/development/monitoringstack/kustomization.yaml

Issue 1: Incorrect base resource dependency.
The development kustomization now uses ../../staging/base/monitoringstack as its base. This breaks the logical hierarchy where base should be the most generic definition, and environments (dev, staging, prod) build upon it.

Suggested Change:
Change the base resource back to ../../base/monitoringstack.

--- a/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
@@ -1,7 +1,7 @@
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 resources:
   - ../../base/observability-operator
-  - ../../staging/base/monitoringstack
+  - ../../base/monitoringstack
 patches:
   - path: cluster-type-patch.yaml
     target:

Issue 2: Missing environment-specific patches.
The development kustomization is missing the patches for endpoints-params-match.yaml and writeRelabelConfigs-regex.yaml. This means it will inherit the problematic empty match[] (match all) and empty regex (keep no labels) from the modified base file, leading to incorrect metric scraping and label filtering. These patches should be applied to the development environment as well. Assuming development should use the same configurations as staging for these for now.

Suggested Change:
Add the endpoints-params-match.yaml and writeRelabelConfigs-regex.yaml patches, referencing the staging versions for consistency.

--- a/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
@@ -10,3 +10,11 @@
       kind: ServiceMonitor
   - path: remote-write-env-details.yaml
     target:
       name: appstudio-federate-ms
       kind: MonitoringStack
+  - path: ../../staging/base/monitoringstack/endpoints-params-match.yaml
+    target:
+      name: appstudio-federate-smon
+      kind: ServiceMonitor
+  - path: ../../staging/base/monitoringstack/writeRelabelConfigs-regex.yaml
+    target:
+      name: appstudio-federate-ms
+      kind: MonitoringStack

4. Files: components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml and components/monitoring/prometheus/staging/base/monitoringstack/kustomization.yaml

Issue: Incorrect patch operation (op: replace).
If match[] and writeRelabelConfigs are removed from the base monitoringstack.yaml (as suggested in point 1), then the patches that add these fields in the environment-specific kustomizations should use op: add instead of op: replace.

Suggested Change (for both production and staging kustomizations):
Change op: replace to op: add for the endpoints-params-match.yaml and writeRelabelConfigs-regex.yaml patches.

--- a/components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml
@@ -5,18 +5,22 @@
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-smon
       kind: ServiceMonitor
-  - path: endpoints-params-match.yaml
+  - op: add # Change from 'replace' to 'add'
+    path: endpoints-params-match.yaml
     target:
       name: appstudio-federate-smon
       kind: ServiceMonitor
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-uwm-smon
       kind: ServiceMonitor
   - path: remote-write-env-details.yaml
     target:
       name: appstudio-federate-ms
       kind: MonitoringStack
-  - path: writeRelabelConfigs-regex.yaml
+  - op: add # Change from 'replace' to 'add'
+    path: writeRelabelConfigs-regex.yaml
     target:
       name: appstudio-federate-ms
       kind: MonitoringStack

This is an attempt to speed up the pipeline whenever changes are made
to this MonitoringStack definition. This is especially important if we
need to revert a breaking change to this definition.
Additionally, this allows us to test changes to this definition in dev
or stage first before applying them to production.
@pacho-rh pacho-rh force-pushed the split-monitoringstack-env branch from 9bf4a12 to 6c8d608 Compare August 7, 2025 19:18
Contributor

github-actions bot commented Aug 7, 2025

Code Review by Gemini

The changes aim to split the MonitoringStack definition by environment to improve pipeline speed and enable environment-specific testing. While the overall approach of moving environment-specific configurations to overlays is good, there are a couple of issues that need to be addressed.

Issues and Suggestions

1. Bug: Incorrect Kustomize operation in cluster-type-patch.yaml for development

The patch in components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml changes the operation from add to replace for a ServiceMonitor relabeling rule. This is problematic because the base ServiceMonitor already has a critical relabeling rule at index 0 (for __address__). Changing op: replace will overwrite this existing rule instead of adding the source_environment label as intended. The original op: add was correct for inserting the new rule.

File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

--- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
@@ -1,6 +1,6 @@
 ---
-- op: replace
+- op: add
   path: /spec/endpoints/0/relabelings/0
   value:
     targetLabel: source_environment
     replacement: development-cluster

2. Architectural Inconsistency: Development environment's Kustomize base

The development environment's kustomization.yaml now bases its MonitoringStack resource on ../../staging/base/monitoringstack. This creates an inconsistent hierarchy compared to staging and production, which both base directly on ../../../base/monitoringstack.

This setup means that development will implicitly inherit the endpoints-params and writeRelabelConfigs from staging without explicitly defining them. This contradicts the goal of "testing changes to this definition in dev or stage first" if dev and stage are effectively identical in terms of these configurations and dev cannot easily diverge.

For better clarity, independence, and adherence to Kustomize best practices for environment overlays, development should also base directly on the common ../../../base/monitoringstack and apply its own specific patches, similar to staging and production.

File: components/monitoring/prometheus/development/monitoringstack/kustomization.yaml

--- a/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
@@ -3,12 +3,20 @@
 resources:
   - ../../base/observability-operator
-  - ../../staging/base/monitoringstack
+  - ../../../base/monitoringstack # Base off the common base
 patches:
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-smon
       kind: ServiceMonitor
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-uwm-smon
       kind: ServiceMonitor
   - path: remote-write-env-details.yaml
+    target:
+      name: appstudio-federate-ms
+      kind: MonitoringStack
+  - path: endpoints-params.yaml # Add dev-specific endpoints-params
+    target:
+      name: appstudio-federate-smon
+      kind: ServiceMonitor
+  - path: writeRelabelConfigs.yaml # Add dev-specific writeRelabelConfigs
+    target:
+      name: appstudio-federate-ms
+      kind: MonitoringStack

New File: components/monitoring/prometheus/development/monitoringstack/endpoints-params.yaml
(Content should be identical to components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml initially, allowing for future divergence.)

---
- op: replace
  path: /spec/endpoints/0/params
  value:
    'match[]':  # scrape only required metrics from in-cluster prometheus
    - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
    - '{__name__="pipeline_service_schedule_overhead_percentage_count"}'
    - '{__name__="pipeline_service_execution_overhead_percentage_sum"}'
    - '{__name__="pipeline_service_execution_overhead_percentage_count"}'
    - '{__name__="pipelinerun_duration_scheduled_seconds_sum"}'
    - '{__name__="pipelinerun_duration_scheduled_seconds_count"}'
    - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_sum"}'
    - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_count"}'
    - '{__name__="pipelinerun_kickoff_not_attempted_count"}'
    - '{__name__="pending_resolutionrequest_count"}'
    - '{__name__="taskrun_pod_create_not_attempted_or_pending_count"}'
    - '{__name__="tekton_pipelines_controller_pipelinerun_count"}'
    - '{__name__="tekton_pipelines_controller_running_pipelineruns_count"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota_count"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node_count"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node"}'
    - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_sum"}'
    - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_count"}'
    - '{__name__="watcher_workqueue_depth"}'
    - '{__name__="watcher_client_latency_bucket"}'
    - '{__name__="pac_watcher_work_queue_depth"}'
    - '{__name__="pac_watcher_client_latency_bucket"}'
    - '{__name__="grpc_server_handled_total", namespace=~"tekton-results|openshift-pipelines"}'
    - '{__name__="grpc_server_handled_total", namespace=~"openshift-etcd"}'
    - '{__name__="grpc_server_handling_seconds_bucket", namespace=~"tekton-results|openshift-pipelines"}'
    - '{__name__="grpc_server_handling_seconds_bucket", namespace="openshift-etcd"}'
    - '{__name__="grpc_server_msg_received_total", namespace="openshift-etcd"}'
    - '{__name__="controller_runtime_reconcile_errors_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="controller_runtime_reconcile_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"controller.tektonresolverframework.bundleresolver..*"}'
    - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"tekton-pipelines-controller.github.com.tektoncd.pipeline.pkg.reconciler..*"}'
    - '{__name__="kube_pod_status_unschedulable", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_pod_container_status_restarts_total", namespace=~"openshift-pipelines|release-service"}'
    - '{__name__="kube_pod_container_status_waiting_reason", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_pod_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_pod_container_resource_limits", namespace="release-service"}'
    - '{__name__="kube_pod_container_status_terminated_reason", namespace="release-service"}'
    - '{__name__="kube_pod_container_status_last_terminated_reason", namespace="release-service"}'
    - '{__name__="kube_pod_container_status_ready", namespace="release-service"}'
    - '{__name__="kube_persistentvolume_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_resourcequota", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_statefulset_status_replicas_ready", namespace="gitops-service-argocd"}'
    - '{__name__="kube_statefulset_replicas", namespace="gitops-service-argocd"}'
    - '{__name__="openshift_route_status", namespace="gitops-service-argocd"}'

    - '{__name__="kube_deployment_status_replicas_ready", namespace="gitops-service-argocd"}'
    - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'

    # Namespace (expression):  "build-service"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="build-service"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="build-service"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="build-service"}'

    # Namespace (expression):  "integration-service"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="integration-service"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="integration-service"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="integration-service"}'

    # Namespace (expression):  "konflux-ui"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-ui"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-ui"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="konflux-ui"}'
    - '{__name__="kube_running_pods_ready", namespace="konflux-ui"}'
    - '{__name__="kube_endpoint_address", namespace="konflux-ui"}'
    - '{__name__="kube_pod_container_status_restarts_total", namespace="konflux-ui"}'

    # Namespace (expression):  "mintmaker"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="mintmaker"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="mintmaker"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="mintmaker"}'
    - '{__name__="cluster_ram_requested_perc"}'
    - '{__name__="node_memory_pressured_perc"}'
    - '{__name__="redis_node_memory_usage_perc"}'

    # Namespace (expression):  ~".*monitoring.*"
    - '{__name__="kube_deployment_status_replicas_ready", namespace=~".*monitoring.*"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace=~".*monitoring.*"}'
    - '{__name__="kube_deployment_spec_replicas", namespace=~".*monitoring.*"}'

    # Namespace (expression):  "multi-platform-controller"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="multi-platform-controller"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="multi-platform-controller"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="multi-platform-controller"}'

    # Namespace (expression):  "namespace-lister"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="namespace-lister"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="namespace-lister"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="namespace-lister"}'

    # Namespace (expression):  "openshift-pipelines"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-pipelines"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-pipelines"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-pipelines"}'

    # Namespace (expression):  "product-kubearchive"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="product-kubearchive"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="product-kubearchive"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="product-kubearchive"}'

    # Namespace (expression):  "project-controller"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="project-controller"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="project-controller"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="project-controller"}'

    # Namespace (expression):  "release-service"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="release-service"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="release-service"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="release-service"}'

    # Namespace (expression):  ~"smee.*"
    - '{__name__="kube_deployment_status_replicas_ready", namespace=~"smee.*"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace=~"smee.*"}'
    - '{__name__="kube_deployment_spec_replicas", namespace=~"smee.*"}'

    # Namespace (expression):  "openshift-apiserver"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-apiserver"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-apiserver"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-apiserver"}'

    # Namespace (expression):  "openshift-oauth-apiserver"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-oauth-apiserver"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-oauth-apiserver"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-oauth-apiserver"}'

    # Namespace (expression):  "konflux-kyverno"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-kyverno"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-kyverno"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="konflux-kyverno"}'

    # Namespace (expression):  "openshift-kube-apiserver"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-kube-apiserver"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-kube-apiserver"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-kube-apiserver"}'

    # Namespace (expression):  "konflux-user-support"
    - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-user-support"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="konflux-user-support"}'

    - '{__name__="argocd_app_reconcile_bucket", namespace="gitops-service-argocd"}'
    - '{__name__="argocd_app_info", namespace="gitops-service-argocd"}'
    - '{__name__="container_cpu_usage_seconds_total", namespace="release-service"}'
    - '{__name__="container_cpu_usage_seconds_total", namespace="openshift-etcd"}'
    - '{__name__="container_memory_usage_bytes", namespace="release-service"}'
    - '{__name__="container_memory_usage_bytes", namespace="openshift-etcd"}'
    - '{__name__="etcd_disk_wal_fsync_duration_seconds_bucket"}'
    - '{__name__="etcd_disk_backend_commit_duration_seconds_bucket"}'
    - '{__name__="etcd_server_proposals_failed_total"}'
    - '{__name__="etcd_server_leader_changes_seen_total", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_has_leader", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_is_leader", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_id", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_quota_backend_bytes", namespace="openshift-etcd"}'
    - '{__name__="etcd_mvcc_db_total_size_in_bytes", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_received_total", namespace="openshift-etcd"}'
    - '{__name__="etcd_network_active_peers", namespace="openshift-etcd"}'
    - '{__name__="etcd_network_peer_round_trip_time_seconds_bucket"}'
    - '{__name__="etcd_disk_defrag_inflight"}'
    - '{__name__="kube_job_spec_completions"}'
    - '{__name__="kube_job_status_succeeded"}'
    - '{__name__="kube_job_status_failed"}'
    - '{__name__="node_cpu_seconds_total", mode="idle"}'
    - '{__name__="node_memory_MemTotal_bytes"}'
    - '{__name__="node_memory_MemAvailable_bytes"}'
    - '{__name__="platform:hypershift_hostedclusters:max"}'
    - '{__name__="kube_node_role"}'
    - '{__name__="etcd_shield_trigger"}'
    - '{__name__="etcd_shield_alert_triggered"}'
    - '{__name__="apiserver_admission_webhook_rejection_count", name="vpipelineruns.konflux-ci.dev"}'
    - '{__name__="apiserver_watch_events_total"}'
    - '{__name__="apiserver_storage_objects"}'
    - '{__name__="apiserver_current_inflight_requests"}'
    - '{__name__="resource_verb:apiserver_request_total:rate5m"}'
    - '{__name__="code:apiserver_request_total:rate5m"}'
    - '{__name__="instance:apiserver_request_total:rate5m"}'
    - '{__name__="prometheus_ready"}'
    - '{__name__="process_cpu_seconds_total", job="apiserver"}'
    - '{__name__="namespace:container_memory_usage_bytes:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace:container_cpu_usage:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="node_namespace_pod:kube_pod_info:", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="kube_node_status_allocatable", resource=~"cpu|memory"}'
    - '{__name__="kube_node_status_condition", condition="MemoryPressure", status="true"}'
    - '{__name__="namespace_memory:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace_cpu:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace_memory:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace_cpu:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'

New File: components/monitoring/prometheus/development/monitoringstack/writeRelabelConfigs.yaml
(Content should be identical to components/monitoring/prometheus/production/base/monitoringstack/writeRelabelConfigs.yaml initially, allowing for future divergence.)

---
- op: replace
  path: /spec/prometheusConfig/remoteWrite/0/writeRelabelConfigs
  value:
  - action: LabelKeep
    regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
      label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
      controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
      commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
      storageclass|volumename|release_reason|instance|result|deployment_reason|\
      validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
      unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
      pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
      grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
      verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
      http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
      policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
      resource_request_operation|resource_kind|policy_change_type|event_type"

@pacho-rh pacho-rh changed the title o11y: Split MonitoringStack by environment Draft: o11y: Split MonitoringStack by environment Aug 7, 2025
@pacho-rh pacho-rh changed the title Draft: o11y: Split MonitoringStack by environment o11y: Split MonitoringStack by environment Aug 8, 2025
Contributor

@TominoFTW TominoFTW left a comment


/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label Aug 11, 2025

openshift-ci bot commented Aug 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ci-operator, pacho-rh, TominoFTW

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 2d1a5d9 into main Aug 11, 2025
9 checks passed