o11y: Split MonitoringStack by environment #7509


Merged
merged 5 commits into main from split-monitoringstack-env on Aug 11, 2025

Conversation

pacho-rh
Contributor

@pacho-rh pacho-rh commented Aug 6, 2025

This is an attempt to speed up the pipeline whenever changes are made to this MonitoringStack definition, which is especially important if we need to revert a breaking change to it. Additionally, this allows us to test changes to the definition in dev or stage before applying them to production.

@openshift-ci openshift-ci bot requested review from eisraeli and kubasikus August 6, 2025 18:35
@openshift-ci openshift-ci bot added the approved label Aug 6, 2025
@pacho-rh
Contributor Author

pacho-rh commented Aug 6, 2025

cc @gcpsoares @mftb

@TominoFTW
Contributor

/lgtm
/approve

@TominoFTW
Contributor

TominoFTW commented Aug 7, 2025

Approved. I am just wondering what the development directory is for. Is that for local cluster development?

With that, the only remaining question is whether the pipeline for adding something should simply be: update staging => check that it is how you desired (change/revert) => update production?

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

Approved. I am just wondering what the development directory is for. Is that for local cluster development?

Yup, that directory is for deployment to development clusters (e.g. a local dev cluster).

With that, the only remaining question is whether the pipeline for adding something should simply be: update staging => check that it is how you desired (change/revert) => update production?

Yup. Changes should be made to development in addition to staging. But that makes me realize we perhaps ought to have the development layer refer to the staging MonitoringStack definition to remove that disconnect.
I'll see if I can get the changes in here. If not, then in a follow-up PR.

EDIT: nvm, looks like we need to do that another way. I get this error if I try to reference the staging MonitoringStack definition directly:

$ kustomize build components/monitoring/prometheus/development/monitoringstack/
Error: accumulating resources: accumulation err='accumulating resources from '../../staging/base/monitoringstack/monitoringstack.yaml': security; file '.../infra-deployments/components/monitoring/prometheus/staging/base/monitoringstack/monitoringstack.yaml' is not in or below '.../infra-deployments/components/monitoring/prometheus/development/monitoringstack'': must build at directory: '.../infra-deployments/components/monitoring/prometheus/staging/base/monitoringstack/monitoringstack.yaml': file is not directory

@openshift-ci openshift-ci bot removed the lgtm label Aug 7, 2025

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

But that makes me realize we perhaps ought to have the development layer refer to the staging MonitoringStack definition to remove that disconnect.

I kept the old base MonitoringStack for staging and development to refer to, and instead gave production its own MonitoringStack definition.


@ci-operator
Contributor

Having only staging/development point to the base while production points to a definition outside the base effectively inverts the expectation that the base is production plus anything that shares its config. Instead of a follow-up PR, could we include the change in this PR as well (so both development and staging point to the same config in their own directory, and not to the base)?

@TominoFTW
Contributor

TominoFTW commented Aug 7, 2025

I am wondering if we could just create a "symlink" to the file in staging, perhaps with the path ../../../staging/base/monitoringstack/monitoringstack.yaml for development? I think something similar worked for the playbooks.

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

I am wondering if we could just create a "symlink" to the file in staging, perhaps with the path ../../../staging/base/monitoringstack/monitoringstack.yaml for development? I think something similar worked for the playbooks.

Using symlinks, I get the same error I posted previously.

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

could we include the change in this PR as well (so both development and staging point to the same config in their own directory, and not to the base)?

@ci-operator @TominoFTW - I'm revising this, but any ideas on how to go about it?

I cannot have development's kustomization.yaml refer directly to resources outside of its directory (e.g. ../../staging/base/monitoringstack/monitoringstack.yaml). I am, however, allowed to refer to the directory holding the staging base definitions (../../staging/base/monitoringstack).
My issue with that is that the directory includes more than just monitoringstack.yaml: it also contains some patching that would get applied to development, so further changes made to the staging base may inadvertently affect development.
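
To illustrate the constraint, here is a minimal sketch of what development's kustomization.yaml can and cannot reference (the exact contents are assumptions; only the two paths come from the discussion above):

# components/monitoring/prometheus/development/monitoringstack/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Rejected by Kustomize's default load restrictions: a plain file outside
  # this kustomization's directory.
  # - ../../staging/base/monitoringstack/monitoringstack.yaml
  # Accepted: a directory with its own kustomization.yaml, but this pulls in
  # everything that kustomization emits, including its patches.
  - ../../staging/base/monitoringstack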

An idea I have is to use the following structure, but what do you think?

.
...
├── development
    ...
│   └── monitoringstack
│       ├── cluster-type-patch.yaml
│       ├── kustomization.yaml (points to ../../staging/base/monitoringstack)
│       └── remote-write-env-details.yaml
└── staging
    ├── base
    ...
    │   ├── monitoringstack
    │   │   ├── kustomization.yaml
    │   │   └── monitoringstack.yaml
    ├── monitoringstack
    │   ├── cluster-type-patch.yaml
    │   ├── kustomization.yaml (points to ../base/monitoringstack)
    │   └── remote-write-env-details.yaml
    ├── stone-stage-p01
    ...

@TominoFTW
Contributor

Hmm, I am worried that it gets more and more confusing with that change 🤔

But I don't really see any other way around it, so if that is working I am okay with using it. 👍

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

Or better yet:

.
...
├── development
    ...
│   └── monitoringstack
│       ├── cluster-type-patch.yaml
│       ├── kustomization.yaml (points to ../../staging/base/monitoringstackbase)
│       └── remote-write-env-details.yaml
└── staging
    ├── base
    │   ├── kustomization.yaml
    │   ├── monitoringstack
    │   │   ├── cluster-type-patch.yaml
    │   │   ├── kustomization.yaml (points to ../monitoringstackbase)
    │   │   └── remote-write-env-details.yaml
    │   ├── monitoringstackbase
    │   │   ├── kustomization.yaml
    │   │   └── monitoringstack.yaml
    │   └── rhobs-secret-path.yaml
    ├── kflux-stg-es01
    ...

@TominoFTW
Contributor

TominoFTW commented Aug 7, 2025

Or better yet:

One more thought for this:

.
├── base/  # Production only
├── shared/  # Shared config for dev/staging
│   └── monitoringstack/
│       ├── monitoringstack.yaml
│       └── kustomization.yaml
├── development/ # Ref to shared
└── staging/ # Ref to shared

Could something like this be done?

@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

Could something like this be done?

I think this runs into the same problem: it could be assumed that shared also includes resources for production.

@TominoFTW
Contributor

Could something like this be done?

I think this runs into the same problem: it could be assumed that shared also includes resources for production.

then stg-dev-(shared)-monitoringstack 😆

I just want to completely drop the idea of nested folders when symlinks don't work 🤔


@pacho-rh
Contributor Author

pacho-rh commented Aug 7, 2025

then stg-dev-(shared)-monitoringstack 😆

I just want to completely drop the idea of nested folders when symlinks don't work 🤔

I agree with avoiding nested folders. I made a new folder to hold the resources common to staging and development and moved the MonitoringStack definition under there.
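
As a rough sketch of that arrangement (the stg-dev-common name comes from the diff reviewed below; the exact kustomization contents are assumptions), the staging and development overlays would each pull the shared definition in as a resource:

# components/monitoring/prometheus/development/monitoringstack/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # MonitoringStack shared by staging and development only; production keeps
  # its own copy under production/base/monitoringstack.
  - ../../stg-dev-common/monitoringstack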

@pacho-rh pacho-rh force-pushed the split-monitoringstack-env branch from 03412ad to b8046b8 Compare August 7, 2025 17:10
Contributor

github-actions bot commented Aug 7, 2025

Code Review by Gemini

The changes aim to split the MonitoringStack definition by environment to improve pipeline speed and enable environment-specific testing.

However, there are a couple of points to consider:

Duplication of MonitoringStack Definition

The MonitoringStack and ServiceMonitor definitions in components/monitoring/prometheus/production/base/monitoringstack/monitoringstack.yaml (the renamed original file) and components/monitoring/prometheus/stg-dev-common/monitoringstack/monitoringstack.yaml (the new file) are almost entirely identical. The only functional difference observed in this diff is the resources block within the MonitoringStack kind.

This duplication means that any future changes to the core MonitoringStack configuration (e.g., retention, logLevel, remoteWrite relabelings, or the ServiceMonitor scrape configurations) will need to be applied to both files. This increases maintenance overhead and the risk of inconsistencies between environments.

Suggestion:
Consider creating a truly common base/monitoringstack/monitoringstack.yaml that contains the shared definition of the MonitoringStack, ClusterRoleBinding, and ServiceMonitor resources. Then, use Kustomize overlays in components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml and components/monitoring/prometheus/stg-dev-common/monitoringstack/kustomization.yaml to apply environment-specific patches (e.g., resource limits, remoteWrite URLs/audiences if they differ, or any other future divergences).

This approach would reduce duplication and make it easier to manage common configurations while still allowing for environment-specific overrides.
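
A minimal sketch of such an overlay, assuming the shared definition lives under a common base/monitoringstack and resource limits are the only divergence (the file name resources-patch.yaml is illustrative):

# components/monitoring/prometheus/stg-dev-common/monitoringstack/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/monitoringstack   # shared MonitoringStack/ServiceMonitor definition
patches:
  - path: resources-patch.yaml   # environment-specific resource requests/limits
    target:
      kind: MonitoringStack
      name: appstudio-federate-ms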

Resource Allocation for Staging/Development

In components/monitoring/prometheus/stg-dev-common/monitoringstack/monitoringstack.yaml, the MonitoringStack is configured with memory: 16Gi for both requests and limits:

# components/monitoring/prometheus/stg-dev-common/monitoringstack/monitoringstack.yaml
    requests:
      cpu: 500m
      memory: 16Gi
    limits:
      memory: 16Gi

While this might be appropriate for production, 16Gi of memory seems quite high for a development or staging environment. If the goal is to speed up pipelines and enable testing, having lower resource requirements for non-production environments could be beneficial, especially if resources are constrained.

Suggestion:
Review if 16Gi memory is truly necessary for the stg-dev-common environment. Consider reducing this value to a more appropriate level for non-production usage, or making it configurable via an overlay if different staging/dev clusters have varying resource capacities.
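
If lower values are adopted, a patch along these lines could be layered onto the staging/development overlay (this assumes the resources block sits at spec.resources in the MonitoringStack CR; the 4Gi figure is a placeholder, not a recommendation):

# resources-patch.yaml (sketch)
- op: replace
  path: /spec/resources/requests/memory
  value: 4Gi
- op: replace
  path: /spec/resources/limits/memory
  value: 4Gi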


@pacho-rh pacho-rh force-pushed the split-monitoringstack-env branch from 6abeb36 to 9bf4a12 Compare August 7, 2025 18:40
Contributor

github-actions bot commented Aug 7, 2025

Code Review by Gemini

The changes aim to split the MonitoringStack definition by environment, which is a good approach for improving pipeline speed and enabling environment-specific testing. However, there are several critical issues in the implementation that need to be addressed.

Identified Issues and Suggested Changes:

1. File: components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

Issue 1: Dangerous default for writeRelabelConfigs regex.
The regex for LabelKeep is changed to an empty string (""). In Prometheus, LabelKeep with an empty regex means "keep no labels", effectively stripping all labels from metrics before remote writing. This will make the metrics largely unusable for analysis and alerting. If the intent is to make this configuration environment-specific, this block should be removed from the base entirely, and then added by environment-specific overlays.

Suggested Change:
Remove the writeRelabelConfigs block entirely from the base MonitoringStack definition.

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -34,19 +34,11 @@
           audience: # added by overlays
         tokenUrl: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
       url: # added by overlays
-      writeRelabelConfigs:
-      - action: LabelKeep
-        regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
-          label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
-          controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
-          commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
-          storageclass|volumename|release_reason|instance|result|deployment_reason|\
-          validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
-          unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
-          pipeline|pipelinerun|schedule|check|grpc_service|grpc_code|\
-          grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
-          verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
-          http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
-          policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
-          resource_request_operation|resource_kind|policy_change_type|event_type"
-          
-
+      # writeRelabelConfigs: This block should be added by environment-specific overlays
 ---
 # Grant permission to Federate In-Cluster Prometheus
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
   name: appstudio-federate-ms-view
   labels:

Issue 2: Dangerous default for ServiceMonitor match[] parameters.
The match[] array under endpoints/0/params is changed to an empty array ([]). In Prometheus federation, an empty match[] parameter means "match all series". This will cause the ServiceMonitor to scrape and remote write all metrics from the in-cluster Prometheus, leading to excessive data volume, high costs, and potential performance issues for both the local Prometheus and the remote endpoint. Similar to writeRelabelConfigs, this should be removed from the base and added by environment-specific overlays.

Suggested Change:
Remove the match[] array entirely from the base ServiceMonitor definition.

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -85,108 +85,10 @@
       app.kubernetes.io/managed-by: observability-operator
       app.kubernetes.io/name: appstudio-federate-ms-prometheus
   endpoints:
   - params:
-      'match[]': []  # scrape only required metrics from in-cluster prometheus
     relabelings:
     # override the target's address by the prometheus-k8s service name.
     - action: replace
       targetLabel: __address__
       replacement: prometheus-k8s.openshift-monitoring.svc:9091
     # remove the default target labels as they aren't relevant in case of federation.
     - action: labeldrop
       regex: pod|namespace|service|endpoint|container
     # 30s interval creates 4 scrapes per minute
     # prometheus-k8s.svc x 2 ms-prometheus x (60s/ 30s) = 4

2. File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

Issue: Incorrect patch operation (op: replace).
The patch operation for cluster-type-patch.yaml was changed from op: add to op: replace at path: /spec/endpoints/0/relabelings/0. The ServiceMonitor in the base already has a relabeling at index 0 (for __address__). Changing to op: replace will overwrite this essential relabeling instead of adding a new one. The original op: add at index 0 would insert the new relabeling and shift existing ones, which is a valid way to add.

Suggested Change:
Revert op: replace back to op: add.

--- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
@@ -1,6 +1,6 @@
 ---
-- op: replace
+- op: add
   path: /spec/endpoints/0/relabelings/0
   value:
     targetLabel: source_environment
     replacement: development-cluster

3. File: components/monitoring/prometheus/development/monitoringstack/kustomization.yaml

Issue 1: Incorrect base resource dependency.
The development kustomization now uses ../../staging/base/monitoringstack as its base. This breaks the logical hierarchy where base should be the most generic definition, and environments (dev, staging, prod) build upon it.

Suggested Change:
Change the base resource back to ../../base/monitoringstack.

--- a/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
@@ -1,7 +1,7 @@
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 resources:
   - ../../base/observability-operator
-  - ../../staging/base/monitoringstack
+  - ../../base/monitoringstack
 patches:
   - path: cluster-type-patch.yaml
     target:

Issue 2: Missing environment-specific patches.
The development kustomization is missing the patches for endpoints-params-match.yaml and writeRelabelConfigs-regex.yaml. This means it will inherit the problematic empty match[] (match all) and empty regex (keep no labels) from the modified base file, leading to incorrect metric scraping and label filtering. These patches should be applied to the development environment as well. Assuming development should use the same configurations as staging for these for now.

Suggested Change:
Add the endpoints-params-match.yaml and writeRelabelConfigs-regex.yaml patches, referencing the staging versions for consistency.

--- a/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
@@ -10,3 +10,11 @@
       kind: ServiceMonitor
   - path: remote-write-env-details.yaml
     target:
       name: appstudio-federate-ms
       kind: MonitoringStack
+  - path: ../../staging/base/monitoringstack/endpoints-params-match.yaml
+    target:
+      name: appstudio-federate-smon
+      kind: ServiceMonitor
+  - path: ../../staging/base/monitoringstack/writeRelabelConfigs-regex.yaml
+    target:
+      name: appstudio-federate-ms
+      kind: MonitoringStack

4. Files: components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml and components/monitoring/prometheus/staging/base/monitoringstack/kustomization.yaml

Issue: Incorrect patch operation (op: replace).
If match[] and writeRelabelConfigs are removed from the base monitoringstack.yaml (as suggested in point 1), then the patches that add these fields in the environment-specific kustomizations should use op: add instead of op: replace.

Suggested Change (for both production and staging kustomizations):
Change op: replace to op: add for the endpoints-params-match.yaml and writeRelabelConfigs-regex.yaml patches.

--- a/components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml
@@ -5,18 +5,22 @@
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-smon
       kind: ServiceMonitor
-  - path: endpoints-params-match.yaml
+  - op: add # Change from 'replace' to 'add'
+    path: endpoints-params-match.yaml
     target:
       name: appstudio-federate-smon
       kind: ServiceMonitor
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-uwm-smon
       kind: ServiceMonitor
   - path: remote-write-env-details.yaml
     target:
       name: appstudio-federate-ms
       kind: MonitoringStack
-  - path: writeRelabelConfigs-regex.yaml
+  - op: add # Change from 'replace' to 'add'
+    path: writeRelabelConfigs-regex.yaml
     target:
       name: appstudio-federate-ms
       kind: MonitoringStack

This is an attempt to speed up the pipeline whenever changes are made
to this MonitoringStack definition. This is especially important if we
need to revert a breaking change to this definition.
Additionally, this allows us to test changes to this definition in dev
or stage first before applying them to production.
@pacho-rh pacho-rh force-pushed the split-monitoringstack-env branch from 9bf4a12 to 6c8d608 Compare August 7, 2025 19:18
Contributor

github-actions bot commented Aug 7, 2025

Code Review by Gemini

The changes aim to split the MonitoringStack definition by environment to improve pipeline speed and enable environment-specific testing. While the overall approach of moving environment-specific configurations to overlays is good, there are a couple of issues that need to be addressed.

Issues and Suggestions

1. Bug: Incorrect Kustomize operation in cluster-type-patch.yaml for development

The patch in components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml changes the operation from add to replace for a ServiceMonitor relabeling rule. This is problematic because the base ServiceMonitor already has a critical relabeling rule at index 0 (for __address__). Changing op: replace will overwrite this existing rule instead of adding the source_environment label as intended. The original op: add was correct for inserting the new rule.

File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

--- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
@@ -1,6 +1,6 @@
 ---
-- op: replace
+- op: add
   path: /spec/endpoints/0/relabelings/0
   value:
     targetLabel: source_environment
     replacement: development-cluster

2. Architectural Inconsistency: Development environment's Kustomize base

The development environment's kustomization.yaml now bases its MonitoringStack resource on ../../staging/base/monitoringstack. This creates an inconsistent hierarchy compared to staging and production, which both base directly on ../../../base/monitoringstack.

This setup means that development will implicitly inherit the endpoints-params and writeRelabelConfigs from staging without explicitly defining them. This contradicts the goal of "testing changes to this definition in dev or stage first" if dev and stage are effectively identical in terms of these configurations and dev cannot easily diverge.

For better clarity, independence, and adherence to Kustomize best practices for environment overlays, development should also base directly on the common ../../../base/monitoringstack and apply its own specific patches, similar to staging and production.

File: components/monitoring/prometheus/development/monitoringstack/kustomization.yaml

--- a/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
+++ b/components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
@@ -3,12 +3,20 @@
 resources:
   - ../../base/observability-operator
-  - ../../staging/base/monitoringstack
+  - ../../../base/monitoringstack # Base off the common base
 patches:
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-smon
       kind: ServiceMonitor
   - path: cluster-type-patch.yaml
     target:
       name: appstudio-federate-uwm-smon
       kind: ServiceMonitor
   - path: remote-write-env-details.yaml
+    target:
+      name: appstudio-federate-ms
+      kind: MonitoringStack
+  - path: endpoints-params.yaml # Add dev-specific endpoints-params
+    target:
+      name: appstudio-federate-smon
+      kind: ServiceMonitor
+  - path: writeRelabelConfigs.yaml # Add dev-specific writeRelabelConfigs
+    target:
+      name: appstudio-federate-ms
+      kind: MonitoringStack

New File: components/monitoring/prometheus/development/monitoringstack/endpoints-params.yaml
(Content should be identical to components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml initially, allowing for future divergence.)

---
- op: replace
  path: /spec/endpoints/0/params
  value:
    'match[]':  # scrape only required metrics from in-cluster prometheus
    - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
    - '{__name__="pipeline_service_schedule_overhead_percentage_count"}'
    - '{__name__="pipeline_service_execution_overhead_percentage_sum"}'
    - '{__name__="pipeline_service_execution_overhead_percentage_count"}'
    - '{__name__="pipelinerun_duration_scheduled_seconds_sum"}'
    - '{__name__="pipelinerun_duration_scheduled_seconds_count"}'
    - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_sum"}'
    - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_count"}'
    - '{__name__="pipelinerun_kickoff_not_attempted_count"}'
    - '{__name__="pending_resolutionrequest_count"}'
    - '{__name__="taskrun_pod_create_not_attempted_or_pending_count"}'
    - '{__name__="tekton_pipelines_controller_pipelinerun_count"}'
    - '{__name__="tekton_pipelines_controller_running_pipelineruns_count"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota_count"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node_count"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota"}'
    - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node"}'
    - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_sum"}'
    - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_count"}'
    - '{__name__="watcher_workqueue_depth"}'
    - '{__name__="watcher_client_latency_bucket"}'
    - '{__name__="pac_watcher_work_queue_depth"}'
    - '{__name__="pac_watcher_client_latency_bucket"}'
    - '{__name__="grpc_server_handled_total", namespace=~"tekton-results|openshift-pipelines"}'
    - '{__name__="grpc_server_handled_total", namespace=~"openshift-etcd"}'
    - '{__name__="grpc_server_handling_seconds_bucket", namespace=~"tekton-results|openshift-pipelines"}'
    - '{__name__="grpc_server_handling_seconds_bucket", namespace="openshift-etcd"}'
    - '{__name__="grpc_server_msg_received_total", namespace="openshift-etcd"}'
    - '{__name__="controller_runtime_reconcile_errors_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="controller_runtime_reconcile_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"controller.tektonresolverframework.bundleresolver..*"}'
    - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"tekton-pipelines-controller.github.com.tektoncd.pipeline.pkg.reconciler..*"}'
    - '{__name__="kube_pod_status_unschedulable", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_pod_container_status_restarts_total", namespace=~"openshift-pipelines|release-service"}'
    - '{__name__="kube_pod_container_status_waiting_reason", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_pod_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_pod_container_resource_limits", namespace="release-service"}'
    - '{__name__="kube_pod_container_status_terminated_reason", namespace="release-service"}'
    - '{__name__="kube_pod_container_status_last_terminated_reason", namespace="release-service"}'
    - '{__name__="kube_pod_container_status_ready", namespace="release-service"}'
    - '{__name__="kube_persistentvolume_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_resourcequota", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    - '{__name__="kube_statefulset_status_replicas_ready", namespace="gitops-service-argocd"}'
    - '{__name__="kube_statefulset_replicas", namespace="gitops-service-argocd"}'
    - '{__name__="openshift_route_status", namespace="gitops-service-argocd"}'

    - '{__name__="kube_deployment_status_replicas_ready", namespace="gitops-service-argocd"}'
    - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'

    # Namespace (expression):  "build-service"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="build-service"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="build-service"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="build-service"}'

    # Namespace (expression):  "integration-service"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="integration-service"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="integration-service"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="integration-service"}'

    # Namespace (expression):  "konflux-ui"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-ui"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-ui"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="konflux-ui"}'
    - '{__name__="kube_running_pods_ready", namespace="konflux-ui"}'
    - '{__name__="kube_endpoint_address", namespace="konflux-ui"}'
    - '{__name__="kube_pod_container_status_restarts_total", namespace="konflux-ui"}'

    # Namespace (expression):  "mintmaker"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="mintmaker"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="mintmaker"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="mintmaker"}'
    - '{__name__="cluster_ram_requested_perc"}'
    - '{__name__="node_memory_pressured_perc"}'
    - '{__name__="redis_node_memory_usage_perc"}'

    # Namespace (expression):  ~".*monitoring.*"
    - '{__name__="kube_deployment_status_replicas_ready", namespace=~".*monitoring.*"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace=~".*monitoring.*"}'
    - '{__name__="kube_deployment_spec_replicas", namespace=~".*monitoring.*"}'

    # Namespace (expression):  "multi-platform-controller"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="multi-platform-controller"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="multi-platform-controller"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="multi-platform-controller"}'

    # Namespace (expression):  "namespace-lister"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="namespace-lister"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="namespace-lister"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="namespace-lister"}'

    # Namespace (expression):  "openshift-pipelines"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-pipelines"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-pipelines"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-pipelines"}'

    # Namespace (expression):  "product-kubearchive"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="product-kubearchive"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="product-kubearchive"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="product-kubearchive"}'

    # Namespace (expression):  "project-controller"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="project-controller"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="project-controller"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="project-controller"}'

    # Namespace (expression):  "release-service"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="release-service"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="release-service"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="release-service"}'

    # Namespace (expression):  ~"smee.*"
    - '{__name__="kube_deployment_status_replicas_ready", namespace=~"smee.*"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace=~"smee.*"}'
    - '{__name__="kube_deployment_spec_replicas", namespace=~"smee.*"}'

    # Namespace (expression):  "openshift-apiserver"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-apiserver"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-apiserver"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-apiserver"}'

    # Namespace (expression):  "openshift-oauth-apiserver"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-oauth-apiserver"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-oauth-apiserver"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-oauth-apiserver"}'

    # Namespace (expression):  "konflux-kyverno"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-kyverno"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-kyverno"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="konflux-kyverno"}'

    # Namespace (expression):  "openshift-kube-apiserver"
    - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-kube-apiserver"}'
    - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-kube-apiserver"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="openshift-kube-apiserver"}'

    # Namespace (expression):  "konflux-user-support"
    - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-user-support"}'
    - '{__name__="kube_deployment_spec_replicas", namespace="konflux-user-support"}'

    - '{__name__="argocd_app_reconcile_bucket", namespace="gitops-service-argocd"}'
    - '{__name__="argocd_app_info", namespace="gitops-service-argocd"}'
    - '{__name__="container_cpu_usage_seconds_total", namespace="release-service"}'
    - '{__name__="container_cpu_usage_seconds_total", namespace="openshift-etcd"}'
    - '{__name__="container_memory_usage_bytes", namespace="release-service"}'
    - '{__name__="container_memory_usage_bytes", namespace="openshift-etcd"}'
    - '{__name__="etcd_disk_wal_fsync_duration_seconds_bucket"}'
    - '{__name__="etcd_disk_backend_commit_duration_seconds_bucket"}'
    - '{__name__="etcd_server_proposals_failed_total"}'
    - '{__name__="etcd_server_leader_changes_seen_total", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_has_leader", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_is_leader", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_id", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_quota_backend_bytes", namespace="openshift-etcd"}'
    - '{__name__="etcd_mvcc_db_total_size_in_bytes", namespace="openshift-etcd"}'
    - '{__name__="etcd_server_received_total", namespace="openshift-etcd"}'
    - '{__name__="etcd_network_active_peers", namespace="openshift-etcd"}'
    - '{__name__="etcd_network_peer_round_trip_time_seconds_bucket"}'
    - '{__name__="etcd_disk_defrag_inflight"}'
    - '{__name__="kube_job_spec_completions"}'
    - '{__name__="kube_job_status_succeeded"}'
    - '{__name__="kube_job_status_failed"}'
    - '{__name__="node_cpu_seconds_total", mode="idle"}'
    - '{__name__="node_memory_MemTotal_bytes"}'
    - '{__name__="node_memory_MemAvailable_bytes"}'
    - '{__name__="platform:hypershift_hostedclusters:max"}'
    - '{__name__="kube_node_role"}'
    - '{__name__="etcd_shield_trigger"}'
    - '{__name__="etcd_shield_alert_triggered"}'
    - '{__name__="apiserver_admission_webhook_rejection_count", name="vpipelineruns.konflux-ci.dev"}'
    - '{__name__="apiserver_watch_events_total"}'
    - '{__name__="apiserver_storage_objects"}'
    - '{__name__="apiserver_current_inflight_requests"}'
    - '{__name__="resource_verb:apiserver_request_total:rate5m"}'
    - '{__name__="code:apiserver_request_total:rate5m"}'
    - '{__name__="instance:apiserver_request_total:rate5m"}'
    - '{__name__="prometheus_ready"}'
    - '{__name__="process_cpu_seconds_total", job="apiserver"}'
    - '{__name__="namespace:container_memory_usage_bytes:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace:container_cpu_usage:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="node_namespace_pod:kube_pod_info:", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="kube_node_status_allocatable", resource=~"cpu|memory"}'
    - '{__name__="kube_node_status_condition", condition="MemoryPressure", status="true"}'
    - '{__name__="namespace_memory:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace_cpu:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace_memory:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    - '{__name__="namespace_cpu:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'

New File: components/monitoring/prometheus/development/monitoringstack/writeRelabelConfigs.yaml
(Content should be identical to components/monitoring/prometheus/production/base/monitoringstack/writeRelabelConfigs.yaml initially, allowing for future divergence.)

---
- op: replace
  path: /spec/prometheusConfig/remoteWrite/0/writeRelabelConfigs
  value:
  - action: LabelKeep
    regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
      label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
      controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
      commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
      storageclass|volumename|release_reason|instance|result|deployment_reason|\
      validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
      unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
      pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
      grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
      verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
      http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
      policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
      resource_request_operation|resource_kind|policy_change_type|event_type"

@pacho-rh pacho-rh changed the title o11y: Split MonitoringStack by environment Draft: o11y: Split MonitoringStack by environment Aug 7, 2025
@pacho-rh pacho-rh changed the title Draft: o11y: Split MonitoringStack by environment o11y: Split MonitoringStack by environment Aug 8, 2025
Contributor

@TominoFTW TominoFTW left a comment


/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label Aug 11, 2025

openshift-ci bot commented Aug 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ci-operator, pacho-rh, TominoFTW

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 2d1a5d9 into main Aug 11, 2025
9 checks passed