MON-4517: Introduce missing minimal monitors with "full" CP metrics #2814
rexagod wants to merge 7 commits into openshift:main
Conversation
@rexagod: This pull request references MON-4517, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rexagod. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Force-pushed from 22a4323 to d65d951 (compare).
/cc @simonpasquier
IMHO it should be the other way around: unless defined more precisely, minimal should be the same as full (because of dashboards and alerting rules).
This does not cover dynamically generating the whitelist and updating the monitors from it [1], nor the revision of the `minimal` monitors [2]. [1]: openshift#2694 [2]: openshift#2814
```diff
-local run(sm, metrics, profile) =
+local run(smWithDrop, metrics, profile, shouldRemoveDrop) =
+  local sm = if shouldRemoveDrop then removeDrop(smWithDrop) else smWithDrop;
```
```shell
~/bases/work/cluster-monitoring-operator/assets MON-4517 ≡
❯ for i in alertmanager cluster-monitoring-operator openshift-state-metrics prometheus-k8s telemeter-client; do diff $i/minimal-service-monitor.yaml $i/service-monitor.yaml; done
11,12c11
< monitoring.openshift.io/collection-profile: minimal
< name: alertmanager-main-minimal
---
> name: alertmanager-main
8,9c8
< monitoring.openshift.io/collection-profile: minimal
< name: cluster-monitoring-operator-minimal
---
> name: cluster-monitoring-operator
8,9c8
< monitoring.openshift.io/collection-profile: minimal
< name: openshift-state-metrics-minimal
---
> name: openshift-state-metrics
11,12c11
< monitoring.openshift.io/collection-profile: minimal
< name: prometheus-k8s-minimal
---
> name: prometheus-k8s
8,9c8
< monitoring.openshift.io/collection-profile: minimal
< name: telemeter-client-minimal
---
> name: telemeter-client
```
was this a diff of an output for the libsonnet generation? xD
Ah, sorry. I pasted this here to show that nothing changed in the newly generated minimal monitors relative to their full counterparts except the names and the collection-profile label.
```diff
@@ -6,7 +6,8 @@
 // 2. Add the profile prefix to the ServiceMonitor name
 // 3. Add the profile label "monitoring.openshift.io/collection-profile: <profile>"
 // 4. Add a metricRelabelings with action keep and regex equal to metrics
```
I'd replace this with something like "// 4. If metrics is non-null, add a metricRelabelings with action keep and regex equal to metrics", since the logic has changed.
Force-pushed from 1182437 to 361d6a7 (compare).
```diff
 },
-  minimal(sm, metrics): minimal(removeDrop(sm), metrics),
+  minimal(sm, metrics, removeDrop=true): run(sm, metrics, profiles[0], removeDrop),
```
(nit) I feel that adding more arguments makes it harder to understand what happens at the caller site. Can we have a more explicit API?

```jsonnet
// Returns a copy of the input service monitor with the correct annotation.
serviceMonitorForMinimalProfile(sm): ...
serviceMonitorForTelemetryProfile(sm): ...
// Returns a copy of the input service monitor with the list of metrics.
keepOnlyMetrics(sm, metrics): ...
```

so we would call it like:

```jsonnet
utils.serviceMonitorForTelemetryProfile(
  utils.keepOnlyMetrics(sm, ["foo", "bar"]),
)
```

or

```jsonnet
// just use the full service monitor (no customization for minimal).
utils.serviceMonitorForMinimalProfile(sm)
```
pkg/tasks/alertmanager.go
Outdated
```go
if err != nil {
	return fmt.Errorf("reconciling Alertmanager ServiceMonitor failed: %w", err)
for _, smam := range smams {
	err = t.client.CreateOrUpdateServiceMonitor(ctx, smam)
```
should we have a CreateOrUpdateServiceMonitors() function to reduce duplication?
```yaml
interval: 30s
metricRelabelings:
- action: keep
  regex: (ALERTS|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|scrape_samples_post_metric_relabeling|scrape_series_added|up)
```
(nit) up and scrape_* metrics aren't coming from the scraped target.
Bases minimal monitors on the same specs as the "full" ones. Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Walkthrough: Adds telemetry and minimal ServiceMonitor variants and a profile-driven generator; updates Jsonnet components to export the new monitors; extends the manifests Factory, types, and tests to produce grouped full/minimal/telemetry outputs; and changes client/tasks to create, update, and delete ServiceMonitors in batch.
Sequence diagram:

```mermaid
sequenceDiagram
    participant Operator as ClusterMonitoring Operator
    participant Factory as Manifests Factory
    participant Generator as Jsonnet Profile Generator
    participant Client as API Client
    Operator->>Factory: Request ServiceMonitors(component)
    Factory->>Factory: Load base ServiceMonitor asset
    Factory->>Generator: keepOnlyMetrics(base, metricList)
    Generator-->>Factory: filtered ServiceMonitor
    Factory->>Generator: serviceMonitorForMinimalProfile(filtered)
    Generator-->>Factory: minimal variant
    Factory->>Generator: serviceMonitorForTelemetryProfile(filtered)
    Generator-->>Factory: telemetry variant
    Factory-->>Operator: Return [full, minimal, telemetry]
    Operator->>Client: CreateOrUpdateServiceMonitors(ctx, sms)
    Client->>Client: Iterate and create/update each ServiceMonitor
    Client-->>Operator: Success / Error
```
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~45 minutes. Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning).
Warning: some tools failed to run. 🔧 golangci-lint (2.5.0) errored while loading packages: `go: inconsistent vendoring` — several modules are required in go.mod but not marked as explicit in vendor/modules.txt (and github.com/onsi/ginkgo/v2 is replaced in go.mod but not marked as replaced in vendor/modules.txt). To sync the vendor directory, run `go mod vendor`.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
jsonnet/utils/generate-service-monitors.libsonnet (1)
9-12: ⚠️ Potential issue | 🟠 Major — Filter incorrectly removes metricRelabelings without an explicit `action` field.

The condition `std.objectHas(x, 'action') && x.action != 'drop'` will exclude any metricRelabeling that doesn't have an `action` field. In Prometheus, when `action` is omitted, it defaults to `replace`. This means legitimate relabelings (like `labeldrop`, or an implicit `replace`) without an explicit `action` key would be incorrectly removed.

Proposed fix:

```diff
- metricRelabelings: [x for x in e.metricRelabelings if std.objectHas(x, 'action') && x.action != 'drop'],
+ metricRelabelings: [x for x in e.metricRelabelings if !std.objectHas(x, 'action') || x.action != 'drop'],
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jsonnet/utils/generate-service-monitors.libsonnet` around lines 9 - 12, The current filter "[x for x in e.metricRelabelings if std.objectHas(x, 'action') && x.action != 'drop']" drops relabelings that omit the action field (which should default to "replace"); change the predicate to keep entries that either do not have an action or whose action is not "drop" (e.g. use "if ! (std.objectHas(x, 'action') && x.action == 'drop')" or "if !std.objectHas(x,'action') || x.action != 'drop'") so metricRelabelings without an explicit action are retained; update the comprehension in generate-service-monitors.libsonnet where metricRelabelings is constructed.
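The corrected predicate is easy to sanity-check outside Jsonnet. This Go sketch (illustrative only, not project code) mirrors the comprehension and shows that an entry with no explicit action — an implicit `replace` — survives the filter while explicit `drop` entries are removed:

```go
package main

import "fmt"

// relabeling is a minimal stand-in for a metricRelabelings entry.
// An empty Action models an omitted action field, which Prometheus
// treats as the default action "replace".
type relabeling struct {
	Action string
	Regex  string
}

// stripDrops mirrors the corrected Jsonnet predicate
// !std.objectHas(x, 'action') || x.action != 'drop':
// keep every entry unless it explicitly sets action: drop.
func stripDrops(in []relabeling) []relabeling {
	var out []relabeling
	for _, r := range in {
		// In Go the missing key collapses to "", so the two-clause
		// predicate reduces to a single comparison.
		if r.Action != "drop" {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	in := []relabeling{
		{Action: "drop", Regex: "go_.*"}, // explicitly dropped
		{Action: "keep", Regex: "up"},    // kept
		{Regex: "(.+)"},                  // implicit replace: must survive
	}
	for _, r := range stripDrops(in) {
		fmt.Printf("action=%q regex=%q\n", r.Action, r.Regex)
	}
	// prints:
	// action="keep" regex="up"
	// action="" regex="(.+)"
}
```

With the original predicate, the third entry would have been discarded even though it performs no dropping at all.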
🧹 Nitpick comments (1)
pkg/tasks/clustermonitoringoperator.go (1)
143-151: Error messages should reflect plural ServiceMonitors. The method calls have been updated to handle multiple ServiceMonitors, but the error messages on lines 145 and 150 still use singular "ServiceMonitor". For consistency and clearer debugging:

💡 Suggested fix

```diff
 smscmo, err := t.factory.ClusterMonitoringOperatorServiceMonitors()
 if err != nil {
-	return fmt.Errorf("initializing Cluster Monitoring Operator ServiceMonitor failed: %w", err)
+	return fmt.Errorf("initializing Cluster Monitoring Operator ServiceMonitors failed: %w", err)
 }
 err = t.client.CreateOrUpdateServiceMonitors(ctx, smscmo)
 if err != nil {
-	return fmt.Errorf("reconciling Cluster Monitoring Operator ServiceMonitor failed: %w", err)
+	return fmt.Errorf("reconciling Cluster Monitoring Operator ServiceMonitors failed: %w", err)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tasks/clustermonitoringoperator.go` around lines 143 - 151, The error strings for ClusterMonitoringOperatorServiceMonitors are singular but the code works with multiple items; update the two fmt.Errorf messages that reference Cluster Monitoring Operator ServiceMonitor to use the plural "ServiceMonitors" instead. Specifically, adjust the errors returned in the block where you call t.factory.ClusterMonitoringOperatorServiceMonitors() and where you call t.client.CreateOrUpdateServiceMonitors(ctx, smscmo) so both messages read "... Cluster Monitoring Operator ServiceMonitors failed: %w".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/cluster-monitoring-operator/telemetry-service-monitor.yaml`:
- Around line 12-24: The endpoint block under endpoints (the map containing
bearerTokenFile, metricRelabelings, port: https, scheme: https and tlsConfig) is
missing an explicit interval; add an interval field (e.g., interval: 30s or
another project-consistent value) to this endpoint to ensure it doesn't rely on
Prometheus defaults and matches the other ServiceMonitor entries, keeping the
scrapeClass tls-client-certificate-auth and existing metricRelabelings intact.
In `@assets/control-plane/telemetry-service-monitor-kubelet.yaml`:
- Around line 60-110: The /metrics/probes scrape (path: /metrics/probes) and the
CRI-O scrape (where replacement: crio and targetLabel: job) are using the
kubelet/cAdvisor keep regex in metricRelabelings which does not include prober_*
or crio_* families, so they will yield zero metrics; either remove those two
scrape blocks entirely or update the metricRelabelings regex (the regex under
metricRelabelings / sourceLabels: __name__) to include prober_probe_total and
the crio_* metric family names (e.g., add prober_probe_total|crio_.*) so the
intended probe and CRI-O metrics are allowed through.
In `@pkg/client/client.go`:
- Around line 1691-1698: The helper CreateOrUpdateServiceMonitors currently
returns raw errors and hides which ServiceMonitor failed; update it to wrap the
error with the failing ServiceMonitor identity (at minimum Namespace/Name) when
CreateOrUpdateServiceMonitor(ctx, sm) returns non-nil so callers can see which
monitor failed; use fmt.Errorf or errors.Wrapf to produce a message like
"reconciling ServiceMonitor <namespace>/<name>: %w" referencing the sm variable
inside CreateOrUpdateServiceMonitors.
In `@pkg/tasks/alertmanager.go`:
- Around line 226-231: The teardown currently deletes only the singular
AlertmanagerServiceMonitor while the setup creates a full set via
AlertmanagerServiceMonitors(), which leaves orphaned telemetry/minimal monitors;
update destroy() to obtain the full set by calling
t.factory.AlertmanagerServiceMonitors(), handle the error, and pass that set to
the plural deletion method (e.g., t.client.DeleteServiceMonitors(ctx, smams))
instead of deleting only AlertmanagerServiceMonitor(); ensure any existing
singular Delete call is removed or replaced and error handling/logging matches
the create path.
In `@pkg/tasks/telemeter.go`:
- Around line 190-195: The destroy() path currently deletes only the singular
TelemeterClientServiceMonitor; update destroy() to remove all monitors returned
by t.factory.TelemeterClientServiceMonitors() (the same slice used by
CreateOrUpdateServiceMonitors). Call the client-side removal for the whole slice
(e.g., use t.client.DeleteServiceMonitors(ctx, sms) if such a bulk delete
exists) or iterate over the slice and call the singular delete method for each
monitor, propagating and handling errors accordingly so all ServiceMonitors
created by TelemeterClientServiceMonitors() are cleaned up.
---
Outside diff comments:
In `@jsonnet/utils/generate-service-monitors.libsonnet`:
- Around line 9-12: The current filter "[x for x in e.metricRelabelings if
std.objectHas(x, 'action') && x.action != 'drop']" drops relabelings that omit
the action field (which should default to "replace"); change the predicate to
keep entries that either do not have an action or whose action is not "drop"
(e.g. use "if ! (std.objectHas(x, 'action') && x.action == 'drop')" or "if
!std.objectHas(x,'action') || x.action != 'drop'") so metricRelabelings without
an explicit action are retained; update the comprehension in
generate-service-monitors.libsonnet where metricRelabelings is constructed.
---
Nitpick comments:
In `@pkg/tasks/clustermonitoringoperator.go`:
- Around line 143-151: The error strings for
ClusterMonitoringOperatorServiceMonitors are singular but the code works with
multiple items; update the two fmt.Errorf messages that reference Cluster
Monitoring Operator ServiceMonitor to use the plural "ServiceMonitors" instead.
Specifically, adjust the errors returned in the block where you call
t.factory.ClusterMonitoringOperatorServiceMonitors() and where you call
t.client.CreateOrUpdateServiceMonitors(ctx, smscmo) so both messages read "...
Cluster Monitoring Operator ServiceMonitors failed: %w".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e2d6f304-0fa7-472f-a841-32380cbd9161
📒 Files selected for processing (35)
- assets/alertmanager/minimal-service-monitor.yaml
- assets/alertmanager/telemetry-service-monitor.yaml
- assets/cluster-monitoring-operator/minimal-service-monitor.yaml
- assets/cluster-monitoring-operator/telemetry-service-monitor.yaml
- assets/control-plane/telemetry-service-monitor-kubelet.yaml
- assets/kube-state-metrics/telemetry-service-monitor.yaml
- assets/node-exporter/telemetry-service-monitor.yaml
- assets/openshift-state-metrics/minimal-service-monitor.yaml
- assets/openshift-state-metrics/telemetry-service-monitor.yaml
- assets/prometheus-k8s/minimal-service-monitor.yaml
- assets/prometheus-k8s/telemetry-service-monitor.yaml
- assets/telemeter-client/minimal-service-monitor.yaml
- assets/telemeter-client/telemetry-service-monitor.yaml
- jsonnet/components/alertmanager.libsonnet
- jsonnet/components/cluster-monitoring-operator.libsonnet
- jsonnet/components/control-plane.libsonnet
- jsonnet/components/kube-state-metrics.libsonnet
- jsonnet/components/node-exporter.libsonnet
- jsonnet/components/openshift-state-metrics.libsonnet
- jsonnet/components/prometheus.libsonnet
- jsonnet/components/telemeter-client.libsonnet
- jsonnet/utils/configure-authentication-for-monitors.libsonnet
- jsonnet/utils/generate-service-monitors.libsonnet
- pkg/client/client.go
- pkg/manifests/manifests.go
- pkg/manifests/manifests_test.go
- pkg/manifests/types.go
- pkg/tasks/alertmanager.go
- pkg/tasks/clustermonitoringoperator.go
- pkg/tasks/controlplane.go
- pkg/tasks/kubestatemetrics.go
- pkg/tasks/nodeexporter.go
- pkg/tasks/openshiftstatemetrics.go
- pkg/tasks/prometheus.go
- pkg/tasks/telemeter.go
```yaml
endpoints:
- bearerTokenFile: ""
  metricRelabelings:
  - action: keep
    regex: (cluster_monitoring_operator_collection_profile)
    sourceLabels:
    - __name__
  port: https
  scheme: https
  tlsConfig:
    insecureSkipVerify: false
    serverName: cluster-monitoring-operator.openshift-monitoring.svc
scrapeClass: tls-client-certificate-auth
```
Missing interval field in endpoint configuration.
Other telemetry ServiceMonitors in this PR explicitly specify a scrape interval (e.g., 30s for telemeter-client, 2m for openshift-state-metrics). This endpoint lacks an interval field and will use the Prometheus default. Consider adding an explicit interval for consistency.
Proposed fix:

```diff
 spec:
   endpoints:
   - bearerTokenFile: ""
+    interval: 30s
     metricRelabelings:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
endpoints:
- bearerTokenFile: ""
  interval: 30s
  metricRelabelings:
  - action: keep
    regex: (cluster_monitoring_operator_collection_profile)
    sourceLabels:
    - __name__
  port: https
  scheme: https
  tlsConfig:
    insecureSkipVerify: false
    serverName: cluster-monitoring-operator.openshift-monitoring.svc
scrapeClass: tls-client-certificate-auth
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/cluster-monitoring-operator/telemetry-service-monitor.yaml` around
lines 12 - 24, The endpoint block under endpoints (the map containing
bearerTokenFile, metricRelabelings, port: https, scheme: https and tlsConfig) is
missing an explicit interval; add an interval field (e.g., interval: 30s or
another project-consistent value) to this endpoint to ensure it doesn't rely on
Prometheus defaults and matches the other ServiceMonitor entries, keeping the
scrapeClass tls-client-certificate-auth and existing metricRelabelings intact.
It'd make sense doing this in a separate PR, as this is pulled from CMO's default SM, which is not affected by this patch.
@rexagod, that makes sense — since the telemetry ServiceMonitor inherits from the default CMO ServiceMonitor which doesn't set an explicit interval, aligning them in a follow-up PR that touches the root definition would be the cleaner approach. Noted!
```yaml
- bearerTokenFile: ""
  honorLabels: true
  interval: 30s
  metricRelabelings:
  - action: keep
    regex: (apiserver_current_inflight_requests|apiserver_request_total|apiserver_storage_objects|container_cpu_usage_seconds_total|container_memory_working_set_bytes|kubelet_containers_per_pod_count_sum|kubelet_volume_stats_used_bytes|pv_collector_total_pv_count|selinux_warning_controller_selinux_volume_conflict|volume_manager_selinux_pod_context_mismatch_errors_total|volume_manager_selinux_pod_context_mismatch_warnings_total|volume_manager_selinux_volume_context_mismatch_errors_total|volume_manager_selinux_volume_context_mismatch_warnings_total|volume_manager_selinux_volumes_admitted_total)
    sourceLabels:
    - __name__
  path: /metrics/probes
  port: https-metrics
  relabelings:
  - action: replace
    sourceLabels:
    - __metrics_path__
    targetLabel: metrics_path
  scheme: https
  scrapeTimeout: 30s
  tlsConfig:
    caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
    insecureSkipVerify: false
- bearerTokenFile: ""
  interval: 30s
  metricRelabelings:
  - action: keep
    regex: (apiserver_current_inflight_requests|apiserver_request_total|apiserver_storage_objects|container_cpu_usage_seconds_total|container_memory_working_set_bytes|kubelet_containers_per_pod_count_sum|kubelet_volume_stats_used_bytes|pv_collector_total_pv_count|selinux_warning_controller_selinux_volume_conflict|volume_manager_selinux_pod_context_mismatch_errors_total|volume_manager_selinux_pod_context_mismatch_warnings_total|volume_manager_selinux_volume_context_mismatch_errors_total|volume_manager_selinux_volume_context_mismatch_warnings_total|volume_manager_selinux_volumes_admitted_total)
    sourceLabels:
    - __name__
  port: https-metrics
  relabelings:
  - action: keep
    regex: (linux|)
    sourceLabels:
    - __meta_kubernetes_node_label_kubernetes_io_os
  - action: replace
    regex: (.+)(?::\d+)
    replacement: $1:9637
    sourceLabels:
    - __address__
    targetLabel: __address__
  - action: replace
    replacement: crio
    sourceLabels:
    - endpoint
    targetLabel: endpoint
  - action: replace
    replacement: crio
    targetLabel: job
  scheme: https
  tlsConfig:
    caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
    insecureSkipVerify: false
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
# Expected result: both checks below print "False".
python - <<'PY'
from pathlib import Path
import re
text = Path("assets/control-plane/telemetry-service-monitor-kubelet.yaml").read_text()
allow = set()
for pattern in re.findall(r'regex: \(([^)]*)\)', text):
    allow.update(pattern.split('|'))
print("Contains prober/probe metrics:", any(m.startswith(("prober_", "probe_")) for m in allow))
print("Contains crio metrics:", any(m.startswith("crio_") for m in allow))
PY
echo
echo "Kubernetes metrics reference:"
curl -fsSL https://kubernetes.io/docs/reference/instrumentation/metrics/ \
  | grep -nE 'prober_probe_total|prober_probe_duration_seconds' | head
echo
echo "CRI-O metrics guide:"
curl -fsSL https://raw.githubusercontent.com/cri-o/cri-o/main/tutorials/metrics.md \
  | grep -nE '^`crio_' | head
```

Repository: openshift/cluster-monitoring-operator
Length of output: 50394
Remove the probe and CRI-O telemetry scrapes or add their metric families to the allowlist.
Lines 60-110 define /metrics/probes and a crio endpoint that reuse the kubelet/cAdvisor keep regex. However, that regex contains no prober_* or crio_* entries. Kubernetes documents kubelet probe metrics under prober_probe_total, and CRI-O documents runtime metrics under the crio_* prefix. These two endpoints will retain zero metrics as written. Either remove them from the telemetry profile or add the intended probe and CRI-O metric families to the regex.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/control-plane/telemetry-service-monitor-kubelet.yaml` around lines 60
- 110, The /metrics/probes scrape (path: /metrics/probes) and the CRI-O scrape
(where replacement: crio and targetLabel: job) are using the kubelet/cAdvisor
keep regex in metricRelabelings which does not include prober_* or crio_*
families, so they will yield zero metrics; either remove those two scrape blocks
entirely or update the metricRelabelings regex (the regex under
metricRelabelings / sourceLabels: __name__) to include prober_probe_total and
the crio_* metric family names (e.g., add prober_probe_total|crio_.*) so the
intended probe and CRI-O metrics are allowed through.
I think this should be addressed in a separate PR.
@rexagod, understood! Would you like me to open a follow-up GitHub issue to track this — ensuring the /metrics/probes and CRI-O scrape endpoints in assets/control-plane/telemetry-service-monitor-kubelet.yaml have their metricRelabelings regex updated to include the relevant prober_* and crio_* metric families?
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tasks/openshiftstatemetrics.go`:
- Around line 108-116: Update the two error messages to use the plural
"ServiceMonitors" to match the updated methods and other tasks: change the error
returned after calling t.factory.OpenShiftStateMetricsServiceMonitors() and the
error returned after t.client.CreateOrUpdateServiceMonitors(ctx, sms) so both
messages read "... openshift-state-metrics ServiceMonitors failed: %w"
(referencing the functions OpenShiftStateMetricsServiceMonitors and
CreateOrUpdateServiceMonitors to locate the strings).
In `@pkg/tasks/prometheus.go`:
- Around line 406-414: The error messages in the Prometheus ServiceMonitor block
use the singular "ServiceMonitor" but the code deals with multiple resources;
update the fmt.Errorf messages around
t.factory.PrometheusK8sPrometheusServiceMonitors() and
t.client.CreateOrUpdateServiceMonitors(...) to say "ServiceMonitors" (plural) so
they match the resource collection and other files in the PR; locate these in
the function calling t.factory.PrometheusK8sPrometheusServiceMonitors and change
the two returned error strings accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 63e4bc26-42fd-4f09-a856-094ef7d96ebd
📒 Files selected for processing (11)
- assets/prometheus-k8s/telemetry-service-monitor.yaml
- jsonnet/components/prometheus.libsonnet
- pkg/client/client.go
- pkg/tasks/alertmanager.go
- pkg/tasks/clustermonitoringoperator.go
- pkg/tasks/controlplane.go
- pkg/tasks/kubestatemetrics.go
- pkg/tasks/nodeexporter.go
- pkg/tasks/openshiftstatemetrics.go
- pkg/tasks/prometheus.go
- pkg/tasks/telemeter.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/tasks/clustermonitoringoperator.go
- pkg/client/client.go
- pkg/tasks/alertmanager.go
- pkg/tasks/telemeter.go
- jsonnet/components/prometheus.libsonnet
- assets/prometheus-k8s/telemetry-service-monitor.yaml
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/tasks/alertmanager.go (1)
226-234: Inconsistent error messages: use plural "ServiceMonitors" to match the factory/client calls. Lines 228 and 233 still reference singular "ServiceMonitor" while the factory and client methods now operate on a slice. The destroy path (lines 388, 393) correctly uses the plural form.
✏️ Suggested fix for consistency
 smams, err := t.factory.AlertmanagerServiceMonitors()
 if err != nil {
-	return fmt.Errorf("initializing Alertmanager ServiceMonitor failed: %w", err)
+	return fmt.Errorf("initializing Alertmanager ServiceMonitors failed: %w", err)
 }
 err = t.client.CreateOrUpdateServiceMonitors(ctx, smams)
 if err != nil {
-	return fmt.Errorf("reconciling Alertmanager ServiceMonitor failed: %w", err)
+	return fmt.Errorf("reconciling Alertmanager ServiceMonitors failed: %w", err)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tasks/alertmanager.go` around lines 226 - 234, The error messages use singular "ServiceMonitor" but the factory and client methods operate on slices; update the fmt.Errorf messages in the Alertmanager reconciliation path so they use the plural "ServiceMonitors". Specifically, change the error text around calls to t.factory.AlertmanagerServiceMonitors() and t.client.CreateOrUpdateServiceMonitors(ctx, smams) to read "initializing Alertmanager ServiceMonitors failed" and "reconciling Alertmanager ServiceMonitors failed" respectively so they match the method names and the destroy-path wording.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/tasks/alertmanager.go`:
- Around line 226-234: The error messages use singular "ServiceMonitor" but the
factory and client methods operate on slices; update the fmt.Errorf messages in
the Alertmanager reconciliation path so they use the plural "ServiceMonitors".
Specifically, change the error text around calls to
t.factory.AlertmanagerServiceMonitors() and
t.client.CreateOrUpdateServiceMonitors(ctx, smams) to read "initializing
Alertmanager ServiceMonitors failed" and "reconciling Alertmanager
ServiceMonitors failed" respectively so they match the method names and the
destroy-path wording.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3aa12788-4f55-4f4c-86b0-904bf906b829
📒 Files selected for processing (3)
- pkg/client/client.go
- pkg/tasks/alertmanager.go
- pkg/tasks/telemeter.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/client/client.go
@rexagod: This pull request references MON-4517 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead. Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@rexagod: The following test failed, say
Full PR test history. Your PR dashboard. Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
// Returns a copy of the input service monitor with the provided list of metrics.
// The metrics parameter is an array of metric names (e.g., ["metric1", "metric2", "metric3"]).
// This function removes existing "drop" metricRelabelings before adding the keep filter.
(nit) Can we make the motivation for removing the drop action(s) explicit? I believe it's for optimization purposes (no need to drop if we know that we have a keep-only strategy after).
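The comment above refers to a helper along these lines. The sketch below is a simplified stand-in (local types instead of the real prometheus-operator `monv1` API, and a hypothetical `withKeepMetrics` name) showing why the drop rules are safe to remove once a keep-only allowlist is appended:

```go
package main

import (
	"fmt"
	"strings"
)

// RelabelConfig and ServiceMonitor are simplified local stand-ins for the
// prometheus-operator monv1 types; only the fields needed here are modeled.
type RelabelConfig struct {
	Action string
	Regex  string
}

type ServiceMonitor struct {
	MetricRelabelings []RelabelConfig
}

// withKeepMetrics (hypothetical name) returns a copy of sm whose relabelings
// keep only the named metrics. Existing "drop" rules are removed first: once
// the keep allowlist runs, anything a drop rule would discard is discarded
// anyway, so evaluating the drop rules is redundant work.
func withKeepMetrics(sm ServiceMonitor, metrics []string) ServiceMonitor {
	var out ServiceMonitor
	for _, r := range sm.MetricRelabelings {
		if r.Action == "drop" {
			continue // subsumed by the keep-only strategy
		}
		out.MetricRelabelings = append(out.MetricRelabelings, r)
	}
	out.MetricRelabelings = append(out.MetricRelabelings, RelabelConfig{
		Action: "keep",
		Regex:  "(" + strings.Join(metrics, "|") + ")",
	})
	return out
}

func main() {
	sm := ServiceMonitor{MetricRelabelings: []RelabelConfig{
		{Action: "drop", Regex: "go_.*"},       // removed: redundant under keep
		{Action: "labeldrop", Regex: "pod_ip"}, // unrelated to metric names, kept
	}}
	out := withKeepMetrics(sm, []string{"metric1", "metric2", "metric3"})
	for _, r := range out.MetricRelabelings {
		fmt.Println(r.Action, r.Regex)
	}
}
```

Note that only `drop` actions are stripped; other relabelings such as `labeldrop` operate on labels rather than series selection and are preserved.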
func (f *Factory) AlertmanagerServiceMonitors() ([]*monv1.ServiceMonitor, error) {
	return serviceMonitors(
		f.AlertmanagerServiceMonitor,
We should be able to get rid of the f.Alertmanager*ServiceMonitor() methods and pass file paths (e.g. AlertmanagerServiceMonitor, ...) directly to serviceMonitors().
func serviceMonitors(fullServiceMonitor, minimalServiceMonitor, telemetryServiceMonitor func() (*monv1.ServiceMonitor, error)) ([]*monv1.ServiceMonitor, error) {
	var sms []*monv1.ServiceMonitor

	if fullServiceMonitor != nil {
I wouldn't check for nil: our contract is that when supporting collection profiles, a component must implement one service monitor for each.
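Under that contract, the aggregator can invoke every constructor unconditionally. A self-contained sketch (local `ServiceMonitor` stand-in and hypothetical constructor names, not the operator's actual types) of a nil-check-free serviceMonitors looks like:

```go
package main

import "fmt"

// ServiceMonitor is a minimal local stand-in for monv1.ServiceMonitor.
type ServiceMonitor struct{ Name string }

// serviceMonitors aggregates one monitor per collection profile. There is no
// nil guard on the constructors: the contract is that a component supporting
// collection profiles implements one service monitor per profile, so a nil
// constructor is a programming error that should surface immediately.
func serviceMonitors(ctors ...func() (*ServiceMonitor, error)) ([]*ServiceMonitor, error) {
	var sms []*ServiceMonitor
	for _, ctor := range ctors {
		sm, err := ctor()
		if err != nil {
			return nil, fmt.Errorf("building ServiceMonitor failed: %w", err)
		}
		sms = append(sms, sm)
	}
	return sms, nil
}

func main() {
	// Hypothetical full/minimal constructors for a component.
	full := func() (*ServiceMonitor, error) { return &ServiceMonitor{Name: "alertmanager-main"}, nil }
	minimal := func() (*ServiceMonitor, error) { return &ServiceMonitor{Name: "alertmanager-main-minimal"}, nil }
	sms, err := serviceMonitors(full, minimal)
	fmt.Println(len(sms), err)
}
```

A variadic signature also sidesteps the fixed three-parameter shape, which pairs naturally with the earlier suggestion of passing asset paths (or constructors) directly.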
Bases minimal monitors on the same specs as the "full" ones.
This is rebased over #2821