
Conversation

@rexagod (Member) commented Sep 18, 2025

Metric rules and metrics exporters have not been opted in, so that the telemetry rules keep functioning.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 18, 2025
@openshift-ci-robot (Contributor) commented Sep 18, 2025

@rexagod: This pull request references MON-4361 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Metric rules and metrics exporters have not been opted in, so that the telemetry rules keep functioning.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Sep 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2025
@rexagod rexagod force-pushed the MON-4361 branch 3 times, most recently from c4db7d8 to a322ff8 Compare September 18, 2025 14:43
@simonpasquier (Contributor) left a comment:

It might be good to wait for #2649 since it's migrating all the dashboards to static assets.

On the jsonnet implementation side, I wonder if it wouldn't be easier to read/maintain if we inject the annotation into each component that needs it.
E.g. here for components for which all resources are OptionalMonitoring:

{ ['alertmanager/' + name]: inCluster.alertmanager[name] for name in std.objectFields(inCluster.alertmanager) } +
{ ['alertmanager-user-workload/' + name]: userWorkload.alertmanager[name] for name in std.objectFields(userWorkload.alertmanager) } +
{ ['cluster-monitoring-operator/' + name]: inCluster.clusterMonitoringOperator[name] for name in std.objectFields(inCluster.clusterMonitoringOperator) } +
{ ['dashboards/' + name]: inCluster.dashboards[name] for name in std.objectFields(inCluster.dashboards) } +
{ ['kube-state-metrics/' + name]: inCluster.kubeStateMetrics[name] for name in std.objectFields(inCluster.kubeStateMetrics) } +
{ ['node-exporter/' + name]: inCluster.nodeExporter[name] for name in std.objectFields(inCluster.nodeExporter) } +
{ ['openshift-state-metrics/' + name]: inCluster.openshiftStateMetrics[name] for name in std.objectFields(inCluster.openshiftStateMetrics) } +
{ ['prometheus-k8s/' + name]: inCluster.prometheus[name] for name in std.objectFields(inCluster.prometheus) } +
{ ['admission-webhook/' + name]: inCluster.admissionWebhook[name] for name in std.objectFields(inCluster.admissionWebhook) } +
{ ['prometheus-operator/' + name]: inCluster.prometheusOperator[name] for name in std.objectFields(inCluster.prometheusOperator) } +
{ ['prometheus-operator-user-workload/' + name]: userWorkload.prometheusOperator[name] for name in std.objectFields(userWorkload.prometheusOperator) } +
{ ['prometheus-user-workload/' + name]: userWorkload.prometheus[name] for name in std.objectFields(userWorkload.prometheus) } +
{ ['metrics-server/' + name]: inCluster.metricsServer[name] for name in std.objectFields(inCluster.metricsServer) } +
// needs to be removed once remote-write is allowed for sending telemetry
{ ['telemeter-client/' + name]: inCluster.telemeterClient[name] for name in std.objectFields(inCluster.telemeterClient) } +
{ ['monitoring-plugin/' + name]: inCluster.monitoringPlugin[name] for name in std.objectFields(inCluster.monitoringPlugin) } +
{ ['thanos-querier/' + name]: inCluster.thanosQuerier[name] for name in std.objectFields(inCluster.thanosQuerier) } +
{ ['thanos-ruler/' + name]: inCluster.thanosRuler[name] for name in std.objectFields(inCluster.thanosRuler) } +
{ ['control-plane/' + name]: inCluster.controlPlane[name] for name in std.objectFields(inCluster.controlPlane) } +
{ ['manifests/' + name]: inCluster.manifests[name] for name in std.objectFields(inCluster.manifests) } +

Or at the level of the jsonnet component file in case it's per resource.

CHANGELOG.md Outdated
- `KubePdbNotEnoughHealthyPods`
- `KubeNodePressure`
- `KubeNodeEviction`
- []() Allow cluster-admins to opt-into optional monitoring using the `OptionalMonitoring` capability.
Contributor:

I realize that adding the annotation to the manifests under the assets/ directory will have no direct effect since there's no logic in CMO to deploy these resources conditionally, right?

Member Author:

Contributor:

Thanks!

@rexagod rexagod force-pushed the MON-4361 branch 2 times, most recently from 461f6b9 to cbdd203 Compare September 30, 2025 15:59
@rexagod rexagod changed the title MON-4361: [WIP] Annotate optional monitoring manifests MON-4361: Annotate optional monitoring manifests Sep 30, 2025
@rexagod (Member Author) commented Sep 30, 2025

Reverted the capability.openshift.io/name: Console annotation so that dashboards remain supported under optional monitoring, since we'll still be scraping all targets anyway (to avoid breaking any telemetry rules).

kind: ValidatingWebhookConfiguration
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Contributor:

If the service is optional (*), shouldn't we apply the annotation to all admission-webhook resources?

(*) there could be an argument that we still want the admission webhook for PrometheusRule resources because of telemetry?

Member Author:

Yes that was my understanding as well, so I limited this to AM strictly.

Contributor:

Ah I didn't realize that this webhook configuration was specifically for AlertmanagerConfig resources. I would still recommend that we document the rationale for not having all admission-webhook resources marked as optional (not sure where it should happen though).

Member Author:

Added a description annotation on the object.

Contributor:

Not directly related to this change but if the console is disabled, wouldn't it be logical to avoid deploying the monitoring plugin resources?

Member Author:

True, I've moved the plugin to be independent of the OptionalMonitoring capability and instead made it dependent on the Console one, which is in line with its task's behavior. PTAL at commit 55d6da0.

annotations:
  api-approved.openshift.io: https://github.com/openshift/api/pull/1406
  api.openshift.io/merged-by-featuregates: "true"
  capability.openshift.io/name: OptionalMonitoring
Contributor:

not sure that CMO will start if the CRDs aren't present.

Member Author:

Excluded.

kind: CustomResourceDefinition
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Contributor:

same here: IIRC the Prometheus operator will (at the minimum) log errors if the CRDs aren't installed.

Member Author:

Excluded.

Metric rules and metrics exporters have not been opted in, so that the
telemetry rules keep functioning. Optional components include:
* Alertmanager
* AlertmanagerUWM
* ClusterMonitoringOperatorDeps (partially, for AM)
* MonitoringPlugin
* PrometheusOperator (partially, for AM)
* PrometheusOperatorUWM
* ThanosRuler

Signed-off-by: Pranshu Srivastava <[email protected]>
Drop `monitoring-plugin` from optional components as its deployment
should only rely on the `Console` capability. This is in line with its
corresponding task's behavior as well. Also refactored the `jsonnet`
code a bit.

Signed-off-by: Pranshu Srivastava <[email protected]>
Enabling the `OptionalMonitoring` capability translates to enabling all
optional monitoring components under CMO. Note that since capabilities
cannot be disabled once enabled, cleanup for optional monitoring
resources is not necessary. To clarify further, there are two possible
paths at install time:
* capability is disabled -> enabled (no need to clean up)
* capability is enabled -/> (cannot be disabled) (no need to clean up)

Signed-off-by: Pranshu Srivastava <[email protected]>
@rexagod (Member Author) commented Oct 12, 2025

x-posting from #2688:

Speaking out loud but wouldn't it be possible for the operator to look at the capability.openshift.io/name annotation on each managed resource and skip the action if it's marked as OptionalMonitoring and the capability is actually disabled? This way we wouldn't have to change the tasks orchestration.

I believe that would require us to bake that behavior into several t.client methods, but as of now, most of the exclusions are per-component, with per-object exclusions only being limited to CMO and PO tasks, for which I've appended the capability check into their surrounding conditionals.

I also think a per-component exclusion at component initialization lets us dodge the need for a cache-based hasOptionalMonitoringCapability check so it's not querying the API for all object requests under a component.
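For illustration only, that per-object, annotation-driven check could look roughly like the sketch below; the capabilityChecker interface and the helper name are assumptions made for this example, and this is not what the patch ended up implementing.

package client

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// capabilityChecker stands in for the operator client discussed in this PR;
// only the method name matches, the interface itself is assumed.
type capabilityChecker interface {
    HasOptionalMonitoringCapability(ctx context.Context) (bool, error)
}

// skipForDisabledCapability reports whether a managed object should be
// skipped: true only when the object is annotated as OptionalMonitoring and
// the capability is disabled on the cluster.
func skipForDisabledCapability(ctx context.Context, c capabilityChecker, obj metav1.Object) (bool, error) {
    if obj.GetAnnotations()["capability.openshift.io/name"] != "OptionalMonitoring" {
        return false, nil
    }
    enabled, err := c.HasOptionalMonitoringCapability(ctx)
    if err != nil {
        return false, err
    }
    return !enabled, nil
}

Baking such a check into every t.client apply method is exactly the overhead mentioned above, which is why the per-component exclusion was preferred.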

EDIT: I've refactored the patch to not modify the tasks' orchestration but the tasks themselves, as I realised that we don't actually need to think about cleaning these up; quoting the third commit:

Enabling the OptionalMonitoring capability translates to enabling all
optional monitoring components under CMO. Note that since capabilities
cannot be disabled once enabled, cleanup for optional monitoring
resources is not necessary. To clarify further, there are two possible
paths at install time:

  • capability is disabled -> enabled (no need to clean up)
  • capability is enabled -/> (cannot be disabled) (no need to clean up)

@rexagod rexagod changed the title MON-4361: Annotate optional monitoring manifests MON-4361,MON-4380: Optional Monitoring Capability Oct 12, 2025
@openshift-ci-robot (Contributor) commented Oct 12, 2025

@rexagod: This pull request references MON-4361 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

This pull request references MON-4380 which is a valid jira issue.

In response to this:

Metric rules and metrics exporters have not been opted in, so that the telemetry rules keep functioning.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Excluding CRDs from optional monitoring as their absence can cause CMO
and PO to crash or, at the very least, throw errors.

Signed-off-by: Pranshu Srivastava <[email protected]>
}

func (c *Client) HasOptionalMonitoringCapability(ctx context.Context) (bool, error) {
    return c.HasClusterCapability(ctx, "")
Member Author:

/hold

Needs openshift/api#2468
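For context, a HasClusterCapability lookup could, in principle, read the capability status off the ClusterVersion object. The sketch below is an assumption-laden illustration using the openshift/client-go config clientset; the capability name is deliberately left as a parameter, since the actual constant is what openshift/api#2468 would introduce.

package client

import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    configv1client "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasClusterCapability reports whether the given capability is listed as
// enabled in the ClusterVersion status. Illustrative sketch only.
func hasClusterCapability(ctx context.Context, cli configv1client.ClusterVersionsGetter, capability configv1.ClusterVersionCapability) (bool, error) {
    cv, err := cli.ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
    if err != nil {
        return false, err
    }
    for _, c := range cv.Status.Capabilities.EnabledCapabilities {
        if c == capability {
            return true, nil
        }
    }
    return false, nil
}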

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2025

},
},
},
local addAnnotationToChildren(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
Contributor:

Suggested change:
- local addAnnotationToChildren(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
+ local addAnnotationToChildren(parent, key, value) =

@@ -0,0 +1,31 @@
{
local addAnnotationToChild(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
Contributor:

(nit) o (for object) instead of parent?

Suggested change:
- local addAnnotationToChild(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
+ local addAnnotationToChild(o, key, value) =

local annotationKeyCapability = 'capability.openshift.io/name',
local annotationValueConsoleCapability = 'Console',
local annotationValueOptionalMonitoringCapability = 'OptionalMonitoring',
consoleForObject(o): addAnnotationToChild(o, annotationKeyCapability, annotationValueConsoleCapability),
Contributor:

(suggestion) can we add a small comment before each function to describe what it does?


func (t *AlertmanagerTask) Run(ctx context.Context) error {
    if t.config.ClusterMonitoringConfiguration.AlertmanagerMainConfig.IsEnabled() {
        optionalMonitoringEnabled, err := t.client.HasOptionalMonitoringCapability(ctx)
Contributor:

Can we have the operator retrieve the information about monitoring capability once before reconciling and provide the information to each task runner?
In the same vein, we could create a small helper which would wrap a "Task" and run it only if the monitoring capability is enabled (e.g. useful for "Task"s whose managed resources all depend on the capability).
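A minimal sketch of that wrapper idea, assuming a Task interface with a Run(ctx) error method and that the capability state is resolved once before reconciliation and passed in (names here are illustrative, not necessarily the repo's):

package tasks

import "context"

// Task mirrors the task contract assumed for this sketch.
type Task interface {
    Run(ctx context.Context) error
}

// optionalMonitoringTask wraps a task whose managed resources all depend on
// the OptionalMonitoring capability and turns it into a no-op when the
// capability is disabled.
type optionalMonitoringTask struct {
    task    Task
    enabled bool // capability state, looked up once before reconciling
}

func NewOptionalMonitoringTask(task Task, capabilityEnabled bool) Task {
    return &optionalMonitoringTask{task: task, enabled: capabilityEnabled}
}

func (t *optionalMonitoringTask) Run(ctx context.Context) error {
    if !t.enabled {
        // Capability disabled: skip every resource managed by the wrapped task.
        return nil
    }
    return t.task.Run(ctx)
}

A task factory could then wrap only the fully optional components (Alertmanager, Thanos Ruler, etc.), while the partially optional tasks keep their existing per-object conditionals.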

nodeExporter: 1.9.1
promLabelProxy: 0.12.1
prometheus: 3.5.0
prometheus: 3.6.0
Contributor:

(nit) don't bother about updating the versions.yaml file in this PR.

kind: CustomResourceDefinition
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Member:

Nit: OptionalMonitoring seems like a fairly redundant name for an optional capability? Is there no more specific/descriptive name we can use for the portion of the monitoring component that's covered by this capability name? If the naming has already been hashed out in an enhancement or something, just point me at that discussion. I'm not trying to re-open something that's already settled.

Member Author:

OptionalMonitoring indicates a capability that targets the optional components of the monitoring stack. Internally, we've used the same terminology for all components that are not required for telemetry. In retrospect, this seemed better than, say, NonTelemetryComponents, but open to ideas.

Member Author:

(I followed these steps to go about implementing this capability, so I didn't create an EP, but PLMK if that's necessary)

@rexagod (Member Author) commented Oct 13, 2025

I've been trying to spin up a cluster for the better part of today with no luck, so I still need to test out the latest commit(s) just to be sure.

openshift-ci bot (Contributor) commented Oct 13, 2025

@rexagod: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 366ad50 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-techpreview 2dff195 link true /test e2e-aws-ovn-techpreview
ci/prow/versions 2dff195 link false /test versions
ci/prow/ginkgo-tests 2dff195 link false /test ginkgo-tests
ci/prow/e2e-agnostic-operator 2dff195 link true /test e2e-agnostic-operator
ci/prow/e2e-hypershift-conformance 2dff195 link true /test e2e-hypershift-conformance
ci/prow/okd-scos-e2e-aws-ovn 2dff195 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn 2dff195 link true /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rexagod (Member Author) commented Oct 14, 2025

Can we have the operator retrieve the information about monitoring capability once before reconciling and provide the information to each task runner?

Missed this part, on it.
