
Conversation

@rexagod (Member) commented Sep 18, 2025

Metric rules and metrics exporters have not been opted in, so that the telemetry rules keep functioning.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 18, 2025
@openshift-ci-robot (Contributor) commented Sep 18, 2025

@rexagod: This pull request references MON-4361 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Metric rules and metrics exporters have not been opted in, so that the telemetry rules keep functioning.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Sep 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2025
@rexagod rexagod force-pushed the MON-4361 branch 3 times, most recently from c4db7d8 to a322ff8 Compare September 18, 2025 14:43
@simonpasquier (Contributor) left a comment:

It might be good to wait for #2649 since it's migrating all the dashboards to static assets.

On the jsonnet implementation side, I wonder if it wouldn't be easier to read/maintain if we inject the annotation into each component that needs it.
E.g. here for components for which all resources are OptionalMonitoring:

{ ['alertmanager/' + name]: inCluster.alertmanager[name] for name in std.objectFields(inCluster.alertmanager) } +
{ ['alertmanager-user-workload/' + name]: userWorkload.alertmanager[name] for name in std.objectFields(userWorkload.alertmanager) } +
{ ['cluster-monitoring-operator/' + name]: inCluster.clusterMonitoringOperator[name] for name in std.objectFields(inCluster.clusterMonitoringOperator) } +
{ ['dashboards/' + name]: inCluster.dashboards[name] for name in std.objectFields(inCluster.dashboards) } +
{ ['kube-state-metrics/' + name]: inCluster.kubeStateMetrics[name] for name in std.objectFields(inCluster.kubeStateMetrics) } +
{ ['node-exporter/' + name]: inCluster.nodeExporter[name] for name in std.objectFields(inCluster.nodeExporter) } +
{ ['openshift-state-metrics/' + name]: inCluster.openshiftStateMetrics[name] for name in std.objectFields(inCluster.openshiftStateMetrics) } +
{ ['prometheus-k8s/' + name]: inCluster.prometheus[name] for name in std.objectFields(inCluster.prometheus) } +
{ ['admission-webhook/' + name]: inCluster.admissionWebhook[name] for name in std.objectFields(inCluster.admissionWebhook) } +
{ ['prometheus-operator/' + name]: inCluster.prometheusOperator[name] for name in std.objectFields(inCluster.prometheusOperator) } +
{ ['prometheus-operator-user-workload/' + name]: userWorkload.prometheusOperator[name] for name in std.objectFields(userWorkload.prometheusOperator) } +
{ ['prometheus-user-workload/' + name]: userWorkload.prometheus[name] for name in std.objectFields(userWorkload.prometheus) } +
{ ['metrics-server/' + name]: inCluster.metricsServer[name] for name in std.objectFields(inCluster.metricsServer) } +
// needs to be removed once remote-write is allowed for sending telemetry
{ ['telemeter-client/' + name]: inCluster.telemeterClient[name] for name in std.objectFields(inCluster.telemeterClient) } +
{ ['monitoring-plugin/' + name]: inCluster.monitoringPlugin[name] for name in std.objectFields(inCluster.monitoringPlugin) } +
{ ['thanos-querier/' + name]: inCluster.thanosQuerier[name] for name in std.objectFields(inCluster.thanosQuerier) } +
{ ['thanos-ruler/' + name]: inCluster.thanosRuler[name] for name in std.objectFields(inCluster.thanosRuler) } +
{ ['control-plane/' + name]: inCluster.controlPlane[name] for name in std.objectFields(inCluster.controlPlane) } +
{ ['manifests/' + name]: inCluster.manifests[name] for name in std.objectFields(inCluster.manifests) } +

Or at the level of the jsonnet component file in case it's per resource.

CHANGELOG.md Outdated
- `KubePdbNotEnoughHealthyPods`
- `KubeNodePressure`
- `KubeNodeEviction`
- []() Allow cluster-admins to opt-into optional monitoring using the `OptionalMonitoring` capability.
Contributor:

I realize that adding the annotation to the manifests under the assets/ directory will have no direct effect since there's no logic in CMO to deploy these resources conditionally, right?

Member Author:

Contributor:

Thanks!

@rexagod rexagod force-pushed the MON-4361 branch 2 times, most recently from 461f6b9 to cbdd203 Compare September 30, 2025 15:59
@rexagod rexagod changed the title MON-4361: [WIP] Annotate optional monitoring manifests MON-4361: Annotate optional monitoring manifests Sep 30, 2025
@rexagod (Member Author) commented Sep 30, 2025

Reverted the capability.openshift.io/name: Console annotation so that dashboards remain supported under optional monitoring, since we'll still be scraping all targets anyway (to avoid breaking any telemetry rules).

kind: ValidatingWebhookConfiguration
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Contributor:

If the service is optional (*), shouldn't we apply the annotation to all admission-webhook resources?

(*) there could be an argument that we still want the admission webhook for PrometheusRule resources because of telemetry?

Member Author:

Yes that was my understanding as well, so I limited this to AM strictly.

Contributor:

Ah I didn't realize that this webhook configuration was specifically for AlertmanagerConfig resources. I would still recommend that we document the rationale for not having all admission-webhook resources marked as optional (not sure where it should happen though).

Member Author:

Added a description annotation on the object.

Contributor:

Not directly related to this change but if the console is disabled, wouldn't it be logical to avoid deploying the monitoring plugin resources?

Member Author:

True, I've moved the plugin to be independent of the OptionalMonitoring capability and instead made it dependent on the Console one, which is in line with its task's behavior. PTAL at commit 55d6da0.

annotations:
  api-approved.openshift.io: https://github.com/openshift/api/pull/1406
  api.openshift.io/merged-by-featuregates: "true"
  capability.openshift.io/name: OptionalMonitoring
Contributor:

not sure that CMO will start if the CRDs aren't present.

Member Author:

Excluded.

kind: CustomResourceDefinition
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Contributor:

same here: IIRC the Prometheus operator will (at the minimum) log errors if the CRDs aren't installed.

Member Author:

Excluded.

Metric rules and metrics exporters have not been opted in, so that the
telemetry rules keep functioning. Optional components include:
* Alertmanager
* AlertmanagerUWM
* ClusterMonitoringOperatorDeps (partially, for AM)
* MonitoringPlugin
* PrometheusOperator (partially, for AM)
* PrometheusOperatorUWM
* ThanosRuler

Signed-off-by: Pranshu Srivastava <[email protected]>
Drop `monitoring-plugin` from optional components as its deployment
should only rely on the `Console` capability. This is in line with its
corresponding task's behavior as well. Also refactored the `jsonnet`
code a bit.

Signed-off-by: Pranshu Srivastava <[email protected]>
Enabling the `OptionalMonitoring` capability translates to enabling all
optional monitoring components under CMO. Note that since capabilities
cannot be disabled once enabled, cleanup for optional monitoring
resources is not necessary. To clarify further, there are two possible
paths at install time:
* capability is disabled -> enabled (no need to clean up)
* capability is enabled -/> (cannot be disabled) (no need to clean up)

Signed-off-by: Pranshu Srivastava <[email protected]>
@rexagod (Member Author) commented Oct 12, 2025

x-posting from #2688:

Speaking out loud but wouldn't it be possible for the operator to look at the capability.openshift.io/name annotation on each managed resource and skip the action if it's marked as OptionalMonitoring and the capability is actually disabled? This way we wouldn't have to change the tasks orchestration.

I believe that would require us to bake that behavior into several t.client methods, but as of now, most of the exclusions are per-component, with per-object exclusions only being limited to CMO and PO tasks, for which I've appended the capability check into their surrounding conditionals.

I also think a per-component exclusion at component initialization lets us dodge the need for a cache-based hasOptionalMonitoringCapability check so it's not querying the API for all object requests under a component.
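For illustration only, that per-object, annotation-driven check could look roughly like the sketch below; the capabilityChecker interface and the helper name are assumptions made for this example, and this is not what the patch ended up implementing.

package client

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// capabilityChecker stands in for the operator client discussed in this PR;
// only the method name matches, the interface itself is assumed.
type capabilityChecker interface {
    HasOptionalMonitoringCapability(ctx context.Context) (bool, error)
}

// skipForDisabledCapability reports whether a managed object should be
// skipped: true only when the object is annotated as OptionalMonitoring and
// the capability is disabled on the cluster.
func skipForDisabledCapability(ctx context.Context, c capabilityChecker, obj metav1.Object) (bool, error) {
    if obj.GetAnnotations()["capability.openshift.io/name"] != "OptionalMonitoring" {
        return false, nil
    }
    enabled, err := c.HasOptionalMonitoringCapability(ctx)
    if err != nil {
        return false, err
    }
    return !enabled, nil
}

Baking such a check into every t.client apply method is exactly the overhead mentioned above, which is why the per-component exclusion was preferred.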

EDIT: I've refactored the patch to not modify the tasks' orchestration but the tasks themselves, as I realised that we don't actually need to think about cleaning these up; quoting the third commit:

Enabling the OptionalMonitoring capability translates to enabling all
optional monitoring components under CMO. Note that since capabilities
cannot be disabled once enabled, cleanup for optional monitoring
resources is not necessary. To clarify further, there are two possible
paths at install time:

  • capability is disabled -> enabled (no need to clean up)
  • capability is enabled -/> (cannot be disabled) (no need to clean up)

@rexagod rexagod changed the title MON-4361: Annotate optional monitoring manifests MON-4361,MON-4380: Optional Monitoring Capability Oct 12, 2025
@openshift-ci-robot (Contributor) commented Oct 12, 2025

@rexagod: This pull request references MON-4361 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

This pull request references MON-4380 which is a valid jira issue.

In response to this:

Metric rules and metrics exporters have not been opted in, so that the telemetry rules keep functioning.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Excluding CRDs from optional monitoring as their absence can cause CMO
and PO to crash or, at the very least, throw errors.

Signed-off-by: Pranshu Srivastava <[email protected]>
}

func (c *Client) HasOptionalMonitoringCapability(ctx context.Context) (bool, error) {
    return c.HasClusterCapability(ctx, "")
Member Author:

/hold

Needs openshift/api#2468
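For context, a HasClusterCapability lookup could, in principle, read the capability status off the ClusterVersion object. The sketch below is an assumption-laden illustration using the openshift/client-go config clientset; the capability name is deliberately left as a parameter, since the actual constant is what openshift/api#2468 would introduce.

package client

import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    configv1client "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasClusterCapability reports whether the given capability is listed as
// enabled in the ClusterVersion status. Illustrative sketch only.
func hasClusterCapability(ctx context.Context, cli configv1client.ClusterVersionsGetter, capability configv1.ClusterVersionCapability) (bool, error) {
    cv, err := cli.ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
    if err != nil {
        return false, err
    }
    for _, c := range cv.Status.Capabilities.EnabledCapabilities {
        if c == capability {
            return true, nil
        }
    }
    return false, nil
}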

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2025

},
},
},
local addAnnotationToChildren(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
Contributor:

Suggested change:
- local addAnnotationToChildren(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
+ local addAnnotationToChildren(parent, key, value) =

@@ -0,0 +1,31 @@
{
local addAnnotationToChild(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
Contributor:

(nit) o (for object) instead of parent?

Suggested change:
- local addAnnotationToChild(parent, annotationKeyCapability, annotationValueOptionalMonitoringCapability) =
+ local addAnnotationToChild(o, key, value) =

local annotationKeyCapability = 'capability.openshift.io/name',
local annotationValueConsoleCapability = 'Console',
local annotationValueOptionalMonitoringCapability = 'OptionalMonitoring',
consoleForObject(o): addAnnotationToChild(o, annotationKeyCapability, annotationValueConsoleCapability),
Contributor:

(suggestion) can we add a small comment before each function to describe what it does?


func (t *AlertmanagerTask) Run(ctx context.Context) error {
    if t.config.ClusterMonitoringConfiguration.AlertmanagerMainConfig.IsEnabled() {
        optionalMonitoringEnabled, err := t.client.HasOptionalMonitoringCapability(ctx)
Contributor:

Can we have the operator retrieve the information about monitoring capability once before reconciling and provide the information to each task runner?
In the same vein, we could create a small helper which would wrap a "Task" and run it only if the monitoring capability is enabled (e.g. useful for "Task"s whose managed resources all depend on the capability).
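A minimal sketch of that wrapper idea, assuming a Task interface with a Run(ctx) error method and that the capability state is resolved once before reconciliation and passed in (names here are illustrative, not necessarily the repo's):

package tasks

import "context"

// Task mirrors the task contract assumed for this sketch.
type Task interface {
    Run(ctx context.Context) error
}

// optionalMonitoringTask wraps a task whose managed resources all depend on
// the OptionalMonitoring capability and turns it into a no-op when the
// capability is disabled.
type optionalMonitoringTask struct {
    task    Task
    enabled bool // capability state, looked up once before reconciling
}

func NewOptionalMonitoringTask(task Task, capabilityEnabled bool) Task {
    return &optionalMonitoringTask{task: task, enabled: capabilityEnabled}
}

func (t *optionalMonitoringTask) Run(ctx context.Context) error {
    if !t.enabled {
        // Capability disabled: skip every resource managed by the wrapped task.
        return nil
    }
    return t.task.Run(ctx)
}

A task factory could then wrap only the fully optional components (Alertmanager, Thanos Ruler, etc.), while the partially optional tasks keep their existing per-object conditionals.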

nodeExporter: 1.9.1
promLabelProxy: 0.12.1
prometheus: 3.5.0
prometheus: 3.6.0
Contributor:

(nit) don't bother about updating the versions.yaml file in this PR.

kind: CustomResourceDefinition
metadata:
  annotations:
    capability.openshift.io/name: OptionalMonitoring
Member:

Nit: OptionalMonitoring seems like a fairly redundant name for an optional capability? Is there no more specific/descriptive name we can use for the portion of the monitoring component that's covered by this capability name? If the naming has already been hashed out in an enhancement or something, just point me at that discussion. I'm not trying to re-open something that's already settled.

Member Author:

OptionalMonitoring indicates a capability that targets the optional components of the monitoring stack. Internally, we've used the same terminology for all components that are not required for telemetry. In retrospect, this seemed better than, say, NonTelemetryComponents, but open to ideas.

Member Author:

(I followed these steps to go about implementing this capability, so I didn't create an EP, but PLMK if that's necessary)

@rexagod (Member Author) commented Oct 13, 2025

I've been trying to spin up a cluster for the better part of today with no luck, so I still need to test out the latest commit(s) just to be sure.

openshift-ci bot (Contributor) commented Oct 13, 2025

@rexagod: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 366ad50 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-techpreview 2dff195 link true /test e2e-aws-ovn-techpreview
ci/prow/versions 2dff195 link false /test versions
ci/prow/ginkgo-tests 2dff195 link false /test ginkgo-tests
ci/prow/e2e-agnostic-operator 2dff195 link true /test e2e-agnostic-operator
ci/prow/e2e-hypershift-conformance 2dff195 link true /test e2e-hypershift-conformance
ci/prow/okd-scos-e2e-aws-ovn 2dff195 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn 2dff195 link true /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rexagod (Member Author) commented Oct 14, 2025

Can we have the operator retrieve the information about monitoring capability once before reconciling and provide the information to each task runner?

Missed this part, on it.
