
Conversation


@jan--f jan--f commented Aug 13, 2025

We have a few things still to discuss. I have outlined this in https://issues.redhat.com/browse/MON-3875. To quote:

  1. Will we reuse the existing config fields, or introduce new ones and deprecate the old fields?
  2. Will we keep generating the recording rule and remote write config from the current match list, or use another mechanism for defining the cluster-side telemetry allow list? We could, for example, offload the allow listing fully to the server side. Then we could add new telemetry metrics cluster side simply by adding the telemetry: prefix to a recording rule name.
  3. Should we keep allow-listing telemetry metrics cluster side? The telemeter-client implementation explicitly matches the configured telemetry metrics, so cluster admins can't accidentally add telemetry metrics (though they can still do so intentionally; we have server-side allowlisting to catch this). The cluster-side allow listing makes for a burdensome telemetry addition process.
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 13, 2025

openshift-ci-robot commented Aug 13, 2025

@jan--f: This pull request references MON-4321 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2025
openshift-ci bot commented Aug 13, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Aug 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2025
This commit adds generation code and manifest marshalling for a
telemetry recording rule resource. This takes the telemetry matches and
generates a recording rule that is evaluated every $telemetry_interval
(currently 4m30s). Every match creates a new time series with the
`telemetry:` prefix. The telemetry metrics are sent via remote_write after
relabel rules have removed the telemetry prefix and other unneeded labels.

Signed-off-by: Jan Fajerski <[email protected]>
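The generation step this commit message describes can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the match-expression format, the function name, and the `rule` struct are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a minimal stand-in for a Prometheus recording rule; the real
// operator marshals a full PrometheusRule manifest instead.
type rule struct {
	Record string
	Expr   string
}

// rulesFromMatches turns telemetry match expressions such as
// `{__name__="cluster_version"}` into recording rules whose names carry
// the `telemetry:` prefix. Parsing is deliberately simplified here.
func rulesFromMatches(matches []string) []rule {
	out := make([]rule, 0, len(matches))
	for _, m := range matches {
		name := strings.TrimSuffix(strings.TrimPrefix(m, `{__name__="`), `"}`)
		out = append(out, rule{Record: "telemetry:" + name, Expr: name})
	}
	return out
}

func main() {
	for _, r := range rulesFromMatches([]string{`{__name__="cluster_version"}`}) {
		fmt.Printf("%s = %s\n", r.Record, r.Expr)
	}
}
```

In the operator, the resulting rules would then be grouped with the telemetry interval (currently 4m30s) and written out as a manifest.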
@jan--f jan--f force-pushed the telemetry-recording-rule branch from 9d44065 to c97736f Compare August 13, 2025 09:00
}

-	if t.config.ClusterMonitoringConfiguration.TelemeterClientConfig.IsEnabled() && t.config.RemoteWrite {
+	if t.config.ClusterMonitoringConfiguration.TelemeterClientConfig.IsEnabled() {
Contributor

I see the TelemeterClient deployment is removed; how do we check whether TelemeterClientConfig is enabled?

@simonpasquier simonpasquier left a comment

Will we reuse the existing config fields or introduce new ones and deprecate the old fields.

I'd be inclined to say:

  • Evaluate which fields are effectively useful.
  • Deprecate old fields.
  • Remove them eventually.

In any case, we'd need to double-check with Service Delivery about potential use of the existing config options of the telemeter client.

Will we keep generating the recording rule and remote write config from the current match list or use another mechanism of defining the cluster side telemetry allow list? We could for example offload the allow listing fully to the server side. Then we can add new telemetry metrics cluster side by simply adding the telemetry: prefix to a recording rule name.

I would keep this as a distinct effort to explore later.

Should we keep allow-listing telemetry metrics cluster side? The telemeter-client implementation explicitly matches the configured telemetry metrics. Cluster admin can't accidentally add telemetry metrics (though they still can intentionally, we have server side allowlisting to catch this). The cluster side allow listing makes for a burdensome telemetry addition process.

The side-effect advantage of the current workflow is that any telemetry addition gets reviewed by OCP monitoring which avoids (to some extent) "buggy" or redundant metrics to land in telemeter. It also ensures that new metrics have "some" description attached to them (though not well structured).

// original metric name in the recording rule.
// Here we reinstate the original name and drop
// the temp name.
// See also jsonnet/components/telemetry-recording-rules.libsonnet
Contributor

I was a bit confused by the name_label label at first. After reviewing more deeply, I understand why it's needed.

I have an alternative implementation in mind (though I haven't explored it fully): what if the telemetry metrics were all recorded with the same metric name (e.g. telemetry:metric) with the original name persisted in another label (e.g. __telemetry_name__)? It would avoid the special case with label_name and not be different in terms of cardinality.

Contributor Author

I opted for using the original_metric_name label for now. While I know we use labels with the __ prefix in other places (__id), the Prometheus docs say

Label names beginning with __ (two underscores) MUST be reserved for internal Prometheus use.

So I'm not sure we should extend that practice. But overall, no strong feeling about the label name.

Contributor

Personally I like that the __ prefix indicates a label which doesn't come from the target but rather from relabeling.

@simonpasquier
Contributor

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).


jan--f commented Aug 15, 2025

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).

How about aggregating away those labels regardless?

In openshift-4.21 we can then remove the related artifacts and code from
the repo.

Signed-off-by: Jan Fajerski <[email protected]>
@jan--f jan--f force-pushed the telemetry-recording-rule branch from c97736f to d7d4db7 Compare August 15, 2025 11:10

jan--f commented Aug 15, 2025

I opted to introduce a new config field called TelemetryConfig and deprecate TelemeterClientConfig. I believe this is clearer than reusing some parts of TelemeterClientConfig and deprecating others.

This gets rid of a special case: we now track telemetry metric names
in a label for all metrics. Before remote writing, the original name is
reinstated.

Signed-off-by: Jan Fajerski <[email protected]>
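The mechanism in this commit, tracking the original metric name in a label and reinstating it before remote write, can be sketched as a plain relabeling step. This is an illustrative simulation on a label map, not the operator's actual relabel configuration; the `original_metric_name` label name follows the discussion above, everything else is assumed.

```go
package main

import "fmt"

// reinstateName mimics the remote-write relabel step: copy the original
// metric name out of the tracking label into __name__, then drop the
// tracking label so it never reaches telemeter.
func reinstateName(labels map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range labels {
		if k == "original_metric_name" {
			continue // tracking label is dropped before remote write
		}
		out[k] = v
	}
	if orig, ok := labels["original_metric_name"]; ok {
		out["__name__"] = orig // replace the telemetry:-prefixed name
	}
	return out
}

func main() {
	series := map[string]string{
		"__name__":             "telemetry:cluster_version",
		"original_metric_name": "cluster_version",
		"version":              "4.20.0",
	}
	fmt.Println(reinstateName(series)["__name__"])
}
```

With this shape, every telemetry series is handled uniformly: no per-metric special case is needed, only one generic relabel rule pair.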
@jan--f jan--f force-pushed the telemetry-recording-rule branch from 83ff0c7 to 5b98690 Compare August 19, 2025 08:29

jan--f commented Aug 19, 2025

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).

How about aggregating away those labels regardless?

Perhaps just `sum without(pod, container, ...) (...)`?

This introduces a new config field `TelemetryConfig` and deprecates
`TelemeterClientConfig`. The deprecated field will remain and act as a
fallback if no `TelemetryConfig` is provided.
Some tests relating to the deprecated field were changed to make tests
pass while keeping changes minimal.

Signed-off-by: Jan Fajerski <[email protected]>
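The fallback behaviour this commit describes might look roughly like the following. The struct shapes and field names here are hypothetical stand-ins for the operator's real config types:

```go
package main

import "fmt"

// Hypothetical, simplified config types; the real structs live in the
// cluster-monitoring-operator config package and carry more fields.
type TelemetryConfig struct{ Matches []string }
type TelemeterClientConfig struct{ Matches []string }

type Config struct {
	Telemetry       *TelemetryConfig
	TelemeterClient *TelemeterClientConfig // deprecated
}

// telemetryMatches prefers the new TelemetryConfig and falls back to the
// deprecated TelemeterClientConfig when the new field is absent.
func telemetryMatches(c Config) []string {
	if c.Telemetry != nil {
		return c.Telemetry.Matches
	}
	if c.TelemeterClient != nil {
		return c.TelemeterClient.Matches
	}
	return nil
}

func main() {
	old := Config{TelemeterClient: &TelemeterClientConfig{Matches: []string{"up"}}}
	fmt.Println(telemetryMatches(old))
}
```

Keeping the deprecated field as a silent fallback lets existing cluster configs keep working until the old field is removed in a later release.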
@jan--f jan--f force-pushed the telemetry-recording-rule branch from 5b98690 to 6ad8af4 Compare August 19, 2025 11:23
@simonpasquier
Contributor

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).

How about aggregating away those labels regardless?

Perhaps just `sum without(pod, container, ...) (...)`?

I assume that in most cases, we'd want `max without(pod, container) (...)`. But I'd be worried to do it unilaterally as it may break some hidden contract.
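The aggregation discussed in this thread could look like the following PromQL, sketched for `cluster_operator_conditions`. Whether `max` or `sum` is the right aggregator depends on each metric's semantics, and the exact label set to drop is an assumption for illustration:

```promql
# Illustrative recording rule expression: collapse the noisy per-pod /
# per-instance dimensions before the series is sent to telemeter.
max without (pod, instance, container) (cluster_operator_conditions)
```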


openshift-ci bot commented Sep 16, 2025

@jan--f: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/ginkgo-tests
Commit: 7e85831
Required: true
Rerun command: /test ginkgo-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 16, 2025
@openshift-merge-robot

PR needs rebase.

