
Conversation


@jan--f jan--f commented Aug 13, 2025

We have a few things still to discuss. I have outlined this in https://issues.redhat.com/browse/MON-3875. To quote:

  1. Will we reuse the existing config fields, or introduce new ones and deprecate the old fields?
  2. Will we keep generating the recording rule and remote write config from the current match list, or use another mechanism for defining the cluster-side telemetry allow list? We could, for example, offload the allow listing fully to the server side. Then we could add new telemetry metrics cluster side simply by adding the telemetry: prefix to a recording rule name.
  3. Should we keep allow-listing telemetry metrics cluster side? The telemeter-client implementation explicitly matches the configured telemetry metrics, so cluster admins can't accidentally add telemetry metrics (though they can still do so intentionally; we have server-side allowlisting to catch this). The cluster-side allow listing makes for a burdensome telemetry addition process.
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 13, 2025

openshift-ci-robot commented Aug 13, 2025

@jan--f: This pull request references MON-4321 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2025
openshift-ci bot commented Aug 13, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Aug 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2025
This commit adds generation code and manifest marshalling for a
telemetry recording rule resource. This takes the telemetry matches and
generates a recording rule that is evaluated every $telemetry_interval
(currently 4m30s). Every match creates a new time series with the
`telemetry:` prefix. The telemetry metrics are sent via remote_write after
relabel rules have removed the telemetry prefix and other unneeded labels.

Signed-off-by: Jan Fajerski <[email protected]>
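The generation step this commit message describes can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the match-expression format, the function name, and the `rule` struct are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a minimal stand-in for a Prometheus recording rule; the real
// operator marshals a full PrometheusRule manifest instead.
type rule struct {
	Record string
	Expr   string
}

// rulesFromMatches turns telemetry match expressions such as
// `{__name__="cluster_version"}` into recording rules whose names carry
// the `telemetry:` prefix. Parsing is deliberately simplified here.
func rulesFromMatches(matches []string) []rule {
	out := make([]rule, 0, len(matches))
	for _, m := range matches {
		name := strings.TrimSuffix(strings.TrimPrefix(m, `{__name__="`), `"}`)
		out = append(out, rule{Record: "telemetry:" + name, Expr: name})
	}
	return out
}

func main() {
	for _, r := range rulesFromMatches([]string{`{__name__="cluster_version"}`}) {
		fmt.Printf("%s = %s\n", r.Record, r.Expr)
	}
}
```

In the operator, the resulting rules would then be grouped with the telemetry interval (currently 4m30s) and written out as a manifest.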
@jan--f jan--f force-pushed the telemetry-recording-rule branch from 9d44065 to c97736f Compare August 13, 2025 09:00
}

-	if t.config.ClusterMonitoringConfiguration.TelemeterClientConfig.IsEnabled() && t.config.RemoteWrite {
+	if t.config.ClusterMonitoringConfiguration.TelemeterClientConfig.IsEnabled() {
Contributor

I see the TelemeterClient deployment is removed; how do we check whether TelemeterClientConfig is enabled?

@simonpasquier simonpasquier left a comment

Will we reuse the existing config fields or introduce new ones and deprecate the old fields.

I'd be inclined to say:

  • Evaluate which fields are effectively useful.
  • Deprecate old fields.
  • Remove them eventually.

In any case, we'd need to double-check with Service Delivery about potential use of the existing config options of the telemeter client.

Will we keep generating the recording rule and remote write config from the current match list or use another mechanism of defining the cluster side telemetry allow list? We could for example offload the allow listing fully to the server side. Then we can add new telemetry metrics cluster side by simply adding the telemetry: prefix to a recording rule name.

I would keep this as a distinct effort to explore later.

Should we keep allow-listing telemetry metrics cluster side? The telemeter-client implementation explicitly matches the configured telemetry metrics. Cluster admin can't accidentally add telemetry metrics (though they still can intentionally, we have server side allowlisting to catch this). The cluster side allow listing makes for a burdensome telemetry addition process.

The side-effect advantage of the current workflow is that any telemetry addition gets reviewed by OCP monitoring which avoids (to some extent) "buggy" or redundant metrics to land in telemeter. It also ensures that new metrics have "some" description attached to them (though not well structured).

// original metric name in the recording rule.
// Here we reinstate the original name and drop
// the temp name.
// See also jsonnet/components/telemetry-recording-rules.libsonnet
Contributor

I was a bit confused by the name_label label at first. After reviewing more deeply, I understand why it's needed.

I have an alternative implementation in mind (though I haven't explored it fully): what if the telemetry metrics were all recorded with the same metric name (e.g. telemetry:metric) with the original name persisted in another label (e.g. __telemetry_name__)? It would avoid the special case with label_name and not be different in terms of cardinality.

Contributor Author

I opted for using the original_metric_name label for now. While I know we use labels with the __ prefix in other places (__id), the Prometheus docs say

Label names beginning with __ (two underscores) MUST be reserved for internal Prometheus use.

So I'm not sure we should extend that practice. But overall, no strong feeling about the label name.

Contributor

Personally I like that the __ prefix indicates a label which doesn't come from the target but rather from relabeling.

@simonpasquier
Contributor

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).


jan--f commented Aug 15, 2025

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).

How about aggregating away those labels regardless?

In openshift-4.21 we can then remove the related artifacts and code from
the repo.

Signed-off-by: Jan Fajerski <[email protected]>
@jan--f jan--f force-pushed the telemetry-recording-rule branch from c97736f to d7d4db7 Compare August 15, 2025 11:10

jan--f commented Aug 15, 2025

I opted to introduce a new config field called TelemetryConfig and deprecate TelemeterClientConfig. I believe this is clearer than reusing some parts of TelemeterClientConfig and deprecating others.

This gets rid of a special case: we now track telemetry metric names
in a label for all metrics. Before remote writing, the original name is
reinstated.

Signed-off-by: Jan Fajerski <[email protected]>
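The mechanism in this commit, tracking the original metric name in a label and reinstating it before remote write, can be sketched as a plain relabeling step. This is an illustrative simulation on a label map, not the operator's actual relabel configuration; the `original_metric_name` label name follows the discussion above, everything else is assumed.

```go
package main

import "fmt"

// reinstateName mimics the remote-write relabel step: copy the original
// metric name out of the tracking label into __name__, then drop the
// tracking label so it never reaches telemeter.
func reinstateName(labels map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range labels {
		if k == "original_metric_name" {
			continue // tracking label is dropped before remote write
		}
		out[k] = v
	}
	if orig, ok := labels["original_metric_name"]; ok {
		out["__name__"] = orig // replace the telemetry:-prefixed name
	}
	return out
}

func main() {
	series := map[string]string{
		"__name__":             "telemetry:cluster_version",
		"original_metric_name": "cluster_version",
		"version":              "4.20.0",
	}
	fmt.Println(reinstateName(series)["__name__"])
}
```

With this shape, every telemetry series is handled uniformly: no per-metric special case is needed, only one generic relabel rule pair.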
@jan--f jan--f force-pushed the telemetry-recording-rule branch from 83ff0c7 to 5b98690 Compare August 19, 2025 08:29

jan--f commented Aug 19, 2025

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).

How about aggregating away those labels regardless?

Perhaps just `sum without(pod, container, ...) (...)`?

This introduces a new config field `TelemetryConfig` and deprecates
`TelemeterClientConfig`. The deprecated field will remain and act as a
fallback if no `TelemetryConfig` is provided.
Some tests relating to the deprecated field were changed to make tests
pass while keeping changes minimal.

Signed-off-by: Jan Fajerski <[email protected]>
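The fallback behaviour this commit describes might look roughly like the following. The struct shapes and field names here are hypothetical stand-ins for the operator's real config types:

```go
package main

import "fmt"

// Hypothetical, simplified config types; the real structs live in the
// cluster-monitoring-operator config package and carry more fields.
type TelemetryConfig struct{ Matches []string }
type TelemeterClientConfig struct{ Matches []string }

type Config struct {
	Telemetry       *TelemetryConfig
	TelemeterClient *TelemeterClientConfig // deprecated
}

// telemetryMatches prefers the new TelemetryConfig and falls back to the
// deprecated TelemeterClientConfig when the new field is absent.
func telemetryMatches(c Config) []string {
	if c.Telemetry != nil {
		return c.Telemetry.Matches
	}
	if c.TelemeterClient != nil {
		return c.TelemeterClient.Matches
	}
	return nil
}

func main() {
	old := Config{TelemeterClient: &TelemeterClientConfig{Matches: []string{"up"}}}
	fmt.Println(telemetryMatches(old))
}
```

Keeping the deprecated field as a silent fallback lets existing cluster configs keep working until the old field is removed in a later release.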
@jan--f jan--f force-pushed the telemetry-recording-rule branch from 5b98690 to 6ad8af4 Compare August 19, 2025 11:23
@simonpasquier
Contributor

Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule to aggregate away the "noisy" labels like pod or instance (e.g. cluster_operator_conditions).

How about aggregating away those labels regardless?

Perhaps just `sum without(pod, container, ...) (...)`?

I assume that in most cases, we'd want `max without(pod, container) (...)`. But I'd be worried to do it unilaterally as it may break some hidden contract.
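The aggregation discussed in this thread could look like the following PromQL, sketched for `cluster_operator_conditions`. Whether `max` or `sum` is the right aggregator depends on each metric's semantics, and the exact label set to drop is an assumption for illustration:

```promql
# Illustrative recording rule expression: collapse the noisy per-pod /
# per-instance dimensions before the series is sent to telemeter.
max without (pod, instance, container) (cluster_operator_conditions)
```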


openshift-ci bot commented Sep 16, 2025

@jan--f: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/ginkgo-tests
Commit: 7e85831
Required: true
Rerun command: /test ginkgo-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 16, 2025
@openshift-merge-robot

PR needs rebase.

