[wip] MON-4321: Send telemetry via remote-write #2643
@jan--f: This pull request references MON-4321, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jan--f
This commit adds generation code and manifest marshalling for a telemetry recording rule resource. It takes the telemetry matches and generates a recording rule that is evaluated every $telemetry_interval (currently 4m30s). Every match creates a new time series with the `telemetry:` prefix. The telemetry metrics are sent via remote_write after relabel rules have removed the telemetry prefix and other unneeded labels. Signed-off-by: Jan Fajerski <[email protected]>
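As a rough sketch, the generated artifacts could look like the following configuration (prometheus-operator style field names). The rule names, metric names, and endpoint URL are illustrative assumptions, not the actual generated manifests:

```yaml
# Hypothetical generated recording rule group (names illustrative).
groups:
  - name: telemetry.rules
    interval: 4m30s            # $telemetry_interval
    rules:
      - record: telemetry:cluster_version
        expr: cluster_version

# Hypothetical remote-write section that strips the prefix again
# before samples leave the cluster (placeholder URL).
remoteWrite:
  - url: https://telemetry-gateway.example.com/metrics/v1/receive
    writeRelabelConfigs:
      - sourceLabels: [__name__]
        regex: "telemetry:(.+)"
        targetLabel: __name__
        replacement: "$1"
```

Only series whose name matches the `telemetry:` prefix would be rewritten by the relabel rule; all other names pass through unchanged, since a `replace` action only applies when the regex matches.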
Force-pushed from 9d44065 to c97736f.
pkg/tasks/prometheus.go (outdated)
- if t.config.ClusterMonitoringConfiguration.TelemeterClientConfig.IsEnabled() && t.config.RemoteWrite {
+ if t.config.ClusterMonitoringConfiguration.TelemeterClientConfig.IsEnabled() {
I see the TelemeterClient deployment is removed; how do we check whether TelemeterClientConfig is enabled?
Will we reuse the existing config fields, or introduce new ones and deprecate the old fields?
I'd be inclined to say:
- Evaluate which fields are effectively useful.
- Deprecate the old fields.
- Remove them eventually.
In any case, we'd need to double-check with Service Delivery about potential use of the existing config options of the telemeter client.
Will we keep generating the recording rule and remote-write config from the current match list, or use another mechanism for defining the cluster-side telemetry allow list? We could, for example, offload the allow-listing fully to the server side. Then we could add new telemetry metrics cluster-side simply by adding the `telemetry:` prefix to a recording rule name.
I would keep this as a distinct effort to explore later.
Should we keep allow-listing telemetry metrics cluster-side? The telemeter-client implementation explicitly matches the configured telemetry metrics, so a cluster admin can't accidentally add telemetry metrics (they still can intentionally, but we have server-side allow-listing to catch this). The cluster-side allow-listing makes for a burdensome telemetry addition process.
The side-effect advantage of the current workflow is that any telemetry addition gets reviewed by OCP monitoring, which avoids (to some extent) "buggy" or redundant metrics landing in Telemeter. It also ensures that new metrics have "some" description attached to them (though not well structured).
// original metric name in the recording rule.
// Here we reinstate the original name and drop
// the temp name.
// See also jsonnet/components/telemetry-recording-rules.libsonnet
I was a bit confused by the `name_label` label at first. After reviewing more deeply, I understand why it's needed.
I have an alternative implementation in mind (though I haven't explored it fully): what if the telemetry metrics were all recorded with the same metric name (e.g. `telemetry:metric`) with the original name persisted in another label (e.g. `__telemetry_name__`)? It would avoid the special case with `label_name` and not be different in terms of cardinality.
I opted for using the `original_metric_name` label for now. While I know we use labels with the `__` prefix in other places (`__id`), the Prometheus docs say:
Label names beginning with __ (two underscores) MUST be reserved for internal Prometheus use.
So I'm not sure we should extend that practice. But overall I have no strong feeling about the label name.
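For illustration, recording everything under a shared name with the original kept in a plain label could look roughly like this. The source metric and rule names are made up, and the exact expression is an assumption, not the PR's implementation:

```yaml
# Sketch only: one shared recorded name, original name in a plain label.
groups:
  - name: telemetry.rules
    rules:
      - record: telemetry:metric
        expr: label_replace(cluster_version, "original_metric_name", "$1", "__name__", "(.*)")

# Before remote-writing, reinstate the original name and drop the helper label.
writeRelabelConfigs:
  - sourceLabels: [original_metric_name]
    regex: "(.+)"
    targetLabel: __name__
  - regex: original_metric_name
    action: labeldrop
```

Here `label_replace` copies the metric name into `original_metric_name` at rule-evaluation time, and the write relabeling reverses it on the way out.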
Personally I like that the `__` prefix indicates a label which doesn't come from the target but rather from relabeling.
Side comment: we could also "optimize" some of the legacy metrics which are transmitted from the raw source without any recording rule, to aggregate away the "noisy" labels like pod or instance (e.g. …).
How about aggregating away those labels regardless?
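A hedged sketch of what such an aggregating rule could look like; the metric and label names are invented for illustration:

```yaml
# Illustrative: pre-aggregate a raw metric so per-pod/per-instance
# cardinality never leaves the cluster.
groups:
  - name: telemetry.rules
    rules:
      - record: telemetry:container_cpu_usage:sum
        expr: sum without (pod, instance) (container_cpu_usage)
```

`sum without (...)` drops the listed labels before the series is shipped, so Telemeter receives one aggregated series instead of one per pod.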
In openshift-4.21 we can then remove the related artifacts and code from the repo. Signed-off-by: Jan Fajerski <[email protected]>
Force-pushed from c97736f to d7d4db7.
I opted to introduce a new config field called …
This gets rid of a special case; we just track telemetry metric names in a label for all metrics. Before remote-writing, the original name is reinstated. Signed-off-by: Jan Fajerski <[email protected]>
Force-pushed from 83ff0c7 to 5b98690.
Perhaps just …
This introduces a new config field `TelemetryConfig` and deprecates `TelemeterClientConfig`. The deprecated field will remain and act as a fallback if no `TelemetryConfig` is provided. Some tests relating to the deprecated field were changed to make tests pass while keeping changes minimal. Signed-off-by: Jan Fajerski <[email protected]>
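For illustration only, the user-facing side of the new field might look like the ConfigMap below. The YAML keys `telemetryConfig` and `telemeterClient` are assumptions about the merged schema, not confirmed names:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # New field (assumed key name); takes precedence when present.
    telemetryConfig:
      enabled: true
    # Deprecated; acts as a fallback when telemetryConfig is absent.
    # telemeterClient:
    #   enabled: true
```

The fallback behavior means existing cluster configs keep working unchanged until admins migrate to the new field.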
Force-pushed from 5b98690 to 6ad8af4.
Signed-off-by: Jan Fajerski <[email protected]>
I assume that in most cases, we'd want …
@jan--f for awareness, Service Delivery uses this config for telemeter client: …
@jan--f: The following test failed, say `/retest` to rerun all failed tests:
PR needs rebase.
We have a few things still to discuss. I have outlined this in https://issues.redhat.com/browse/MON-3875. To quote: