Member

@rexagod rexagod commented May 7, 2025

Re-opening the KEP PR to backfill on the required proposal context.

Signed-off-by: Pranshu Srivastava [email protected]


Continues: #1298

@openshift-ci openshift-ci bot requested review from jan--f and simonpasquier May 7, 2025 07:33
@rexagod rexagod force-pushed the metrics-collection-profiles branch from 559e546 to 34d451e Compare May 8, 2025 08:06
@rexagod
Member Author

rexagod commented May 8, 2025

/cc @JoaoBraveCoding

Requesting a review here. If things look good to you, I'll request the API folks to take a look. 🙂

@openshift-ci openshift-ci bot requested a review from JoaoBraveCoding May 8, 2025 08:52
@rexagod rexagod force-pushed the metrics-collection-profiles branch 3 times, most recently from 4670400 to 577a948 Compare May 12, 2025 08:34
Contributor

@JoaoBraveCoding JoaoBraveCoding left a comment

LGTM 👍 Thank you for resuming this work 🙌

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2025
data:
  config.yaml: |
    prometheusK8s:
      collectionProfile: full
Contributor

Enum values should be PascalCase

Member Author

@rexagod rexagod Jun 16, 2025

I see. I'll send out a patch to support both camelCase and PascalCase values (instead of just the former), and deprecate the camelCase ones in a later release. Does that sound good?

Also, for future reference, is there a guide that outlines the practices we follow across OpenShift?

Contributor

If we can get it updated to support both but document only the PascalCase versions going forward, that's the best approach. SGTM

And yep, an upstream K8s convention
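
For illustration, the "support both, document only PascalCase" approach could look roughly like the sketch below. This is not the actual CMO patch: the constant names and the `normalizeProfile` helper are hypothetical, and only the `CollectionProfile` type name comes from the snippet under review.

```go
package main

import "fmt"

// CollectionProfile mirrors the type name used in the proposal.
type CollectionProfile string

// Hypothetical constants using the PascalCase spellings discussed above.
const (
	FullCollectionProfile    CollectionProfile = "Full"
	MinimalCollectionProfile CollectionProfile = "Minimal"
)

// normalizeProfile (hypothetical helper) accepts the legacy lowercase
// spellings as well as the PascalCase ones and returns the canonical
// PascalCase value. Any other spelling is rejected.
func normalizeProfile(raw string) (CollectionProfile, error) {
	switch raw {
	case "full", "Full":
		return FullCollectionProfile, nil
	case "minimal", "Minimal":
		return MinimalCollectionProfile, nil
	default:
		return "", fmt.Errorf("unsupported collection profile %q", raw)
	}
}

func main() {
	for _, raw := range []string{"full", "Minimal", "FULL"} {
		profile, err := normalizeProfile(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %q\n", raw, profile)
	}
}
```

Matching exact spellings (rather than lowercasing the input) keeps the accepted set small, which should make deprecating the legacy values later more straightforward.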

// metrics that are exposed by the platform components. In the `minimal`
// profile, Prometheus only collects metrics necessary for the default
// platform alerts, recording rules, telemetry and console dashboards.
CollectionProfile CollectionProfile `json:"collectionProfile,omitempty"`
Contributor

Is this field required or optional?

What happens when it is not set?

How does the upgrade work for existing clusters, is there any action needed?

Member Author

@rexagod rexagod Jun 16, 2025

Is this field required or optional?
What happens when it is not set?

The field is optional. When it is unset, the operator behaves the same way as it did before this change (which is exactly the same as the full collection profile in all respects).

How does the upgrade work for existing clusters, is there any action needed?

After upgrading, setting this field to full does not change the operator's behavior compared to before.

However, setting this to minimal will prompt the operator to look for service or pod monitors marked as "minimal", in addition to those for kubelet, etcd, kube-state-metrics, and node-exporter, and apply only those targets.

Besides setting the field itself, no action is needed. Out of the box, the field is unset, which has the same implications as the full profile, i.e., the same behavior as earlier (all targets are discovered).

Contributor

Explaining in the godoc what the behaviours are when it's unset would be helpful for both our generated docs on the main docs site, and also for those using oc explain to understand their APIs
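
As a rough sketch of what that godoc could say, assuming the behavior described in the reply above (optional field, unset is equivalent to the full profile): the `PrometheusK8sConfig` struct here is a trimmed, hypothetical stand-in, and `effectiveProfile` is an illustrative helper rather than an existing CMO function.

```go
package main

import "fmt"

// CollectionProfile mirrors the type name used in the proposal.
type CollectionProfile string

// Hypothetical constants; the PascalCase spellings are assumed here.
const (
	FullCollectionProfile    CollectionProfile = "Full"
	MinimalCollectionProfile CollectionProfile = "Minimal"
)

// PrometheusK8sConfig is a trimmed, hypothetical stand-in for the real
// configuration struct.
type PrometheusK8sConfig struct {
	// CollectionProfile selects which platform metrics Prometheus collects.
	//
	// This field is optional. When omitted, the operator behaves exactly as
	// it did before this field existed, which is the same as the Full
	// profile: all targets are discovered. No action is required on upgrade.
	CollectionProfile CollectionProfile `json:"collectionProfile,omitempty"`
}

// effectiveProfile (hypothetical helper) resolves an unset field to Full.
func effectiveProfile(cfg PrometheusK8sConfig) CollectionProfile {
	if cfg.CollectionProfile == "" {
		return FullCollectionProfile
	}
	return cfg.CollectionProfile
}

func main() {
	fmt.Println(effectiveProfile(PrometheusK8sConfig{}))
	fmt.Println(effectiveProfile(PrometheusK8sConfig{CollectionProfile: MinimalCollectionProfile}))
}
```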


OpenShift teams can decide if they want to adopt this feature. Without any
change to a monitor, if a user picks a profile in the CMO config, things
will work as they did before. When an OpenShift team wants to implement
Contributor

If you decide later to add an additional profile, how would that impact existing teams? What would you have to do before you could introduce the new profile?

Member Author

@rexagod rexagod Jun 16, 2025

Each profile needs accompanying service or pod monitors to narrow the scope of the cluster's metrics targets to the desired one.

If a third profile is planned, the monitoring team will need to make sure that an adequate set of monitors is deployed before introducing it, so that the base experience expected from that profile is as complete as possible. These monitors will need to be created by the component owners.

Once it goes live, all teams that have created monitors for that profile (and others that do so later on) will be able to support it for their workloads should the cluster admin choose to use that profile.

As such, teams that do not have monitors for the newer profile will have their metrics targets excluded until they deploy a corresponding monitor.

Contributor

This is useful context and something you may want to capture in a "What is required when we expand the profiles in the future" heading or something similar

It may also be worth creating something in origin that checks that every monitor that defines a profile has a mirror in the payload for every profile that you support. That way, when you do add another, you can update the test, add exceptions, and then work from that list to remove all of the exceptions that are missing the new profile

Member Author

@rexagod rexagod Jul 2, 2025

I see, a historical profile-monitor mapping will certainly prove useful. I'm not sure how this data will be preserved/persisted as a source of truth between runs, though (is there a similar case I can look at?).

Also, could you please elaborate a bit on what you meant by "a mirror" in this context?

Contributor

By mirror, I meant one of the metrics collection resources for each of the profiles.

And I don't think you need to store data across runs. I think you just need a test that is aware of all profiles and checks that if a component defines one profile, it defines all profiles. It will only fail because you deliberately updated it to include the new profile, and when you do that, you'll get a list of all the components that have defined the previous profiles, because they (most likely) won't be defining the new profile yet
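
A minimal sketch of such a check, under stated assumptions: the `monitor` type is a simplified, hypothetical view of a ServiceMonitor or PodMonitor from the payload, and how the component and profile are actually derived from each resource (labels, names, and so on) is left open here.

```go
package main

import (
	"fmt"
	"sort"
)

// monitor is a hypothetical, simplified view of a ServiceMonitor or
// PodMonitor found in the payload. Profile is empty when the monitor does
// not opt into collection profiles.
type monitor struct {
	Component string
	Profile   string
}

// missingProfiles returns, for every component that defines at least one
// profile, the supported profiles it has no monitor for. Components that
// define no profile at all are ignored, since they have not opted in.
func missingProfiles(supported []string, monitors []monitor) map[string][]string {
	defined := map[string]map[string]bool{}
	for _, m := range monitors {
		if m.Profile == "" {
			continue
		}
		if defined[m.Component] == nil {
			defined[m.Component] = map[string]bool{}
		}
		defined[m.Component][m.Profile] = true
	}

	missing := map[string][]string{}
	for component, profiles := range defined {
		for _, p := range supported {
			if !profiles[p] {
				missing[component] = append(missing[component], p)
			}
		}
	}
	return missing
}

func main() {
	// Adding a new entry to this list is the deliberate change that makes
	// the check report every component that has not caught up yet.
	supported := []string{"Full", "Minimal"}
	monitors := []monitor{
		{Component: "etcd", Profile: "Full"},
		{Component: "etcd", Profile: "Minimal"},
		{Component: "kubelet", Profile: "Full"},
		{Component: "console", Profile: ""},
	}

	missing := missingProfiles(supported, monitors)
	components := make([]string, 0, len(missing))
	for c := range missing {
		components = append(components, c)
	}
	sort.Strings(components)
	for _, c := range components {
		fmt.Printf("%s is missing monitors for: %v\n", c, missing[c])
	}
}
```

No state is kept between runs: the list of supported profiles lives in the test itself, and the reported components double as the list of exceptions to work through.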

Member Author

Thank you for the pointer, this makes a lot of sense.

I think at this point we were the only ones in OpenShift shipping these profile-specific monitors, so I missed the idea of making sure they exist between payloads, but this will definitely be helpful. Will do!

Member Author

This might be a use-case for a utility I made back in the day, for various operations linked to these profiles (read, a MetricsCollectionProfilesCTL-esque CLI).

@rexagod rexagod force-pushed the metrics-collection-profiles branch from 9d6a0a2 to 0bf8482 Compare June 17, 2025 10:55
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 25, 2025
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jul 2, 2025
Contributor

openshift-ci bot commented Jul 2, 2025

@openshift-bot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rexagod
Member Author

rexagod commented Jul 2, 2025

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci bot reopened this Jul 2, 2025
Contributor

openshift-ci bot commented Jul 2, 2025

@rexagod: Reopened this PR.


@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 2, 2025
@rexagod rexagod force-pushed the metrics-collection-profiles branch from 7e4f324 to eb5b85b Compare July 2, 2025 13:15
@rexagod
Member Author

rexagod commented Jul 2, 2025

Also opened openshift/cluster-monitoring-operator#2613 for CMO-side changes.

@rexagod rexagod requested a review from JoelSpeed July 9, 2025 07:45
@rexagod
Member Author

rexagod commented Jul 14, 2025

Pinging @JoelSpeed for another look here.

Comment on lines 190 to 192
- `full` (same as today)
- `minimal` (only collect metrics necessary for recording rules, alerts,
  dashboards, HPA, VPA and telemetry)
Contributor

Given we've discussed changing these to Full and Minimal to match K8s conventions, can we fix the EP to represent the PascalCase versions?


### Open Questions

## Test Plan
Contributor

So perhaps we add here

E2E test that ensures that for every monitor that is labelled as `Full` collection profile, there also exists one for `Minimal`, and vice versa

Though I'm not sure how you'd actually work out that pairing?

Member Author

I believe rexagod/cpv can help us with that. :)

Contributor

Could this be integrated into the testing plan?

@rexagod rexagod force-pushed the metrics-collection-profiles branch from 57608ea to 5901987 Compare July 29, 2025 23:11
@rexagod rexagod requested a review from JoelSpeed July 29, 2025 23:11
Contributor

@JoelSpeed JoelSpeed left a comment

I have no further feedback. I think we will probably want some adjustments to the API when we come to making it first party rather than configmap based, but that can be discussed at that point

@rexagod rexagod force-pushed the metrics-collection-profiles branch from 5901987 to c13077f Compare August 12, 2025 08:16
Re-opening the KEP PR to backfill on the required proposal context.

This commit squashes over 25 previous ones from its predecessor.

Signed-off-by: Pranshu Srivastava <[email protected]>
@rexagod rexagod force-pushed the metrics-collection-profiles branch from c13077f to 8257f54 Compare August 12, 2025 08:39
@rexagod
Member Author

rexagod commented Aug 12, 2025

Squashed.

@rexagod rexagod requested a review from JoelSpeed August 12, 2025 08:40
@rexagod
Member Author

rexagod commented Aug 13, 2025

/cc @simonpasquier

@openshift-ci openshift-ci bot requested a review from simonpasquier August 13, 2025 11:30
@rexagod
Member Author

rexagod commented Aug 20, 2025

Pinging @simonpasquier for an LGTM here (if all looks good) 🙂

@rexagod
Member Author

rexagod commented Aug 31, 2025

Re-ping @simonpasquier for a look here 🙇🏼

Contributor

@simonpasquier simonpasquier left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 1, 2025
@simonpasquier
Contributor

/approve
/hold

Giving @jan--f the opportunity to review it once more.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 1, 2025
Contributor

openshift-ci bot commented Sep 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoaoBraveCoding, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 1, 2025
@rexagod rexagod changed the title MetricsCollectionProfiles: Reword and update KEP MON-2692: Reword and update KEP Sep 2, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 2, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 2, 2025

@rexagod: This pull request references MON-2692 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.


@rexagod
Member Author

rexagod commented Sep 2, 2025

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Sep 2, 2025

@rexagod: This pull request references MON-2692 which is a valid jira issue.


@rexagod
Member Author

rexagod commented Sep 10, 2025

(bump)

@simonpasquier
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2025
Contributor

openshift-ci bot commented Sep 10, 2025

@rexagod: all tests passed!


@openshift-merge-bot openshift-merge-bot bot merged commit 59eed35 into openshift:master Sep 10, 2025
2 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

6 participants