
feat(target-allocator): Add support for scrape-classes#4216

Merged
jaronoff97 merged 3 commits into open-telemetry:main from ChristianCiach:allocator-scrapeclasses
Dec 18, 2025

Conversation

@ChristianCiach (Contributor) commented Jul 24, 2025:

Description:

Adds support for ScrapeClasses as supported by the Prometheus Operator. Users can use these to add global configurations to multiple (or even all) PodMonitors and ServiceMonitors.

I need this feature to get rid of the default labels (pod, container, namespace, ...) that the Prometheus-Operator automatically adds to all PodMonitors and ServiceMonitors as described in https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/running-exporters.md#podmonitors. I consider these labels problematic and redundant, because they are already present as resource-attributes using proper Otel Semconv names. But I cannot safely drop these labels at the collector, because at this stage I cannot distinguish actual metric labels from the labels added by the Prometheus Operator. So I want to use a default ScrapeClass to globally drop these problematic labels before the metrics are scraped.
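For illustration, a default scrape class along these lines could drop those operator-added labels. This is a sketch only: the field names follow the prometheus-operator ScrapeClass API, but the label list and the exact placement of `scrapeClasses` in the target allocator configuration are assumptions, not taken from this PR.

```yaml
# Hypothetical sketch: a default scrape class that drops the target labels
# the Prometheus Operator adds to every PodMonitor/ServiceMonitor.
# The label list here is an assumption for illustration.
scrapeClasses:
  - name: drop-default-labels
    default: true            # applied to all monitors that don't pick another class
    relabelings:
      - action: labeldrop
        regex: pod|container|namespace
```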

In #3600 (comment) @swiatekm raised a concern:

> One point worth mentioning is that unlike in prometheus-operator, in a target allocator + otel collector setup, service discovery and scraping happen in different applications. Right now we solve the issue around credentials by encrypting traffic between the target allocator and prometheus receiver, and simply exposing them via target allocator endpoints. Not sure if that makes any difference for scrape classes specifically.

Looking at the code, I believe this concern does not apply. The configuration of the scrape-classes is simply merged with the configurations found in PodMonitors and ServiceMonitors. The TargetAllocator retrieves the merged configuration from the Prometheus Operator code, so it cannot even distinguish whether a configuration originates from a PodMonitor or a ScrapeClass.

Link to tracking Issue(s): #3600

Testing:

I added a test case that adds a simple scrape-class to the PrometheusCR configuration and lets a PodMonitor reference it. The configuration of the scrape-class is correctly added to the resulting Prometheus scrape config.

Documentation:

Mentioned in README and re-generated API docs.

@ChristianCiach ChristianCiach requested a review from a team as a code owner July 24, 2025 18:44
@ChristianCiach (Contributor, Author):

@nicolastakashi I just see that you offered to work on this in #3600 (comment). This was a while ago though and the changes are pretty small, so I hope you are okay with me taking the plunge on this.

@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 3 times, most recently from e2652bc to ae7b1bd Compare July 24, 2025 19:13
@nicolastakashi (Contributor):

> @nicolastakashi I just see that you offered to work on this in #3600 (comment). This was a while ago though and the changes are pretty small, so I hope you are okay with me taking the plunge on this.

Thank you very much for working on that @ChristianCiach 🙏🏽

@nicolastakashi (Contributor) left a comment:

LGTM

@ChristianCiach (Contributor, Author) commented Jul 24, 2025:

The OpenTelemetryCollector CRD doesn't yet offer a way to configure these ScrapeClasses via .spec.targetAllocator.prometheusCR.scrapeClasses. I use the TargetAllocator standalone, so I don't really care about the rest of the Operator and its CRDs. If possible, I would like this PR to be merged without touching the rest of the operator. If you want me to, I would gladly try to extend the operator (and the CRDs) in a follow-up PR.

@ChristianCiach (Contributor, Author) commented Jul 24, 2025:

Scratch the comment above. I've pushed a commit to extend the OpenTelemetryCollector CRD, added a test case, extended the README and re-generated the API docs.

Please don't be alarmed that I've touched the surrounding tests. Adding my test was a miserable experience, because some of the existing test cases changed some common variables without any regard for the following tests. I've cleaned this up a bit, so the test cases in this function are more independent. I've taken great care to ensure that the tests still test what they're designed to test.

@ChristianCiach (Contributor, Author) commented Jul 25, 2025:

Looks like I didn't properly generate the CRD yamls (hence the failed pipelines). Sorry, I know next to nothing about building Operators and CRDs. I will look into it.

Edit: Should be all good now. https://github.com/open-telemetry/opentelemetry-operator/blob/main/CONTRIBUTING.md#local-development-cheat-sheet is a life-saver for people new to Go.

@github-actions bot commented Jul 25, 2025:

E2E Test Results

33 files ±0 · 221 suites ±0 · 3h 47m 19s ⏱️ (−3m 52s)
85 tests ±0: 85 ✅, 0 💤, 0 ❌
225 runs ±0: 225 ✅, 0 💤, 0 ❌

Results for commit cabb803. ± Comparison against base commit 087f27e.


@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 3 times, most recently from 8f9c35a to 86414e8 Compare July 25, 2025 14:24
@ChristianCiach ChristianCiach marked this pull request as draft July 25, 2025 14:26
@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch from 86414e8 to 03037d7 Compare July 25, 2025 14:30
@ChristianCiach ChristianCiach marked this pull request as ready for review July 25, 2025 14:56
@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 4 times, most recently from b3a115b to bee85d1 Compare July 28, 2025 12:55
@frzifus frzifus added the discuss-at-sig This issue or PR should be discussed at the next SIG meeting label Aug 4, 2025
@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 3 times, most recently from 32f2850 to cabb803 Compare August 5, 2025 17:32
@swiatekm (Contributor):

Sorry for not reviewing this earlier @ChristianCiach. Your changes look good to me, in general. Something that only became clear to me once I saw this PR, though, is that scrape classes add a large definition to our CRDs, and also make us directly dependent on the prometheus-operator API package. This would be much easier to accept if a Scrape Class was an independent CR like ServiceMonitor, that could exist in the cluster without impacting our definitions directly. I think this is something we'll have to discuss during a SIG meeting and figure out if we're willing to accept the added maintenance burden.

I apologize for letting you implement this before realizing that this might be a problem.

@ChristianCiach (Contributor, Author) commented Aug 12, 2025:

@swiatekm No worries! I noticed the same thing when I added the import, but I couldn't think of any alternative.

Thank you for taking this to the SIG meeting. I look forward to the decision.

If we decide to not add scrape-classes like this, I would still like to see any kind of global relabeling rules in the future, for the reasons outlined in the PR description. There is currently no other way to remove the default labels added by Pod/ServiceMonitors.

@ChristianCiach (Contributor, Author) commented Sep 8, 2025:

I am back from vacation and I wonder whether there is any news regarding this PR.

As far as I understand it, the only point of contention is whether to expose part of the Prometheus-Operator-API inside the OpentelemetryCollector CRD, making the CRD a lot larger.

The more I think about this, the more I think that this is the right thing to do. If the size of the CRD is the main concern, I could probably expose the scrapeClasses attribute as type ~~[]map[string]any~~ *runtime.RawExtension and then convert it back to []*monitoringv1.ScrapeClass internally. The main downside of this would be the lack of validation in editors and on admission.

@ChristianCiach (Contributor, Author):

> The main downside of this would be the lack of validation in editors and on admission.

But this is also the case when configuring your Prometheus rules in spec.receivers.prometheus.config. There, too, you need to make sure that your prometheus receiver configuration is compatible with the Prometheus version the receiver is importing. In other words, exposing the scrape-classes as runtime.RawExtension is analogous to exposing the raw Prometheus configuration as the config attribute of the prometheus receiver. I would be fine with that.

@swiatekm (Contributor):

@ChristianCiach apologies for the late response. We've had a lot of long vacations and other life events among the maintainers as well recently, so we're not as prompt in responding as we would've liked.

For reference, I ran a quick check on the size increase of the TargetAllocator CRD. We go from ~140KB to ~150KB, with a practical limit of around 250KB (the maximum size of an annotation value in K8s). I think that's acceptable, but we'll need to properly evaluate both this change and the dependency it creates for us.

@nicolastakashi do you know what kind of stability guarantees we can expect for the struct we're importing in this PR? Us using it this way would also introduce more friction if prometheus-operator ever wanted to make breaking changes to it, too.

@simonpasquier:
👋 prometheus-operator maintainer here!

Regarding our change policy, we stick to the Kubernetes API conventions as described in https://prometheus-operator.dev/docs/community/contributing/#changes-to-the-apis

For stable API versions (e.g. v1), we don’t allow breaking backward or forward compatibility.

Regarding the CRD size, going above 250KB is fine as long as users know how to bypass the potential issue with annotations: https://prometheus-operator.dev/docs/platform/troubleshooting/#customresourcedefinition--is-invalid-metadataannotations-too-long-issue

@ChristianCiach (Contributor, Author):

I am back from a prolonged absence and I still would like to see this merged eventually. Please let me know if there is still anything to discuss.

After having thought about this for a while, I kinda prefer my own suggestion from before to change the scrapeClasses CRD attribute to *runtime.RawExtension. I will try this locally sometime this or next week.

@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 2 times, most recently from 091b4bd to 3d90d7d Compare December 4, 2025 11:29
@ChristianCiach (Contributor, Author):

I've experimented with using RawExtension for the scrapeClasses field, but it doesn't feel right.

scrapeClasses is an array, but runtime.RawExtension can only hold an object. I could change the type to []runtime.RawExtension, but this needs awkward unwrapping when unmarshalling the CR.
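The "awkward unwrapping" step could be sketched roughly as follows. This is an illustrative, self-contained example, not code from this PR: the `RawExtension` and `ScrapeClass` types below are simplified stand-ins for `k8s.io/apimachinery`'s `runtime.RawExtension` and the prometheus-operator type, and `unwrapScrapeClasses` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RawExtension is a minimal stand-in for runtime.RawExtension:
// it just carries the raw JSON bytes of an arbitrary object.
type RawExtension struct {
	Raw []byte
}

// ScrapeClass is a simplified stand-in for the prometheus-operator type.
type ScrapeClass struct {
	Name    string `json:"name"`
	Default bool   `json:"default,omitempty"`
}

// unwrapScrapeClasses converts the untyped []RawExtension back into a
// typed slice -- the extra conversion step a []runtime.RawExtension
// CRD field would force on us.
func unwrapScrapeClasses(raw []RawExtension) ([]ScrapeClass, error) {
	out := make([]ScrapeClass, 0, len(raw))
	for _, r := range raw {
		var sc ScrapeClass
		if err := json.Unmarshal(r.Raw, &sc); err != nil {
			return nil, fmt.Errorf("invalid scrape class: %w", err)
		}
		out = append(out, sc)
	}
	return out, nil
}

func main() {
	raw := []RawExtension{
		{Raw: []byte(`{"name":"istio-mtls","default":true}`)},
	}
	classes, err := unwrapScrapeClasses(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(classes[0].Name, classes[0].Default) // prints: istio-mtls true
}
```

Note that a malformed entry only surfaces as an error here, at unmarshalling time, rather than being rejected on admission.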

I could use a simple *string:

```yaml
spec:
  targetAllocator:
    prometheusCR:
      enabled: true
      # scrapeClasses is a multiline yaml string!
      scrapeClasses: |
        - name: istio-mtls
          default: true
          tlsConfig:
            caFile: "/etc/istio-certs/root-cert.pem"
            certFile: "/etc/istio-certs/cert-chain.pem"
            keyFile: "/etc/istio-certs/key.pem"
            insecureSkipVerify: true
```

But this doesn't feel very operator'y.

Importing the ScrapeClass type of the Prometheus CRD into our own CRD is at least honest, because this is the type that can actually be used. Should the imported ScrapeClass type ever change, our own CRD should change as well, to signal a broken configuration that would otherwise only fail at runtime.

So, if the size of the CRD is not of major concern, I think this PR is good to go.

@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 2 times, most recently from f04e614 to 8be16dc Compare December 4, 2025 14:42
@swiatekm (Contributor) commented Dec 4, 2025:

@ChristianCiach how about using []v1beta1.AnyConfig, the same as we do for scrape configs embedded in the target allocator?

The size of the CRD is a concern, and I, for one, would be against adding this to the OpenTelemetryCollector CRD, which is too big as-is. For TargetAllocator it's less of an issue. This may be fine, given that we prefer that users with more sophisticated needs use the TargetAllocator CRD regardless.

There's also the concern about breaking changes in prometheus-operator's struct definition. I suppose it's included in Prometheus, which is stable. This is something we can probably live with, but I'd like opinions from more maintainers and approvers here. @open-telemetry/operator-approvers

@jaronoff97 (Contributor):

Agreed with Mikolaj's opinion. I think it makes sense to keep this in the TA alone. If a user wants to take advantage of this, they probably know what they're doing and would benefit from the standalone TA CRD anyway. This also limits the blast radius of the OTel CR in case Prometheus were to push breaking changes for whatever reason.

@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 2 times, most recently from 9151c3c to 357b06c Compare December 5, 2025 15:31
@ChristianCiach (Contributor, Author) commented Dec 5, 2025:

@swiatekm Thanks, I didn't know about v1beta1.AnyConfig. If I had known, I would've used that to begin with :)

I've just changed the CRD to use this type. Feel free to review!
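For reference, the resulting manifest might look roughly like this. This is a sketch based on the discussion above, not an example from the PR: the apiVersion and exact field placement are assumptions, and the scrape class shown reuses the istio-mtls example from earlier in this thread.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: TargetAllocator
metadata:
  name: example
spec:
  prometheusCR:
    enabled: true
    # Stored as unstructured AnyConfig values, matching the
    # prometheus-operator ScrapeClass shape.
    scrapeClasses:
      - name: istio-mtls
        default: true
        tlsConfig:
          caFile: /etc/istio-certs/root-cert.pem
          certFile: /etc/istio-certs/cert-chain.pem
          keyFile: /etc/istio-certs/key.pem
          insecureSkipVerify: true
```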

@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch from 357b06c to 1c29619 Compare December 10, 2025 08:57
@swiatekm (Contributor) left a comment:

The changes look good to me now. One thing this PR is missing is an e2e test showing that the scrape class is actually applied to the scrape configs.

@ChristianCiach ChristianCiach marked this pull request as draft December 18, 2025 13:07
@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 3 times, most recently from dd06ce5 to 7fb86d7 Compare December 18, 2025 14:12
Signed-off-by: Christian Ciach <christian.ciach@gmail.com>
Signed-off-by: Christian Ciach <christian.ciach@gmail.com>
Signed-off-by: Christian Ciach <christian.ciach@gmail.com>
@ChristianCiach ChristianCiach force-pushed the allocator-scrapeclasses branch 2 times, most recently from c998fff to 1d98549 Compare December 18, 2025 14:38
@ChristianCiach ChristianCiach marked this pull request as ready for review December 18, 2025 15:04
@swiatekm swiatekm requested a review from frzifus December 18, 2025 16:11
@jaronoff97 (Contributor):

Thank you very much for your contribution 🙇 I really appreciate the back and forth here.

@jaronoff97 jaronoff97 merged commit d8953ec into open-telemetry:main Dec 18, 2025
63 of 66 checks passed

Labels

discuss-at-sig This issue or PR should be discussed at the next SIG meeting


Development

Successfully merging this pull request may close these issues.

Support Prometheus Operator ScrapeClass.

6 participants