Preparation for auto enable network traffic for Prometheus metrics scraping based on annotations#570

Merged
zohar7ch merged 9 commits into main from
zohar7ch/support-auto-enable-metrics-scraping-servers
Mar 18, 2025

Conversation

@zohar7ch
Contributor

@zohar7ch zohar7ch commented Mar 1, 2025

Description

Before this change, if a Prometheus server in your cluster scraped metrics from multiple workloads, you had to configure client intents for every one of those workloads so that Prometheus could scrape them.
Now you can set a configuration option that lets Otterize detect which workloads need to be scraped (based on Prometheus scrape annotations), and Otterize enables traffic to the metrics scrape port on its own.

Testing

Tested locally with the service-client example and the Prometheus community edition (prometheus-community/prometheus).

Also include details of the environment this PR was developed in (language/platform/browser version).

  • This change adds test coverage for new/changed/fixed functionality

Checklist

  • I have added documentation for new/changed functionality in this PR and in github.com/otterize/docs

@zohar7ch zohar7ch force-pushed the zohar7ch/support-auto-enable-metrics-scraping-servers branch 7 times, most recently from 2954a06 to 5d9243a Compare March 6, 2025 14:20
@zohar7ch zohar7ch marked this pull request as ready for review March 6, 2025 14:20
@zohar7ch zohar7ch requested a review from omris94 March 6, 2025 14:20
@omris94
Contributor

omris94 commented Mar 6, 2025

Add tests please 🙂

@zohar7ch zohar7ch force-pushed the zohar7ch/support-auto-enable-metrics-scraping-servers branch 3 times, most recently from 40f4f11 to 1e763e5 Compare March 11, 2025 11:06
@zohar7ch zohar7ch force-pushed the zohar7ch/support-auto-enable-metrics-scraping-servers branch from 1e763e5 to c75ecee Compare March 12, 2025 06:46
This is the first layer in managing Prometheus scrape annotations.
When a pod changes, we reduce the state of the pod's entire namespace
and make sure that the current state of the cluster matches the
expected configuration.

We reduce the state of the entire namespace, rather than of the
individual pod, because the pod alone does not always tell us
everything we need. For example, take a deployment with a single pod
that carries scrape annotations. When that pod is terminated, whether
we end up creating (or keeping) or deleting its metrics-collection
network policy depends on when the original pod terminates, which is
a race condition.

Instead of trying to handle many edge cases, we opted for a stateless
approach similar to the one used in service-effective-policy: it
calculates the desired state and updates only what is necessary.
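
A minimal sketch of the stateless idea described above, not the code in this PR: derive the desired set of metrics-collection policies from everything currently in the namespace, then diff it against what exists. The `Pod` type, the `Workload` field, the `metrics-collection-` naming scheme, and both function names are assumptions made for illustration; only the `prometheus.io/scrape` annotation is the usual Prometheus convention.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Workload    string // hypothetical: name of the owning workload
	Annotations map[string]string
}

// desiredPolicies derives, from the pods currently in the namespace, the set
// of metrics-collection policies that should exist. The naming scheme is an
// assumption made for this sketch.
func desiredPolicies(pods []Pod) map[string]bool {
	desired := make(map[string]bool)
	for _, pod := range pods {
		if pod.Annotations["prometheus.io/scrape"] == "true" {
			desired["metrics-collection-"+pod.Workload] = true
		}
	}
	return desired
}

// diffPolicies compares desired and existing policy names and returns only
// the changes needed to align them, so unchanged policies are left alone.
func diffPolicies(desired map[string]bool, existing []string) (toCreate, toDelete []string) {
	existingSet := make(map[string]bool, len(existing))
	for _, name := range existing {
		existingSet[name] = true
		if !desired[name] {
			toDelete = append(toDelete, name)
		}
	}
	for name := range desired {
		if !existingSet[name] {
			toCreate = append(toCreate, name)
		}
	}
	sort.Strings(toCreate)
	sort.Strings(toDelete)
	return toCreate, toDelete
}

func main() {
	pods := []Pod{
		{Workload: "server", Annotations: map[string]string{"prometheus.io/scrape": "true"}},
		{Workload: "client"},
	}
	toCreate, toDelete := diffPolicies(desiredPolicies(pods), []string{"metrics-collection-old"})
	fmt.Println("create:", toCreate, "delete:", toDelete)
}
```

Because the result depends only on the namespace contents at the time of the call, running it again after the old pod is gone or the new pod has appeared converges on the same answer regardless of event order.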
change

When the configuration is set to 'If blocked by Otterize', we create
a network policy to enable metrics collection only if another
network policy, created by Otterize, blocks communication to the pod.

Otterize can block network traffic based either on the pod itself or
on its corresponding service. We can detect this service only after
an endpoint is established between the service and the pod.
Therefore, we need to check the status of the pods after the endpoint
is up and running.
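
The 'If blocked by Otterize' decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `Policy` type and the function names are hypothetical, every Otterize-created policy is assumed to restrict ingress, and the service-based check (which needs the endpoint to exist first) is left out.

```go
package main

import "fmt"

// Policy is a hypothetical, simplified view of a network policy.
type Policy struct {
	CreatedByOtterize bool
	PodSelector       map[string]string // an empty selector matches every pod
}

// selects reports whether every key/value pair in selector is present in labels.
func selects(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// blockedByOtterize reports whether at least one Otterize-created policy
// selects the pod. This sketch assumes such a policy restricts ingress.
func blockedByOtterize(podLabels map[string]string, policies []Policy) bool {
	for _, policy := range policies {
		if policy.CreatedByOtterize && selects(policy.PodSelector, podLabels) {
			return true
		}
	}
	return false
}

// needsMetricsPolicy applies the rule: allow scraping only for pods that are
// annotated for scraping and are blocked by an Otterize policy.
func needsMetricsPolicy(scrapeAnnotated bool, podLabels map[string]string, policies []Policy) bool {
	return scrapeAnnotated && blockedByOtterize(podLabels, policies)
}

func main() {
	policies := []Policy{
		{CreatedByOtterize: true, PodSelector: map[string]string{"app": "server"}},
	}
	fmt.Println(needsMetricsPolicy(true, map[string]string{"app": "server"}, policies))
	fmt.Println(needsMetricsPolicy(true, map[string]string{"app": "other"}, policies))
}
```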
…hange

We want to reconcile after a network policy change to handle two
scenarios:

1. A network policy is created or deleted by Otterize: if the
   configuration is set to 'If blocked by Otterize', we may need to
   create or delete a metrics-collection network policy.

2. A race condition between operator instances: for instance, during
   an operator update a new instance is created, and we don't want
   the old operator to determine the cluster's state. If the instance
   that is shutting down is the last one to run and it modifies a
   network policy, the active instance receives the update and can
   change the state as needed.
…icies

In Prometheus, we can choose which pod to scrape using scrape
annotations. These annotations can be applied to the pod, a service,
an ingress, and so on. When creating a network policy for scraping
metrics, we aim to cover all possible levels and differentiate
between them.
While we could create a single network policy that handles all
annotations, managing each one separately makes the code more
readable and reduces the number of edge cases. This refactor enables
us to specify and target the annotation level for which we created
the network policy.

We want to emit events when creating, updating, or deleting a
network policy.
The event is added to the resource that was responsible for the
network policy (that is, the one that has the scrape annotation).
@zohar7ch zohar7ch force-pushed the zohar7ch/support-auto-enable-metrics-scraping-servers branch from f57288d to 49965cd Compare March 16, 2025 11:38
@zohar7ch zohar7ch changed the title Auto enable network for Prometheus metrics scraping based on configuration Preparation for auto enable network traffic for Prometheus metrics scraping based on configuration Mar 16, 2025
@zohar7ch zohar7ch changed the title Preparation for auto enable network traffic for Prometheus metrics scraping based on configuration Preparation for auto enable network traffic for Prometheus metrics scraping based on annotations Mar 16, 2025
@zohar7ch zohar7ch force-pushed the zohar7ch/support-auto-enable-metrics-scraping-servers branch from 8f9705e to 3bf58af Compare March 16, 2025 12:53
If a resource does not specify the `prometheus.io/port` annotation,
Prometheus will attempt to scrape all the ports of the resource.
We implemented the same behavior when determining the network policy
ports, so that Prometheus keeps working correctly.
Note, however, that relying on this is not a best practice; you
should define `prometheus.io/port` explicitly.
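
As a rough sketch of that port rule (the function name `scrapePorts` and its signature are hypothetical, not taken from this PR): use the annotated port when `prometheus.io/port` is set, and otherwise fall back to every port the resource declares.

```go
package main

import (
	"fmt"
	"strconv"
)

// scrapePorts returns the ports a metrics-collection network policy should
// allow. With a prometheus.io/port annotation, only that port is returned;
// without it, all declared ports are returned, mirroring the fallback
// described above. Hypothetical helper for illustration only.
func scrapePorts(annotations map[string]string, declaredPorts []int32) ([]int32, error) {
	raw, ok := annotations["prometheus.io/port"]
	if !ok || raw == "" {
		return declaredPorts, nil
	}
	port, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return nil, fmt.Errorf("invalid prometheus.io/port %q: %w", raw, err)
	}
	if port < 1 || port > 65535 {
		return nil, fmt.Errorf("prometheus.io/port %d is out of range", port)
	}
	return []int32{int32(port)}, nil
}

func main() {
	declared := []int32{8080, 9090}

	explicit, err := scrapePorts(map[string]string{"prometheus.io/port": "9090"}, declared)
	fmt.Println(explicit, err)

	fallback, err := scrapePorts(map[string]string{"prometheus.io/scrape": "true"}, declared)
	fmt.Println(fallback, err)
}
```

Returning an error for a malformed annotation, rather than silently opening all ports, is a design choice of this sketch; the PR does not say how invalid values are handled.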

The change makes it clearer that this component handles
metrics-collection traffic and does not collect metrics itself.
@zohar7ch zohar7ch force-pushed the zohar7ch/support-auto-enable-metrics-scraping-servers branch from 3bf58af to 25097cd Compare March 17, 2025 08:30
@zohar7ch zohar7ch merged commit 28fe847 into main Mar 18, 2025
21 checks passed
@zohar7ch zohar7ch deleted the zohar7ch/support-auto-enable-metrics-scraping-servers branch March 18, 2025 11:39
@github-actions github-actions bot locked and limited conversation to collaborators Mar 18, 2025