-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add PromQL info function blog post
#2777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Arve Knudsen <[email protected]>
info function blog postinfo function blog post
ywwg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! I made a few comments
| Much more comprehensible, no? | ||
| The real magic happens under the hood though: **`info()` automatically selects the time series with the latest sample**, eliminating churn-related join failures entirely. | ||
|
|
||
| ### Basic Syntax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know Patrick ran into issues where info() had to be put in very specific places, especially when a rate function was involved. I think his intuition was to "wrap the metric name in info()" but that doesn't work when there's a rate function? So I think we should have an aside with some advice on how to know where the info function will need to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in general, but the rationale is not that info is special or needlessly complicated here, but the opposite: info is just a normal function. Nothing new here. You place it where you place any other function that receives an instant vector. (You could call that "very specific places". I would call it "just the normal expected places".) For example, label_replace is placed at exactly the same places where you would use info. (Buth functions not only take an instant vector as an argument, they also share the property that they just manipulate labels, not values. All perfectly normal and expected – I might even use the infamous word "intuitive" for that.)
Patrick's issue has to do with PromQL in general, not info in particular. Obviously, the intuition of somebody not familiar with PromQL is very different from the intuition of somebody who knows PromQL. Catering for the one might very well be counterproductive for the other.
So we should argue here that info is just a normal function acting on any instant vector, not any new concept like a "decorator" or something.
| The `info()` function is more efficient than traditional joins: | ||
| - Only selects matching info series | ||
| - Avoids unnecessary label matching operations | ||
| - Optimized query execution path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know some people were suspicious about the performance of the target_info concept ... Do we have numbers comparing the performance of label promotion vs info() joining? If it's small enough, that may help discourage people from promoting everything by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about perf comparisons versus label promotion. I just know there's an obvious performance benefit over join queries, due to being able to filter what to select on the RHS :)
|
|
||
| ## Current Limitations and Future Plans | ||
|
|
||
| The current implementation is an **MVP (Minimum Viable Product)** designed to validate the approach and gather user feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a link to directly file an issue? "If you experience any issues with info() please report them here" etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL.
| The experimental `info()` function represents a significant step forward in making PromQL more accessible and reliable. | ||
| By simplifying metadata label enrichment and automatically handling the churn problem, it removes two major pain points for Prometheus users, especially those adopting OpenTelemetry. | ||
|
|
||
| We encourage you to try the `info()` function and share your feedback: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL.
Co-authored-by: Owen Williams <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Owen Williams <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
| In Prometheus 3.0, we introduced the [`info()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#info) function, a powerful new way to enrich your time series with labels from info metrics. | ||
| `info` doesn't only offer a simpler syntax however. | ||
| It also solves a subtle yet critical problem that has plagued join queries for years: The "churn problem" that causes queries to fail when "non-identifying" info metric labels change. | ||
| In practice, "identifying labels" refers to those labels that the join is performed on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should expand a section laying out what "identifying" really means, compared to "data labels" or "nonidentifying labels", referencing the otel data model (which also uses the word "descriptive"!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to write a section and make a PR to my branch or something? Maybe the easiest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, can do
beorn7
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the avalanche of comments. Here are my main lines of thought (from which most comments derive):
- We should really focus on the cases where
infois much more simple than the traditional query:- All data labels are copied. (You have covered this case, but always further down; it should be the very first case to show, though.)
- Filtering for label value. (You haven't covered this case at all.)
- Rather than reinforcing the misconception that
infois merely solving a syntactic problem (which could be read asinfobeing merely syntactic sugar), we should clarify that the problem of the complex join queries is caused by the amount of information the user has to specify and thatinfosolves this by deducing this information from other sources (or just using heuristics for now, which also explains the shortcomings of the current implementation). - We should be very upfront about the missing parts, both that they are relevant but also that we want to work on them. That's partially already the case, but as explained in comments, it could be clearer. What we should also add is that the tooling outside of PromQL proper doesn't really handle the
infofunction appropriately yet, e.g. autocomplete or query builders. (Many issues Patrick reported were about exactly those deficiencies. But for him, it appeared as fundamental problems ofinfoand not as an incomplete implementation.)
| --- | ||
|
|
||
| Enriching metrics with metadata labels can be surprisingly tricky in Prometheus, even if you're a PromQL wiz! | ||
| Traditionally, complex PromQL join syntax is required in Prometheus to add even basic information like Kubernetes cluster names or cloud provider regions to queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like it's just a problem of syntax. This blog post might be an opportunity to debunk this naive assumption that is in everyone's head.
It could also be a good segue into the parts of the info function that still need to be finished before it can unleash its true power.
To reiterate the reasons for the complexity of the join query:
- You have to specify on which labels to join (segue into what "identifying labels" means, see @ywwg's comment, and that we want to store this information in the future).
- You have to specify the info metric to join with (segue into storing the type information in the future to know what info metrics exist).
- You have to specify which "data labels" to copy over from the info metric.
Of course, you cannot explain all of this in the opening paragraph, but maybe a bit of it, something like:
| Traditionally, complex PromQL join syntax is required in Prometheus to add even basic information like Kubernetes cluster names or cloud provider regions to queries. | |
| The PromQL join query traditionally used for this is inherently quite complex because it has to specify the labels to join on, the info metric to join with, and the labels to enrich with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL.
| The new, still experimental `info()` function, promises a simpler way, making label enrichment as simple as wrapping your query in a single function call. | ||
|
|
||
| In Prometheus 3.0, we introduced the [`info()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#info) function, a powerful new way to enrich your time series with labels from info metrics. | ||
| `info` doesn't only offer a simpler syntax however. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't think the fundamental nature of the problem is syntactical.
info is simpler because you do not have to specify the identifying labels, you do not have to specify the info metric to join with, and if you just want to enrich with all data labels, you do not even have to specify the data labels you want to enrich with.
I think each of the three features is more important (both in practical terms and as a key insight how info works and why it is useful) than the next one (the "churn problem"), among others because the churn problem only occurs with broken staleness handling (which could be fixed in OTLP ingestion), and it should actually never occur if OTel folks just used resource attributes as they are meant to be used. (Of course, we don't need to explain those subtleties here, I just want to put things into perspective.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, PTAL.
| **2. The Churn Problem (The Critical Issue):** | ||
|
|
||
| Here's the subtle but serious problem: What happens when a Kubernetes pod gets recreated? | ||
| The `k8s_pod_name` label in `target_info` changes, and Prometheus sees this as a completely new time series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in this situation, also the instance label will change (and has to change according to OTel conventions), so this problem shouldn't occur.
Maybe we can still keep using this example, but we should then also mention that this is a broken setup (where instance doesn't change upon pod recreation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, PTAL.
| Here's the subtle but serious problem: What happens when a Kubernetes pod gets recreated? | ||
| The `k8s_pod_name` label in `target_info` changes, and Prometheus sees this as a completely new time series. | ||
|
|
||
| If the old `target_info` series isn't properly marked as stale immediately, both the old and new series can exist simultaneously for up to 5 minutes (the default lookback delta). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe don't phrase this as an "if". It will never be marked properly as stale because of the way OTLP ingestion works. So on the one hand, we can clearly say that the staleness handling doesn't work right now, but we could also mention that this is an issue that could be fixed in the OTLP ingestion layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, PTAL.
| Much more comprehensible, no? | ||
| The real magic happens under the hood though: **`info()` automatically selects the time series with the latest sample**, eliminating churn-related join failures entirely. | ||
|
|
||
| ### Basic Syntax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in general, but the rationale is not that info is special or needlessly complicated here, but the opposite: info is just a normal function. Nothing new here. You place it where you place any other function that receives an instant vector. (You could call that "very specific places". I would call it "just the normal expected places".) For example, label_replace is placed at exactly the same places where you would use info. (Buth functions not only take an instant vector as an argument, they also share the property that they just manipulate labels, not values. All perfectly normal and expected – I might even use the infamous word "intuitive" for that.)
Patrick's issue has to do with PromQL in general, not info in particular. Obviously, the intuition of somebody not familiar with PromQL is very different from the intuition of somebody who knows PromQL. Catering for the one might very well be counterproductive for the other.
So we should argue here that info is just a normal function acting on any instant vector, not any new concept like a "decorator" or something.
|
|
||
| ## Technical Benefits | ||
|
|
||
| Beyond cleaner syntax, the `info()` function provides several technical advantages: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just syntax, as said. info is utilizing information that has otherwise to be provided by the user. Sorry for hammering this in so much, but I think this is very crucial to understand why info is so cool (and also what is still missing in its implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, PTAL.
| Here are some simple examples to try: | ||
|
|
||
| ```promql | ||
| # Basic usage - add all target_info labels | ||
| info(up) | ||
|
|
||
| # Selective enrichment - add only cluster name | ||
| info(up, {k8s_cluster_name=~".+"}) | ||
|
|
||
| # In a real query | ||
| info( | ||
| rate(http_server_request_duration_seconds_count[5m]), | ||
| {k8s_cluster_name=~".+"} | ||
| ) | ||
|
|
||
| # With aggregation | ||
| sum by (k8s_cluster_name) ( | ||
| info(up, {k8s_cluster_name=~".+"}) | ||
| ) | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if there is value to repeat examples already given above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed them, PTAL.
| - Workaround: You can use `__name__` matchers like `{__name__=~"(target|build)_info"}` in the data-label-selector, though this still assumes `job` and `instance` as identifying labels | ||
|
|
||
| 2. **Fixed identifying labels:** Always assumes `job` and `instance` are the identifying labels for joining | ||
| - This works for most use cases but may not be suitable for all scenarios |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this over-promises. There are common use cases already where this doesn't work. You even dropped kube_pod_labels above. We should clearly say that this is a real problem but also that we want to solve this in the future (by storing the information which labels are identifying).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, PTAL.
|
|
||
| A future version of the `info()` function should: | ||
| - Support all info metrics (not just `target_info`) | ||
| - Dynamically determine identifying labels based on the info metric's structure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not so much dynamic, but it's about (statically) storing the information what labels are identifying.
Note for future implementations (not related to the review, just happened to come to my mind while doing this review): For something like kube_pod_labels, we'll have the actual identifying labels (namespace, pod), then we'll have the data labels (in this case the actual pod labels), but when these metrics are ingested from KSM, we will also get a job and instance label attached (and whatever other target labels are configured). The future perfect version of info probably should just join on namespace and pod and only add the pod labels, but completely ignore job and instance and possibly other target labels (i.e. don't join on them, but also don't add them to the result either). Devil is in the detail here…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, PTAL.
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
bdab42e to
2f68dcb
Compare
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
|
Will have a look ASAP. (I'm saying that to many these days… it might not be very soon…) |
|
Thanks @beorn7! |
beorn7
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more things, but looking very good already.
Note that the date in the filename needs to be updated to the actual date of publication.
| During this overlap period, your join query finds **two distinct matching `target_info` time series** and fails with a "many-to-many matching" error. | ||
|
|
||
| This could in practice mean your dashboards break and your alerts stop firing when infrastructure changes are happening, perhaps precisely when you would need visibility the most. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is missing here. A headline like "The Solution in its simplest form" or something?
Currently, this appears as just another paragraph in the section describing the 2nd part of the problem…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. On reviewing the section, I realized that a sub-section was missing. I added it, PTAL.
| info(up, {__name__="build_info", version=~".+"}) | ||
| ``` | ||
|
|
||
| **Note:** The current implementation always uses `job` and `instance` as the identifying labels for joining, regardless of which info metric you select. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would love that.
|
|
||
| **Traditional approach:** | ||
| ```promql | ||
| sum by (http_status_code, k8s_cluster_name, k8s_container_name, k8s_cronjob_name, k8s_daemonset_name, k8s_deployment_name, k8s_job_name, k8s_namespace_name, k8s_pod_name, k8s_replicaset_name, k8s_statefulset_name) ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a realistic set of labels to sum by? I don't know the semantic conversions by heart, but isn't k8s_pod_name essentially unique per pod? There wouldn't be much to sum up in this way… Maybe sum by fewer labels, something that is more likely to happen in a real query.
(Mostly I want to prevent that readers just dismiss the query as contrived.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revised to what might be a more realistic set of labels. PTAL.
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Co-authored-by: Björn Rabenstein <[email protected]> Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
ywwg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make "Giving Feedback" a heading so it's more eyecatching? Right now it's buried at the end of the post -- make it really easy for people who want to talk back to do so.
| info(up, {__name__="build_info", version=~".+"}) | ||
| ``` | ||
|
|
||
| **Note:** The current implementation always uses `job` and `instance` as the identifying labels for joining, regardless of which info metric you select. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"current implementation" implies that this may change, do we have any plans to? Maybe we can call out that we are looking for feedback on this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"current implementation" implies that this may change, do we have any plans to?
Yes the plan is relatively clear, the last sentence of the same paragraph already explains how it's supposed to work in the future. I don't think we need to solicit feedback, since we know how it's supposed to work on top of persisted metadata.
Signed-off-by: Arve Knudsen <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
|
I'm traveling until Dec 16. My remaining items were not very significant, and I trust you to address them appropriately, so as soon as the other reviewers are happy, assume I am, too, and don't wait for my explicit approval. |
I'm proposing to add a blog post on the PromQL
infofunction! During the last PromCon, a revelation was that theinfofunction is too poorly marketed. One potential remediation mentioned was writing a blog post :)