
[+] add Single Query Details dashboard for Prometheus, relates to #1169 #1179

Open
Bishoywadea wants to merge 12 commits into cybertec-postgresql:master from
Bishoywadea:feat/add-single-query-details-prometheus-dashboard

@Bishoywadea

@Bishoywadea Bishoywadea commented Feb 4, 2026

Add Single Query Details Dashboard for Prometheus

fixes #1169


Description

This PR introduces the Single Query Details dashboard for Prometheus.

Included Panels:

  • Avg Runtime
  • Total Runtime
  • Calls Rate
  • Shared Buffers Hit Ratio
  • Temp Blocks Read/Written
  • Backend Block Read/Write Time
  • % of Total Time in Direct I/O
  • SQL Text
  • Logo Panel

Screenshot

image

@Bishoywadea
Author

Hi @0xgouda, I have set up a single-query Prometheus dashboard for review. I only included 2 panels for now to make sure the logic is correct before I build the rest.
If this looks good, I will finish the other panels and address your suggestions.
I am planning to finish v12 first to move faster, then do all of v11 at once. Is that okay with you?

@0xgouda
Collaborator

0xgouda commented Feb 5, 2026

Looks good, please continue.

No need to create the dashboard for v11.

@0xgouda 0xgouda self-assigned this Feb 5, 2026
@0xgouda 0xgouda added the dashboards Grafana dashboards related label Feb 5, 2026
@0xgouda 0xgouda marked this pull request as draft February 5, 2026 07:52
@coveralls

coveralls commented Feb 5, 2026

Pull Request Test Coverage Report for Build 22117087494

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 58 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.6%) to 77.58%

Files with Coverage Reduction    New Missed Lines    %
internal/metrics/yaml.go         8                   93.28%
internal/sources/yaml.go         11                  89.57%
internal/sources/resolver.go     17                  78.97%
internal/sinks/prometheus.go     22                  75.85%

Totals
Change from base Build 21993433024: 0.6%
Covered Lines: 4277
Relevant Lines: 5513

💛 - Coveralls

@Bishoywadea
Author

@0xgouda I think it is ready now, unless you need new panels beyond the ones that already exist in the pg dashboard.

@Bishoywadea Bishoywadea marked this pull request as ready for review February 6, 2026 17:47
@0xgouda 0xgouda force-pushed the feat/add-single-query-details-prometheus-dashboard branch from 4aff950 to a632cdf February 9, 2026 08:33
@0xgouda
Collaborator

0xgouda commented Feb 9, 2026

@Bishoywadea, thanks for your time!

  1. Please make the Query ID field a text box (that should be fixed in the pg dashboard later as well). There is no need to add additional querying overhead; the user just types (or gets redirected with) the query ID they want to inspect.
  2. Why do you use increase() and not rate() or irate()? (Just wondering; I haven't evaluated yet which is optimal here.)
  3. There is a problem with the PromQL queries, see below: there are 2 Avg Runtime axes.
image
  4. Use $__rate_interval and remove Aggregation Interval.

@Bishoywadea
Author

  1. Done

  2. From my research, increase() gives the actual increase accumulated over the time interval, unlike rate(), which gives the per-second number of calls averaged over the interval. It is not intuitive to read that a query was executed 0.4568 times per second over the last $aggregated_interval; I think it is clearer to read that the query was executed 4 times in the last $aggregated_interval. In addition, this is what the Postgres version of the dashboard does. This image shows the difference:

image
  3. Yes, you are right. I didn't notice that bug because it happens in the demo database only. I noticed the demo database was showing two graph lines instead of one (two "Avg runtime" lines that looked almost the same). After debugging it, I found that the same queryid can have different query label values: sometimes the actual SQL text and sometimes just "-", as in the image attached below.

To fix this I wrapped the queries with sum by (dbname, queryid) so everything gets merged into a single line.
image

  4. Done
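
The difference between increase() and rate() described in point 2 can be sketched in a few lines of Python. This is a simplified model, not Prometheus's implementation: it ignores counter resets and the extrapolation Prometheus applies at window boundaries, and the sample values and function names are made up for illustration. It relies only on the documented relationship that increase() equals rate() multiplied by the window length in seconds.

```python
def simple_rate(samples):
    """Average per-second growth of a counter between the first and last sample.

    Simplified on purpose: no counter-reset handling and no boundary
    extrapolation, both of which Prometheus performs.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_increase(samples):
    """Total growth over the window, expressed as rate multiplied by window length."""
    window_seconds = samples[-1][0] - samples[0][0]
    return simple_rate(samples) * window_seconds


# Hypothetical (timestamp_seconds, calls_counter) samples over a 10-minute window.
calls = [(0, 100.0), (300, 102.0), (600, 104.0)]

print("calls per second:", simple_rate(calls))
print("calls in window: ", simple_increase(calls))
```

Both numbers describe the same data; the first needs a per-second unit on the panel, while the second depends on the window length chosen.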

@Bishoywadea
Author

Bishoywadea commented Feb 9, 2026

@0xgouda I could make a PR to make the Query ID field a text box in the pg dashboard. Also, is it required to replace Aggregation Interval with $__rate_interval in the other dashboards, or just this one? If so, I could do that as well.

@0xgouda
Collaborator

0xgouda commented Feb 10, 2026

I could make a PR to make the Query ID field a text box in the pg dashboard.

I will update the pg dashboard very soon, and then I will fix it.

Is it required to replace Aggregation Interval with $__rate_interval in the other dashboards, or just this one? If so, I could do that as well.

Yeah, we probably should, but this would require careful consideration of the scrape interval and the metric fetch interval, so that rate()/increase()/irate() get enough data points and hence show correct results.

@Bishoywadea
Author

Yeah, we probably should, but this would require careful consideration of the scrape interval and the metric fetch interval, so that rate()/increase()/irate() get enough data points and hence show correct results.

Hmm, yes, I think converting them all will be more complex than I thought. I'll think about it more and propose a solution, and then we could open an issue to convert all of them. But first I think we need something global to coordinate the fetch and scrape intervals across the whole project (I still don't know how, just brainstorming with you).

@0xgouda
Collaborator

0xgouda commented Feb 10, 2026

But first I think we need something global to coordinate the fetch and scrape intervals across the whole project (I still don't know how, just brainstorming with you).

There is no global right answer; each metric has its own fetching interval (real-time critical metrics are fetched more frequently, and heavy ones can be fetched up to every 18 hours), so this needs to be adjusted based on the specific metric.

The old way (current one) was to let users specify the aggregation interval but it's not very good.

Actually, I am in the process of refactoring most of the prom dashboards, and I am considering this.

@Bishoywadea
Author

Okay, good luck with it. I will keep an eye on the issues to see if I can help.

@Bishoywadea
Author

Hi @0xgouda, just checking in on the status of this PR. I have addressed all the previous feedback. Is there anything else you need me to do before this can be merged?

@0xgouda 0xgouda force-pushed the feat/add-single-query-details-prometheus-dashboard branch from 79c40f3 to f1b3af8 February 13, 2026 11:41
@0xgouda
Collaborator

0xgouda commented Feb 13, 2026

Hi @Bishoywadea

  1. Please remove the ($__rate_interval aggregate) from the panel names

  2. Set a min step for $__rate_interval, otherwise it won't show any data most of the time (I would suggest 9m), see below:

    image
  3. You don't have to use sum by (dbname, queryid) in all panels, just using increase() or rate() in all of them should resolve the issue

  4. Update the Query perf analysis (build on top of the latest updates in [+] improve Query Performance Analysis prom dashboard #1193) to include links to this panel for deeper investigation of this query

  5. I guess it's better to use rate() instead of increase() as then we don't have to pay attention to the aggregation interval used by grafana

  6. I don't get why there is a + 0.01 in the Shared buffer Hit Ratio query
    image

@Bishoywadea
Author

I guess it's better to use rate() instead of increase() as then we don't have to pay attention to the aggregation interval used by grafana

OK, no problem, but do you mean converting increase to rate in all panels or in a specific one? (If we change increase to rate in the calls panel, the result will differ from the pg dashboard, as I explained earlier in the comments above.)

I don't get why there is a + 0.01 in the Shared buffer Hit Ratio query

This is a guard to prevent division by zero.
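
A minimal Python sketch of what such a guard does. The actual panel query is not shown in this thread, so the formula (hits divided by hits plus reads), the placement of the 0.01 term in the denominator, and the function name are all assumptions.

```python
def hit_ratio_percent(hits, reads, guard=0.01):
    """Hit ratio in percent with a small constant added to the denominator.

    Assumed formula, not the panel's real query: the guard keeps the
    denominator non-zero when neither counter moved in the interval.
    """
    return 100.0 * hits / (hits + reads + guard)


idle = hit_ratio_percent(0.0, 0.0)   # no block activity: 0 instead of 0/0
busy = hit_ratio_percent(99.0, 1.0)  # slightly below the unguarded 99%

print(idle, busy)
```

In this model the guard pulls the ratio slightly down, and the effect grows as the counters get smaller, so it is a trade-off rather than a free fix.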

@0xgouda
Collaborator

0xgouda commented Feb 13, 2026

OK, no problem, but do you mean converting increase to rate in all panels or in a specific one?

Yeah, use rate() instead. There is no problem if it's different from the pg dashboard, but we need to explicitly specify that the unit we are using is per second (calls/s, time/s, etc.).

@Bishoywadea
Author

You don't have to use sum by (dbname, queryid) in all panels, just using increase() or rate() in all of them should resolve the issue

@0xgouda regarding this comment: the image below shows that querying with queryid and dbname sometimes returns more than one entry, and the entries have almost the same numbers. That is why I have been using sum() and dividing by the number of entries to get the average, which solves the issue you pointed out where some panels show 2 lines. I can use either sum() or avg(). Is it okay to use avg()? If not, do you have a solution, or do you know why it returns more than one entry?

image

@0xgouda
Collaborator

0xgouda commented Feb 13, 2026

Just use rate() and it will be resolved. Displaying the raw value is not very beneficial anyway; we want to know the average runtime over the aggregation interval instead.

@Bishoywadea
Author

Hi @0xgouda, sorry for the late response, I was busy.
Honestly, I don't get how you want to use only rate() without sum by(). I am sure rate() alone won't solve the problem of multiple entities returned by the query, but sum() will.
If we don't sum the valid values from all entities, we get wrong results. For example, if one node does 1000 QPS at 10ms and another does 1 QPS at 1000ms, a simple rate logic might give ~505ms, which is wrong.
So I used a weighted average: sum(rate(time)) / sum(rate(calls)), which returns the correct ~11ms, which is exactly "the average runtime over the aggregation interval".

Please reconsider, or give me more details on what logic you want the PromQL query to have.
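
The arithmetic behind the ~505ms and ~11ms figures above can be checked with a short Python sketch. The two entries and their numbers come from the example in this comment; the variable names are illustrative only.

```python
# Two hypothetical entries for the same query: calls per second and average
# runtime per call in milliseconds, as in the example above.
entries = [
    {"calls_per_s": 1000.0, "avg_ms": 10.0},
    {"calls_per_s": 1.0, "avg_ms": 1000.0},
]

# Unweighted mean of the per-entry averages; every entry counts equally.
naive_avg_ms = sum(e["avg_ms"] for e in entries) / len(entries)

# Weighted average: total execution time per second divided by total calls
# per second, mirroring sum(rate(time)) / sum(rate(calls)).
time_ms_per_s = sum(e["calls_per_s"] * e["avg_ms"] for e in entries)
calls_per_s = sum(e["calls_per_s"] for e in entries)
weighted_avg_ms = time_ms_per_s / calls_per_s

print(naive_avg_ms, round(weighted_avg_ms, 2))
```

The weighted form lets the busy entry dominate, which is why it lands near 11ms rather than 505ms.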

@0xgouda
Collaborator

0xgouda commented Feb 15, 2026

If we don't sum the valid values from all entities, we get wrong results. For example, if one node does 1000 QPS at 10ms and another does 1 QPS at 1000ms, a simple rate logic might give ~505ms, which is wrong.

  1. I mostly don't understand your examples.
  2. How are we going to have 2 nodes with different QPS for the same query on the same database? rate(pgwatch_stat_statements_calls{dbname='$dbname', queryid='$queryid'}[$__rate_interval])?

pg_stat_statements already stores these fields as counters. So if we have 2 entities, one with query=....text... and the other with query=-, the above rate() query will put them in a single vector, as they have the same dbname and queryid, then arrange them by timestamp and calculate the per-second average. Since they are counters, they build on each other's values, so what's the problem here?

@Bishoywadea Bishoywadea force-pushed the feat/add-single-query-details-prometheus-dashboard branch from 8a1e257 to f1b3af8 February 16, 2026 05:48
@Bishoywadea
Author

Bishoywadea commented Feb 16, 2026

image image image image

I tried your approach (using rate() directly without aggregation), shown in the first 3 images, but it doesn't seem to solve the issue of multiple lines.
The attached screenshots show what's happening. In Prometheus a unique time series is defined by its labels. Because the query label is different (one has the SQL text and the other is just "-"), Prometheus returns them as two separate entries in the vector.
You can see in the graph panel that this results in two overlapping lines for the same query ID and database name. In this case we have two entities reporting different QPS for that same ID; if we don't merge them, the data looks broken on the dashboard.
To fix this I am using sum by (queryid, dbname) (rate(...)).
This lets us use rate() as you suggested, but ensures that all entries for that query ID are merged into a single line. It also makes sure the "Average Runtime" math stays correct by using the weighted average: sum(rate(time)) / sum(rate(calls)) (the last attached screenshot).
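
The merging described above can be modelled in Python as grouping by a subset of labels. This is only a sketch of the idea, not of Prometheus internals: the sum_by helper, the label values, and the numbers are hypothetical, and both entries are assumed to report the same underlying counters.

```python
from collections import defaultdict


def sum_by(series, keys):
    """Sum (labels, value) pairs that share the given label names.

    Every label not listed in keys is dropped, similar in spirit to
    PromQL's sum by (...).
    """
    grouped = defaultdict(float)
    for labels, value in series:
        grouped[tuple(labels[k] for k in keys)] += value
    return dict(grouped)


# Two entries that differ only in the query label (hypothetical values).
calls_rate = [
    ({"dbname": "demo", "queryid": "42", "query": "SELECT 1"}, 0.5),
    ({"dbname": "demo", "queryid": "42", "query": "-"}, 0.5),
]
time_rate = [
    ({"dbname": "demo", "queryid": "42", "query": "SELECT 1"}, 5.0),
    ({"dbname": "demo", "queryid": "42", "query": "-"}, 5.0),
]

keys = ("dbname", "queryid")
merged_calls = sum_by(calls_rate, keys)
merged_time = sum_by(time_rate, keys)

# Average runtime per call after merging: one value per (dbname, queryid).
avg_runtime = {k: merged_time[k] / merged_calls[k] for k in merged_calls}
print(merged_calls, avg_runtime)
```

Note that in this model the summed calls rate is double that of either entry, because both carry the same data, while the ratio of the two sums is unchanged.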

1. change panels ordering
2. don't use `sum by()`
3. add `Execution time per call` panel
@0xgouda
Collaborator

0xgouda commented Feb 17, 2026

I tried your approach (using rate() directly without aggregation), shown in the first 3 images, but it doesn't seem to solve the issue of multiple lines.

It works for me; probably there is a problem with your gathered data, let me inspect.

but otherwise I will just add a couple more panels and by tomorrow or so, this should be ready for merging.

@Bishoywadea
Author

image

This is how I see the dashboard now on my side.
That is why I keep telling you rate() will not work.
I don't know whether this happens only on my machine, but I don't think so, because I didn't touch anything other than this panel.
Overall, thanks for bearing with me these 2 weeks, and Ramadan Kareem 😊

@0xgouda
Collaborator

0xgouda commented Feb 17, 2026

You probably have both the stat_statements and stat_statements_no_query_text metrics active at the same time. That's why you get 2 versions of each queryid, one with query=- and the other with the actual query text,

because stat_statements_no_query_text is the one that returns query=-.

Are you using the debug preset? Or how are you running pgwatch?

@Bishoywadea
Author

Are you using the debug preset? Or how are you running pgwatch?

Honestly, I didn't pay much attention to the other options and left everything at the defaults. I have now opened the metrics presets and found it set to debug.

@Bishoywadea
Author

I think everything is clear now. I understand why the other databases are not getting duplicated query IDs: when I checked their metrics preset, it is set to full.

@0xgouda
Collaborator

0xgouda commented Feb 17, 2026

OK, that is the problem. So yeah, please use the full preset instead.

But this is also a problem on our side: we shouldn't let the debug preset break some dashboards, so I will probably have to update it to not include both stat_statements and stat_statements_no_query_text.

Labels

dashboards Grafana dashboards related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Single Query Details prom dashboard

3 participants