@gaantunes
Contributor

This adds a PostgreSQL observ lib. It currently implements the Prometheus postgres_exporter as its metrics source.

Composed of these dashboards:

Cluster Overview dashboard for cluster stats at a glance
image

Overview dashboard for instance drilldown
image

Query Overview for query analysis
image

Also packs the following alerts:

| Alert | Condition | Severity |
| --- | --- | --- |
| PostgreSQLDown | pg_up == 0 | critical |
| PostgreSQLHighConnectionUsage | >80% | warning |
| PostgreSQLLowCacheHitRatio | <90% | warning |
| PostgreSQLReplicationLag | >30s | warning |
| PostgreSQLReplicationLagCritical | >1h | critical |
| PostgreSQLDeadlocks | any | warning |
| PostgreSQLLongRunningQuery | >5min | warning |
| PostgreSQLBlockedQueries | any | warning |
| PostgreSQLWALArchiveFailure | any | critical |
| PostgreSQLHighDeadTuples | >10% | warning |
| PostgreSQLVacuumNotRunning | >7 days | warning |
| PostgreSQLTooManyRollbacks | >10% | warning |
| PostgreSQLTooManyLocksAcquired | >20% | warning |
| PostgreSQLInactiveReplicationSlot | any | critical |
| PostgreSQLReplicationRoleChanged | any | warning |
| PostgreSQLExporterErrors | any | critical |
| PostgreSQLHighQPS | >10000 | warning |
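
As a rough illustration of how one row of this table could map onto a Prometheus rule in a Jsonnet mixin, here is a minimal sketch for `PostgreSQLDown`. The `pg_up == 0` expression and the `critical` severity come from the table; the group name, the `for` duration, and the description text are assumptions made for illustration and are not necessarily the values used in this PR.

```jsonnet
// Minimal sketch of one alert from the table above.
// Assumed for illustration: group name, 'for' duration, description wording.
{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'postgres-alerts',  // hypothetical group name
        rules: [
          {
            alert: 'PostgreSQLDown',
            expr: 'pg_up == 0',
            'for': '5m',  // assumed duration
            labels: { severity: 'critical' },
            annotations: {
              summary: 'PostgreSQL is down.',
              description: 'The exporter for {{ $labels.instance }} reports pg_up == 0.',
            },
          },
        ],
      },
    ],
  },
}
```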

@gaantunes gaantunes requested a review from a team as a code owner December 5, 2025 16:08
@Dasomeone Dasomeone self-assigned this Dec 5, 2025
Member

@Dasomeone Dasomeone left a comment

Unfortunately, I don't have the time to give this the attention it deserves!
I did a visual pass based on your screenshots and the sample app over on integration-sample-apps.

I ran into quite a few instances of "No data" in panels where I know the metrics exist (checked in Explore). There may be some incompatibility here, but it's worth double-checking yourself!

In terms of overall dashboard structure, I think it looks great. I'm in favour of your adoption of the modular approach, though I think you could make a bit more use of the pre-existing styles in the panels part of the common-lib, rather than overriding the generic ones each time.

Left a couple other comments, but that's all I have time for right now, sorry!
Also please check linting and jsonnet formatting:
Mixtool:

postgres-observ-lib$ mixtool lint mixin.libsonnet 
could not unmarshal lint configuration .lint: EOF
[alert-summary-style] Alert PostgreSQLDown annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL is down'
[alert-summary-style] Alert PostgreSQLHighConnectionUsage annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL connection usage is high'
[alert-summary-style] Alert PostgreSQLLowCacheHitRatio annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL cache hit ratio is low'
[alert-summary-style] Alert PostgreSQLReplicationLag annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL replication lag is high'
[alert-summary-style] Alert PostgreSQLDeadlocks annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL deadlocks detected'
[alert-summary-style] Alert PostgreSQLLongRunningQuery annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL has long-running query'
[alert-summary-style] Alert PostgreSQLBlockedQueries annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL has blocked queries'
[alert-summary-style] Alert PostgreSQLWALArchiveFailure annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL WAL archiving is failing'
[alert-summary-style] Alert PostgreSQLHighDeadTuples annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL table needs vacuum'
[alert-summary-style] Alert PostgreSQLVacuumNotRunning annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL table has not been vacuumed'
[alert-summary-style] Alert PostgreSQLTooManyRollbacks annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL has too many rollbacks'
[alert-summary-style] Alert PostgreSQLTooManyLocksAcquired annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL has acquired too many locks'
[alert-summary-style] Alert PostgreSQLInactiveReplicationSlot annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL has inactive replication slot'
[alert-summary-style] Alert PostgreSQLReplicationRoleChanged annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL replication role changed'
[alert-summary-style] Alert PostgreSQLExporterErrors annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL exporter has errors'
[alert-name-camelcase] Alert 'PostgreSQLHighQPS' name is not in camel case
[alert-summary-style] Alert PostgreSQLHighQPS annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL has high QPS'
[alert-summary-style] Alert PostgreSQLReplicationLagCritical annotation 'summary' must start with capital letter and end with period, is currently 'PostgreSQL replication lag exceeds 1 hour'
failed to load the dashboard-linter config file .lint: could not unmarshal lint configuration .lint: EOF
failed to load the dashboard-linter config file .lint: could not unmarshal lint configuration .lint: EOF
failed to load the dashboard-linter config file .lint: could not unmarshal lint configuration .lint: EOF
2025/12/05 17:47:07 failed to lint the file mixin.libsonnet: 22 lint errors found
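
Most of these findings are the same `alert-summary-style` complaint: the summaries lack a trailing period. One possible way to address them in bulk, sketched here under assumptions, is a small helper that appends the period when it is missing. `withPeriod` and the `rules` list are hypothetical names for illustration, not code from this PR; `std.endsWith` and `std.map` are Jsonnet standard library functions. The `alert-name-camelcase` finding for `PostgreSQLHighQPS` is a separate issue that this sketch does not cover.

```jsonnet
// Hypothetical helper: make a summary end with a period, as the linter expects.
local withPeriod(s) = if std.endsWith(s, '.') then s else s + '.';

// Hypothetical sample rules, using two summaries quoted in the lint output.
local rules = [
  { alert: 'PostgreSQLDown', annotations: { summary: 'PostgreSQL is down' } },
  { alert: 'PostgreSQLDeadlocks', annotations: { summary: 'PostgreSQL deadlocks detected' } },
];

// Patch each rule's summary while leaving its other fields untouched.
std.map(
  function(r) r { annotations+: { summary: withPeriod(super.summary) } },
  rules
)
```

Applying the helper where the rules are assembled keeps the individual alert definitions unchanged, at the cost of hiding the final summary text from anyone reading a single rule.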

expr: 'pg_stat_statements_seconds_total{job=~"$job",cluster=~"$cluster",instance=~"$instance"}',
format: 'table',
instant: true,
refId: 'TotalTime',
Member

This is getting cut off in your screenshot, just something to be aware of.

Contributor Author

Is it being cut off in the screenshot a concern for smaller windows? It renders fine on mine, but I have a 34-inch monitor.

Contributor Author

It works well on my 14-inch notebook screen as well.

unit: 'rows/s',
sources: {
postgres_exporter: {
expr: 'topk(10, rate(pg_stat_statements_rows_total{%(queriesSelector)s}[$__rate_interval]))',
Member

Having a selector for k (template variable) would probably be a good improvement here!

Additionally, since we're looking at up to 10 series here, we should consider a right-aligned table, that way you can also look at mean/max, etc.
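
A minimal sketch of what a selectable k might look like, written as plain dashboard JSON model in Jsonnet. The variable name `k`, its option values, and the contents of `queriesSelector` are assumptions made for illustration; only the query template itself is taken from the diff above.

```jsonnet
// Sketch of a user-selectable k for the topk() query.
// Assumed for illustration: the variable name 'k', its option values, and the
// contents of queriesSelector. Only the expr template comes from the PR.
local queriesSelector = 'job=~"$job",cluster=~"$cluster",instance=~"$instance"';

{
  // Grafana "custom" template variable in dashboard JSON model form.
  variable: {
    name: 'k',
    label: 'Top queries',
    type: 'custom',
    query: '5,10,20',
    current: { text: '10', value: '10' },
  },

  // Same query as in the PR, with the hard-coded 10 replaced by $k.
  expr: 'topk($k, rate(pg_stat_statements_rows_total{%(queriesSelector)s}[$__rate_interval]))' % { queriesSelector: queriesSelector },
}
```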

Contributor Author

Nice idea, will do. I'm not sure what you mean by "that way you can also look at mean/max, etc", though.
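
One reading of the suggestion is that a table-mode legend placed to the right of the graph can show per-series calculations such as mean and max next to each of the up to 10 series. A minimal sketch of the relevant timeseries panel options follows; the panel title is hypothetical, and this is not necessarily how the PR ended up implementing it.

```jsonnet
// Sketch of timeseries panel options that put the legend in a table on the
// right, with per-series mean and max columns.
{
  type: 'timeseries',
  title: 'Top queries by rows',  // hypothetical title
  options: {
    legend: {
      displayMode: 'table',
      placement: 'right',
      calcs: ['mean', 'max'],
    },
  },
}
```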

Contributor Author

image


{
// Cluster overview dashboard - Top-level view of the entire cluster
'postgres-cluster.json':
Member

Testing with the sample-app, I am getting data for roughly half the panels here.
I've not checked all the metrics, but I know that, for example, pg_up is present but not loading correctly on the overview dashboard with your queries.

Contributor Author

That would be because the sample-app is a standalone instance and probably missing the cluster label.
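
If the missing `cluster` label is indeed the cause, one commonly used mitigation is to give the variable an "All" value of `.*`, since a PromQL regex matcher that matches the empty string also selects series that lack the label entirely. The sketch below shows that in dashboard JSON model form; the `label_values(pg_up, cluster)` query and the other field values are assumptions, not this PR's actual variable definition.

```jsonnet
// Sketch of a 'cluster' variable whose "All" option also matches series
// that have no cluster label. All values here are assumed for illustration.
{
  name: 'cluster',
  type: 'query',
  query: 'label_values(pg_up, cluster)',
  includeAll: true,
  multi: true,
  allValue: '.*',  // matches the empty string, so unlabeled series are included
}
```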

Contributor Author

It might also be missing some required collectors which are not enabled by default.
I can look into the sample-app to sync it too.

@gaantunes
Contributor Author

gaantunes commented Dec 5, 2025

@Dasomeone I have switched to using the common-lib panels where appropriate and made some other changes based on your review suggestions. Here is how the dashboards look now.

Cluster Overview
image

Instance Overview
image

Queries Overview
image
