feat: add cloudnative-pg-mixin #1469

Open: wants to merge 1 commit into base: master

13 changes: 13 additions & 0 deletions cloudnative-pg-mixin/.lint
@@ -0,0 +1,13 @@
exclusions:
  panel-title-description-rule:
    reason: "mixtool upgrade made this rule stricter. TODO: Fix errors and remove the warning exclusion"
  template-instance-rule:
    reason: "mixtool upgrade made this rule stricter. TODO: Fix errors and remove the warning exclusion"
  template-job-rule:
    reason: "mixtool upgrade made this rule stricter. TODO: Fix errors and remove the warning exclusion"
  template-label-promql-rule:
    reason: "mixtool upgrade made this rule stricter. TODO: Fix errors and remove the warning exclusion"
  target-instance-rule:
    reason: "mixtool upgrade made this rule stricter. TODO: Fix errors and remove the warning exclusion"
  target-job-rule:
    reason: "mixtool upgrade made this rule stricter. TODO: Fix errors and remove the warning exclusion"
1 change: 1 addition & 0 deletions cloudnative-pg-mixin/Makefile
@@ -0,0 +1 @@
include ../Makefile_mixin
25 changes: 25 additions & 0 deletions cloudnative-pg-mixin/README.md
@@ -0,0 +1,25 @@
# CloudNativePG Mixin

A monitoring mixin for [CloudNativePG](https://cloudnative-pg.io), providing Grafana dashboards and Prometheus alerting rules for PostgreSQL clusters running on Kubernetes.

## Dashboards

This mixin bundles the [Grafana dashboard provided by CloudNativePG](https://github.com/cloudnative-pg/grafana-dashboards/blob/cececeb393fb7c5400b4fa290aca68041293a127/charts/cluster/grafana-dashboard.json).

<picture>
<source media="(prefers-color-scheme: dark)" srcset="images/dashboard-dark.png">
<source media="(prefers-color-scheme: light)" srcset="images/dashboard-light.png">
<img alt="CloudNativePG Dashboard" src="images/dashboard-light.png">
</picture>

## Prometheus Alerts

This mixin bundles the sample [Prometheus Alert rules provided by CloudNativePG](https://github.com/cloudnative-pg/cloudnative-pg/blob/b7e9f07cf6fa2181bc5c9b8e82d4b37b27ee92ba/docs/src/samples/monitoring/alerts.yaml).

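- `LongRunningTransaction`: A query has been running for longer than 5 minutes (`cnpg_backends_max_tx_duration_seconds > 300`).
- `BackendsWaiting`: `cnpg_backends_waiting_total` is above 300. The upstream summary describes this as a backend waiting for longer than 5 minutes.
- `PGDatabase`: The number of transactions from the frozen XID to the current one exceeds 300,000,000.
- `PGReplication`: The standby is lagging behind the primary by more than 300 seconds.
- `LastFailedArchiveTime`: The last archiving failure is more recent than the last successful archive. The difference between the two timestamps is below 0 when archiving has not failed.
- `DatabaseDeadlockConflicts`: More than 10 deadlocks have been reported for a database.
- `ReplicaFailingReplication`: A replica is in recovery but its WAL receiver is not up.

All alerts use `for: 1m` and `severity: warning`.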
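
The rules listed above could be exercised with Prometheus rule unit tests. The following is a minimal sketch, not part of this mixin: it assumes a hypothetical test file named `alerts_test.yaml` placed in the mixin directory next to `alerts/`, and the pod name `cluster-example-2` is made up for illustration. The expected annotations are copied from `alerts/alerts.yaml`.

```yaml
# alerts_test.yaml (hypothetical file name, not included in this mixin)
rule_files:
  - alerts/alerts.yaml

evaluation_interval: 1m

tests:
  # A standby that stays more than 300 seconds behind should fire PGReplication.
  - interval: 1m
    input_series:
      - series: 'cnpg_pg_replication_lag{pod="cluster-example-2"}'
        values: '301+0x5'
    alert_rule_test:
      # At 0m the alert is only pending because of `for: 1m`.
      - eval_time: 0m
        alertname: PGReplication
      - eval_time: 2m
        alertname: PGReplication
        exp_alerts:
          - exp_labels:
              severity: warning
              pod: cluster-example-2
            exp_annotations:
              description: Standby cluster-example-2 is lagging behind by over 300 seconds
              summary: The standby is lagging behind the primary.

  # A replica in recovery (1) whose WAL receiver is down (0) should fire
  # ReplicaFailingReplication.
  - interval: 1m
    input_series:
      - series: 'cnpg_pg_replication_in_recovery{pod="cluster-example-2"}'
        values: '1+0x5'
      - series: 'cnpg_pg_replication_is_wal_receiver_up{pod="cluster-example-2"}'
        values: '0+0x5'
    alert_rule_test:
      - eval_time: 2m
        alertname: ReplicaFailingReplication
        exp_alerts:
          - exp_labels:
              severity: warning
              pod: cluster-example-2
            exp_annotations:
              description: Replica cluster-example-2 is failing to replicate
              summary: Checks if the replica is failing to replicate.
```

Under these assumptions the tests would be run with `promtool test rules alerts_test.yaml` from the mixin directory.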
66 changes: 66 additions & 0 deletions cloudnative-pg-mixin/alerts/alerts.yaml
@@ -0,0 +1,66 @@
groups:
  - name: cnp-default.rules
    rules:
      - alert: LongRunningTransaction
        annotations:
          description: Pod {{ $labels.pod }} is taking more than 5 minutes (300 seconds) for a query.
          summary: A query is taking longer than 5 minutes.
        expr: |-
          cnpg_backends_max_tx_duration_seconds > 300
        for: 1m
        labels:
          severity: warning
      - alert: BackendsWaiting
        annotations:
          description: Pod {{ $labels.pod }} has been waiting for longer than 5 minutes
          summary: If a backend is waiting for longer than 5 minutes.
        expr: |-
          cnpg_backends_waiting_total > 300
        for: 1m
        labels:
          severity: warning
      - alert: PGDatabase
        annotations:
          description: Over 300,000,000 transactions from frozen xid on pod {{ $labels.pod }}
          summary: Number of transactions from the frozen XID to the current one.
        expr: |-
          cnpg_pg_database_xid_age > 300000000
        for: 1m
        labels:
          severity: warning
      - alert: PGReplication
        annotations:
          description: Standby {{ $labels.pod }} is lagging behind by over 300 seconds
          summary: The standby is lagging behind the primary.
        expr: |-
          cnpg_pg_replication_lag > 300
        for: 1m
        labels:
          severity: warning
      - alert: LastFailedArchiveTime
        annotations:
          description: Archiving failed for {{ $labels.pod }}
          summary: Checks the last time archiving failed. Will be < 0 when it has not failed.
        expr: |-
          (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
        for: 1m
        labels:
          severity: warning
      - alert: DatabaseDeadlockConflicts
        annotations:
          description: There are over 10 deadlock conflicts in {{ $labels.pod }}
          summary: Checks the number of database conflicts.
        expr: |-
          cnpg_pg_stat_database_deadlocks > 10
        for: 1m
        labels:
          severity: warning
      - alert: ReplicaFailingReplication
        annotations:
          description: Replica {{ $labels.pod }} is failing to replicate
          summary: Checks if the replica is failing to replicate.
        expr: |-
          cnpg_pg_replication_in_recovery > cnpg_pg_replication_is_wal_receiver_up
        for: 1m
        labels:
          severity: warning