Description
Problem
The default Grafana dashboard shows reconciliation errors in the top overview panel using raw counters from controller_runtime_reconcile_total{result="error"}. Since these are monotonically increasing counters, they don't provide a clear view of the current error rate, making it difficult to detect active reconciliation issues.
Current behavior
The "Reconcile errors" panel (grid position: top row, after operator status) uses these queries:
# Backup reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="backup"}
# Cluster reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}
# Pooler reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="pooler"}
# Scheduled Backup reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller=~"scheduledbackup|scheduled-backup"}
These cumulative counters make it hard to distinguish old errors that occurred days ago from new errors happening right now.
The panel currently maps the cumulative values to different scopes (Backup=1-9, Cluster=10-99, Pooler=100-999, Scheduled Backup=1000-9999), but this doesn't show whether errors are currently occurring.
Expected behavior
The dashboard should show the rate of errors to help operators identify active reconciliation problems.
Proposed solution
Use the rate() function in the queries to show the per-second error rate, averaged over a time window:
Replace:
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}
With:
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}[5m]) > 0
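Or use increase() to show the number of errors within the time window (equivalent to rate() multiplied by the window length in seconds; the result is extrapolated, so it may not be an integer):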
increase(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}[5m])
The > 0 filter ensures the panel only shows active errors, making it immediately visible when reconciliation issues occur.
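The replacement above only spells out the `cluster` query. A sketch of how the same change might look for all four panel queries follows. It keeps the selectors from the current panel unchanged; the 5m window is carried over from the example and is an assumption that may need tuning to the scrape interval.

```promql
# Backup reconciliation errors (per-second rate over the last 5 minutes)
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="backup"}[5m]) > 0
# Cluster reconciliation errors
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}[5m]) > 0
# Pooler reconciliation errors
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="pooler"}[5m]) > 0
# Scheduled Backup reconciliation errors
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller=~"scheduledbackup|scheduled-backup"}[5m]) > 0
```

Because these queries return rates rather than cumulative counts, the panel's existing value ranges (1-9, 10-99, and so on) would presumably need to be revisited. With the `> 0` filter, a query returns no series at all while there are no errors, so the panel may also need a "no data" display setting.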
Benefits
✅ Shows active error rate instead of cumulative count
✅ Makes alerts more meaningful (alert when rate > threshold)
✅ Easier to correlate errors with recent changes/events
✅ Clear visual indication when reconciliation problems start/stop
✅ Better observability for production troubleshooting
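To illustrate the alerting point, a threshold condition on the same metric could look roughly like the sketch below. The 15m window, the threshold of 3, and the `cnpg-system` namespace are illustrative assumptions, not values taken from the dashboard.

```promql
# Hypothetical alert condition: more than 3 reconcile errors per controller
# within the last 15 minutes (window and threshold are assumptions).
# "cnpg-system" is assumed as the operator namespace; Grafana variables such as
# $operatorNamespace are not available in Prometheus/vmalert rule files.
sum by (controller) (
  increase(controller_runtime_reconcile_total{namespace="cnpg-system", result="error"}[15m])
) > 3
```

Grouping with `sum by (controller)` yields one result per controller, so an alert built on this expression can say which scope (backup, cluster, pooler, scheduled backup) is failing.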
Environment
CloudNative-PG version: 1.27.0
Kubernetes version: 1.28+
Monitoring stack: VictoriaMetrics/Prometheus + Grafana
Dashboard location: Top overview panel "Reconcile errors"
Additional context
We've implemented this improvement in our custom dashboard, and it significantly improved our ability to detect and respond to reconciliation issues in real time. The panel now clearly shows when active errors are occurring by scope (Backup/Cluster/Pooler/Scheduled Backup), rather than just showing a cumulative count.