Dashboard improvement: Use rate() for reconciliation error metrics in top overview panel #48

@SyTry

Problem

The default Grafana dashboard's top overview panel shows reconciliation errors using the raw counter controller_runtime_reconcile_total{result="error"}. Because this counter only ever increases, its value does not reflect the current error rate, which makes active reconciliation issues hard to detect.

Current behavior

The "Reconcile errors" panel (grid position: top row, after operator status) uses these queries:

# Backup reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="backup"}

# Cluster reconciliation errors  
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}

# Pooler reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="pooler"}

# Scheduled Backup reconciliation errors
controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller=~"scheduledbackup|scheduled-backup"}

These cumulative counters make it hard to distinguish between:

- Old errors that occurred days ago
- New errors happening right now

The panel currently maps the cumulative values to per-scope ranges (Backup=1-9, Cluster=10-99, Pooler=100-999, Scheduled Backup=1000-9999), but this does not show whether errors are occurring right now.

Expected behavior

The dashboard should show the rate of errors to help operators identify active reconciliation problems.

Proposed solution

Use the rate() function in the queries to show the per-second error rate, averaged over a time window:

Replace:

controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}

With:

rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}[5m]) > 0

Or use increase() to show the number of errors within the time window:

increase(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}[5m])

The > 0 filter drops any series with no errors in the window, so the panel shows only controllers that are failing now and reconciliation issues become visible as soon as they occur.
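
Applied to all four controllers, the change might look like the sketch below. It assumes the label selectors from the current panel queries stay unchanged and reuses the 5m window from the cluster example; both are starting points to tune, not tested dashboard JSON. The existing value mappings are keyed on cumulative ranges, so they would probably need to be revisited as well; that part is not sketched here.

```promql
# Backup reconciliation errors (per-second rate, averaged over 5m)
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="backup"}[5m]) > 0

# Cluster reconciliation errors
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="cluster"}[5m]) > 0

# Pooler reconciliation errors
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller="pooler"}[5m]) > 0

# Scheduled Backup reconciliation errors
rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error", controller=~"scheduledbackup|scheduled-backup"}[5m]) > 0

# Alternative (assumes the panel can take one series per controller):
# a single query grouped by the controller label
sum by (controller) (
  rate(controller_runtime_reconcile_total{namespace=~"$operatorNamespace", result="error"}[5m])
) > 0
```

In Grafana, the fixed 5m window could be swapped for the built-in $__rate_interval variable so the window follows the scrape interval and panel resolution, if the data source in use supports it.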

Benefits

✅ Shows active error rate instead of cumulative count
✅ Makes alerts more meaningful (alert when rate > threshold)
✅ Easier to correlate errors with recent changes/events
✅ Clear visual indication when reconciliation problems start/stop
✅ Better observability for production troubleshooting
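
As a rough illustration of the alerting point, an expression along these lines could back an alert rule. The 5m window and the threshold of 0 are placeholders chosen for illustration, not values from the project, and the namespace selector is left out because the $operatorNamespace dashboard variable is not available outside Grafana panels.

```promql
# Returns one series per controller that recorded at least one
# reconcile error in the last 5m.
# Window and threshold are illustrative assumptions.
sum by (controller) (
  increase(controller_runtime_reconcile_total{result="error"}[5m])
) > 0
```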

Environment

CloudNative-PG version: 1.27.0
Kubernetes version: 1.28+
Monitoring stack: VictoriaMetrics/Prometheus + Grafana
Dashboard location: Top overview panel "Reconcile errors"

Additional context

We've implemented this improvement in our custom dashboard and it significantly improved our ability to detect and respond to reconciliation issues in real-time. The panel now clearly shows when active errors are occurring by scope (Backup/Cluster/Pooler/Scheduled Backup), rather than just showing a cumulative count.
