
Conversation

@ardentperf (Contributor) commented Jan 7, 2026

This is a follow-up to #38. It adds the dependencies of the CNPG dashboard (kube-state-metrics, node-exporter, and the Prometheus default recording rules), adds a teardown script, and adds tests, including GitHub Actions that successfully run the tests on pushes and PRs. The tests are not as comprehensive as they could be, but they cover the basics of ensuring that the setup and teardown scripts work. For teardown, the test confirms that setup can successfully be re-run afterwards.
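For reference, a minimal sketch of what the teardown test exercises, assuming the demo/setup.sh path used elsewhere in this PR (the teardown script path here is an assumption, not the actual test code):

./demo/setup.sh        # initial setup
./demo/teardown.sh     # tear everything down (path is an assumption)
./demo/setup.sh        # confirm setup can be re-run cleanly afterwards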

This PR pins the versions of the CNPG dashboard dependencies it adds to the playground, and it also switches to pinning the versions of Grafana and Prometheus. Stability is important for the playground: it's intended for newcomers to CNPG, and we don't want them to have the frustrating experience of things being broken the first time they try out CNPG. This exact problem was encountered in the middle of testing this PR, due to an upstream CRD validation bug in grafana v5.21.4: it took some time to realize that v5.21.3 had originally been used and that the stack had silently switched to the broken version v5.21.4 in the middle of the day.
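To make the failure mode concrete, a hedged sketch of the pinning pattern (the variable name is illustrative, not the actual playground code):

# before: floating "latest" silently picks up new, possibly broken, releases
# GRAFANA_OPERATOR_VERSION=latest
# after: pinned to a known-good release
GRAFANA_OPERATOR_VERSION=v5.21.3   # v5.21.4 shipped a CRD validation bug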

The GitHub Actions CI/CD test intentionally uses the playground nix devshell to get dependencies like kind (which determines the version of Kubernetes used for testing). This ensures that it's easy for anyone to use the same versions of kind, Kubernetes, etc. that we are testing with. For now, the end-to-end test only runs the default setup, without parameters to setup.sh, and involves both eu and us clusters. Testing with at least two clusters ensures that MinIO and networking work correctly, since backups are used to create the replica cluster. The end-to-end tests also currently run with LEGACY=true for demo/setup.sh, because the setup script currently isn't working with the backup plugin on a clean setup with dependency versions pinned in nix.
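A hedged sketch of the invocation pattern (the exact workflow steps are in the PR; the test script name is taken from the comments below, and its path is an assumption):

# run the same test CI runs, inside the playground devshell, so tool
# versions (kind, kubectl, ...) match exactly what CI tested with
nix develop --command env LEGACY=true ./test-1-setup.sh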

Once demo/setup.sh supports custom regions (#40), we can parameterize these tests and set up GitHub Actions jobs in multiple configurations, like single-region and custom-region. Once non-legacy mode works with the playground, we can switch the tests to non-legacy mode (this controls whether backup plugins are used).

An example of a successful execution of the GitHub Action is available at https://github.com/ardentperf/cnpg-playground/actions/runs/20766915570

This PR should be merged after PRs #43, #45, and #47; testing was done on a temporary branch with all of those commits cherry-picked ahead of this PR.

@ardentperf requested a review from a team as a code owner on January 7, 2026 at 01:32
@ardentperf force-pushed the pr-monitoring-updates branch from 6764889 to cad0080 on January 7, 2026 at 17:10
@ardentperf (Contributor, Author) commented Jan 7, 2026

FYI: I just did a force push to remove a "pre-cleanup" step from test-1-setup.sh which removed any existing MinIO containers. I had it in the code for testing and hadn't intended to include it in the final PR.

@ardentperf force-pushed the pr-monitoring-updates branch 2 times, most recently from f89c6ff to 24e78d6 on January 7, 2026 at 19:37
@ardentperf (Contributor, Author) commented:
Two more force-push updates, which resulted from using the included GH Actions CI/CD tests to collect a repro and debugging info for issue #44.

First, I removed all cases where output was being suppressed. I think we want the output of the GH Action to fully reflect what end users see when they run the playground scripts (this also aids in debugging failures). Second, I moved the sleep earlier in the test: issue #44 is a case where the monitoring pods cannot be created by the operator, and there was a race condition where the error wasn't always emitted to the logs before the initial monitoring-stack health tests.
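Roughly, the ordering change looks like this (a hedged sketch; the duration, namespace, and check command are illustrative):

sleep 30                          # give the operator time to surface errors in the logs...
kubectl get pods -n monitoring    # ...before the initial monitoring stack health checks run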

In the interest of keeping this code lightweight and easy to review and read, I've opted not to lengthen the test code with lots of extra logic to look everywhere that errors might occur in underlying systems. I think the best approach is to keep the test code simple (as it is in this PR) and then do failure debugging on personal GH repos, where targeted debugging code can be added on demand; in many cases, it's easier to just run the tests locally instead of debugging on GH runners.

@ardentperf force-pushed the pr-monitoring-updates branch from 24e78d6 to 49858c5 on January 7, 2026 at 21:28
@ardentperf force-pushed the pr-monitoring-updates branch 3 times, most recently from 7237216 to 36446ef on January 8, 2026 at 07:50
@ardentperf (Contributor, Author) commented:
Current status: all metrics are working except for storage/PVC metrics.


Storage metrics are still missing because the playground setup currently uses local-path-provisioner for storage, which doesn't implement volume metrics.

ubuntu@cnpg1:~/cnpg-playground$ k get storageclasses
NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  75m

We might be able to address this with a very basic deployment of csi-driver-host-path, but I'll split that out separately and tackle it later.
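One hedged way to confirm the gap, assuming Prometheus is reachable on its default port via a port-forward:

# local-path-provisioner never exports kubelet volume stats,
# so this query returns an empty result set
curl -s 'http://localhost:9090/api/v1/query?query=kubelet_volume_stats_capacity_bytes'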

@ardentperf force-pushed the pr-monitoring-updates branch from 36446ef to 2e04ae8 on January 14, 2026 at 21:33
@ardentperf (Contributor, Author) commented:

Rebased and resolved conflicts with #53.

@ardentperf force-pushed the pr-monitoring-updates branch from 046452b to dc3bead on January 14, 2026 at 23:00
@ardentperf (Contributor, Author) commented:
The latest test run failed, after the rebase and the last two updates. I'll debug and update the PR: https://github.com/ardentperf/cnpg-playground/actions/runs/21012880095

Follow-up to cloudnative-pg#38: add dependencies of the CNPG dashboard
(kube-state-metrics, node-exporter, and Prometheus default recording
rules), add a teardown script, and add tests

Signed-off-by: Jeremy Schneider <[email protected]>
The CNPG dashboard was missing data because recording rule metrics only
appeared for kube-state-metrics, not for the PostgreSQL database pods.

Fixed by setting honorLabels: true on kube-state-metrics and kubelet
ServiceMonitors, which preserves the correct pod labels needed for the
recording rule joins to succeed.

Now all pods, including pg-{region}-*, appear in the recording rule metrics.

Signed-off-by: Jeremy Schneider <[email protected]>
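For context, a hedged sketch of what the honorLabels change looks like on a ServiceMonitor (the name, namespace, selector, and port are illustrative; the real manifests are in the PR):

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics        # name/namespace are assumptions
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  endpoints:
    - port: http-metrics
      honorLabels: true           # keep the target's own pod labels so recording rule joins succeed
EOF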
After rebasing, update the test scripts to use the centralized logic that
was introduced in cloudnative-pg#53

Signed-off-by: Jeremy Schneider <[email protected]>
The test script was previously doing a basic HTTP check on the Grafana
port to determine that it was alive. This upgrades that test to
explicitly confirm not only that Grafana is alive, but also that the
CNPG dashboard was created successfully.

Signed-off-by: Jeremy Schneider <[email protected]>
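A hedged sketch of what such a check can look like, using Grafana's search API (the credentials, port, and query string are assumptions, not the actual test code):

# fails unless the dashboard search returns at least one result
curl -sf -u admin:admin 'http://localhost:3000/api/search?query=CloudNativePG' | grep -q '"uid"'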
@ardentperf force-pushed the pr-monitoring-updates branch from dc3bead to 014ddda on January 15, 2026 at 03:43
@ardentperf (Contributor, Author) commented:
Rebased again; the tests fail without the latest commit on main.

Update monitoring/teardown.sh with the centralized logic and add a 10s
pause before testing that the db pods are successfully terminated

Signed-off-by: Jeremy Schneider <[email protected]>
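A hedged sketch of the pause-then-verify pattern (the cnpg.io/cluster label selector is an assumption based on CNPG's standard pod labels, not the actual teardown code):

sleep 10                                            # let terminating pods finish
if kubectl get pods -l cnpg.io/cluster --no-headers 2>/dev/null | grep -q .; then
  echo "ERROR: db pods still present" >&2; exit 1
fi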
@ardentperf (Contributor, Author) commented:
Successful E2E test: https://github.com/ardentperf/cnpg-playground/actions/runs/21019845258

(The test branch also includes #45 and #57, since those are required for the E2E tests to succeed.)
