
Conversation

@ardentperf (Contributor) commented Jan 7, 2026

This is a follow-up to #38. It adds the dependencies of the CNPG dashboard (kube-state-metrics, node-exporter, and the Prometheus default recording rules), adds a teardown script, and adds tests, including GitHub Actions that successfully run the tests on pushes and PRs. The tests are not as comprehensive as they could be, but they cover the basics of ensuring that the setup and teardown scripts work. For teardown, the test confirms that setup can successfully be re-run afterwards.
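For reference, a minimal sketch of what the teardown test exercises, assuming the demo/setup.sh path used elsewhere in this PR (the teardown script path here is an assumption, not the actual test code):

./demo/setup.sh        # initial setup
./demo/teardown.sh     # tear everything down (path is an assumption)
./demo/setup.sh        # confirm setup can be re-run cleanly afterwards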

This PR pins the versions of the CNPG dashboard dependencies it adds to the playground, and it also switches to pinning the versions of Grafana and Prometheus. Stability is important for the playground: it's intended for newcomers to CNPG, and we don't want them to have the frustrating experience of things being broken the first time they try out CNPG. This exact problem was encountered in the middle of testing this PR, due to an upstream CRD validation bug in grafana v5.21.4: it took some time to realize that v5.21.3 had originally been used and that the stack had silently switched to the broken version v5.21.4 in the middle of the day.
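To make the failure mode concrete, a hedged sketch of the pinning pattern (the variable name is illustrative, not the actual playground code):

# before: floating "latest" silently picks up new, possibly broken, releases
# GRAFANA_OPERATOR_VERSION=latest
# after: pinned to a known-good release
GRAFANA_OPERATOR_VERSION=v5.21.3   # v5.21.4 shipped a CRD validation bug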

The GitHub Actions CI/CD test intentionally uses the playground nix devshell to get dependencies like kind (which determines the version of Kubernetes used for testing). This ensures that it's easy for anyone to use the same versions of kind, Kubernetes, etc. that we are testing with. For now, the end-to-end test only runs the default setup, without parameters to setup.sh, and involves both eu and us clusters. Testing with at least two clusters ensures that MinIO and networking work correctly, since backups are used to create the replica cluster. The end-to-end tests also currently run with LEGACY=true for demo/setup.sh, because the setup script currently isn't working with the backup plugin on a clean setup with dependency versions pinned in nix.
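A hedged sketch of the invocation pattern (the exact workflow steps are in the PR; the test script name is taken from the comments below, and its path is an assumption):

# run the same test CI runs, inside the playground devshell, so tool
# versions (kind, kubectl, ...) match exactly what CI tested with
nix develop --command env LEGACY=true ./test-1-setup.sh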

Once demo/setup.sh supports custom regions (#40), we can parameterize these tests and set up GitHub Actions jobs in multiple configurations, like single-region and custom-region. Once non-legacy mode works with the playground, we can switch the tests to non-legacy mode (this controls whether backup plugins are used).

An example of a successful execution of the GitHub Action is available at https://github.com/ardentperf/cnpg-playground/actions/runs/20766915570

This PR should be merged after PRs #43, #45, and #47; testing was done on a temporary branch with all of those commits cherry-picked ahead of this PR.

@ardentperf requested a review from a team as a code owner on January 7, 2026 at 01:32
@ardentperf force-pushed the pr-monitoring-updates branch from 6764889 to cad0080 on January 7, 2026 at 17:10
@ardentperf (Contributor, Author) commented Jan 7, 2026

FYI: I just did a force push to remove a "pre-cleanup" step from test-1-setup.sh which removed any existing MinIO containers. I had it in the code for testing and hadn't intended to include it in the final PR.

@ardentperf force-pushed the pr-monitoring-updates branch 2 times, most recently from f89c6ff to 24e78d6 on January 7, 2026 at 19:37
@ardentperf (Contributor, Author) commented:
Two more force-push updates, which resulted from using the included GH Actions CI/CD tests to collect a repro and debugging info for issue #44.

First, I removed all cases where output was being suppressed. I think we want the output of the GH Action to fully reflect what end users see when they run the playground scripts (this also aids in debugging failures). Second, I moved the sleep earlier in the test: issue #44 is a case where the monitoring pods cannot be created by the operator, and there was a race condition where the error wasn't always emitted to the logs before the initial monitoring-stack health tests.
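Roughly, the ordering change looks like this (a hedged sketch; the duration, namespace, and check command are illustrative):

sleep 30                          # give the operator time to surface errors in the logs...
kubectl get pods -n monitoring    # ...before the initial monitoring stack health checks run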

In the interest of keeping this code lightweight and easy to review and read, I've opted not to lengthen the test code with lots of extra logic to look everywhere that errors might occur in underlying systems. I think the best approach is to keep the test code simple (as it is in this PR) and then do failure debugging on personal GH repos, where targeted debugging code can be added on demand; in many cases, it's easier to just run the tests locally instead of debugging on GH runners.

@ardentperf force-pushed the pr-monitoring-updates branch from 24e78d6 to 49858c5 on January 7, 2026 at 21:28
@ardentperf force-pushed the pr-monitoring-updates branch 3 times, most recently from 7237216 to 36446ef on January 8, 2026 at 07:50
@ardentperf (Contributor, Author) commented:
Current status: all metrics are working except for storage/PVC metrics.


Storage metrics are still missing because the playground setup currently uses local-path-provisioner for storage, which doesn't implement volume metrics.

ubuntu@cnpg1:~/cnpg-playground$ k get storageclasses
NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  75m

We might be able to address this with a very basic deployment of csi-driver-host-path, but I'll split that out separately and tackle it later.
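One hedged way to confirm the gap, assuming Prometheus is reachable on its default port via a port-forward:

# local-path-provisioner never exports kubelet volume stats,
# so this query returns an empty result set
curl -s 'http://localhost:9090/api/v1/query?query=kubelet_volume_stats_capacity_bytes'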

@ardentperf force-pushed the pr-monitoring-updates branch from 36446ef to 2e04ae8 on January 14, 2026 at 21:33
@ardentperf (Contributor, Author) commented:

Rebased and resolved conflicts with #53.

@ardentperf force-pushed the pr-monitoring-updates branch from 046452b to dc3bead on January 14, 2026 at 23:00
@ardentperf (Contributor, Author) commented:
The latest test run failed, after the rebase and the last two updates. I'll debug and update the PR: https://github.com/ardentperf/cnpg-playground/actions/runs/21012880095

Follow-up to cloudnative-pg#38: add dependencies of the CNPG dashboard
(kube-state-metrics, node-exporter, and Prometheus default recording
rules), add a teardown script, and add tests

Signed-off-by: Jeremy Schneider <[email protected]>
The CNPG dashboard was missing data because recording rule metrics only
appeared for kube-state-metrics, not for the PostgreSQL database pods.

Fixed by setting honorLabels: true on kube-state-metrics and kubelet
ServiceMonitors, which preserves the correct pod labels needed for the
recording rule joins to succeed.

Now all pods, including pg-{region}-*, appear in the recording rule metrics.

Signed-off-by: Jeremy Schneider <[email protected]>
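For context, a hedged sketch of what the honorLabels change looks like on a ServiceMonitor (the name, namespace, selector, and port are illustrative; the real manifests are in the PR):

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics        # name/namespace are assumptions
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  endpoints:
    - port: http-metrics
      honorLabels: true           # keep the target's own pod labels so recording rule joins succeed
EOF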
After rebasing, update the test scripts to use the centralized logic that
was introduced in cloudnative-pg#53

Signed-off-by: Jeremy Schneider <[email protected]>
The test script was previously doing a basic HTTP check on the Grafana
port to determine that it was alive. This upgrades that test to
explicitly confirm not only that Grafana is alive, but also that the
CNPG dashboard was created successfully.

Signed-off-by: Jeremy Schneider <[email protected]>
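A hedged sketch of what such a check can look like, using Grafana's search API (the credentials, port, and query string are assumptions, not the actual test code):

# fails unless the dashboard search returns at least one result
curl -sf -u admin:admin 'http://localhost:3000/api/search?query=CloudNativePG' | grep -q '"uid"'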
@ardentperf force-pushed the pr-monitoring-updates branch from dc3bead to 014ddda on January 15, 2026 at 03:43
@ardentperf (Contributor, Author) commented:
Rebased again; the tests fail without the latest commit on main.

Update monitoring/teardown.sh with the centralized logic and add a 10s
pause before testing that the db pods are successfully terminated

Signed-off-by: Jeremy Schneider <[email protected]>
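A hedged sketch of the pause-then-verify pattern (the cnpg.io/cluster label selector is an assumption based on CNPG's standard pod labels, not the actual teardown code):

sleep 10                                            # let terminating pods finish
if kubectl get pods -l cnpg.io/cluster --no-headers 2>/dev/null | grep -q .; then
  echo "ERROR: db pods still present" >&2; exit 1
fi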
@ardentperf (Contributor, Author) commented:
Successful E2E test: https://github.com/ardentperf/cnpg-playground/actions/runs/21019845258

(The test branch also includes #45 and #57, since those are required for the E2E tests to succeed.)
