feat: monitoring - CNPG dashboard deps, teardown, tests #48
Conversation
Force-pushed from 6764889 to cad0080
FYI - just now did a force push to remove a "pre-cleanup" step from
Force-pushed from f89c6ff to 24e78d6
Two more force-push updates, which resulted from using the enclosed GH Action CI/CD tests to collect a repro and debugging info for issue #44.

First, removed all cases where output was being suppressed. I think we want the output of the GH Action to fully reflect what end users see when they run the playground scripts (this also aids debugging failures).

Second, moved the sleep earlier in the test. Issue #44 is a case where the monitoring pods cannot be created by the operator, and there was a race condition where the error wasn't always emitted to the logs before the initial monitoring stack health tests.

In the interest of keeping this code lightweight and easy to review and read, I've opted not to lengthen the test code with lots of extra logic to look everywhere that errors might occur in underlying systems. I think the best approach is to keep the test code simple (as it is in this PR) and then do failure debugging on personal GH repos, where targeted debugging code can be added on demand - or, in many cases, just run the tests locally instead of debugging on GH runners.
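A minimal sketch of the reordering described above (the manifest path, namespace, and timing are illustrative, not the actual test code):

```sh
#!/usr/bin/env bash
set -euo pipefail

# No output suppression: the CI log should match what end users see locally.
kubectl apply -f monitoring/     # operator creates monitoring pods asynchronously

# Sleep moved earlier: if the operator can't create the pods (issue #44),
# give it time to emit the error to its logs before the health checks run.
sleep 30

kubectl get pods -n monitoring   # initial monitoring stack health check
```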
Force-pushed from 24e78d6 to 49858c5
Force-pushed from 7237216 to 36446ef
Force-pushed from 36446ef to 2e04ae8
Rebased and resolved conflicts with #53.
Force-pushed from 046452b to dc3bead
The latest test run failed, after the rebase and the last two updates. I'll debug and update the PR: https://github.com/ardentperf/cnpg-playground/actions/runs/21012880095
followup to cloudnative-pg#38 - add dependencies of the CNPG dashboard (kube-state-metrics, node-exporter, and prometheus default recording rules), add a teardown script, and add tests

Signed-off-by: Jeremy Schneider <[email protected]>
CNPG dashboard was missing data because recording rule metrics only
appeared for kube-state-metrics, not for PostgreSQL database pods.
Fixed by setting honorLabels: true on kube-state-metrics and kubelet
ServiceMonitors, which preserves the correct pod labels needed for the
recording rule joins to succeed.
Now all pods including pg-{region}-* appear in recording rule metrics.
Signed-off-by: Jeremy Schneider <[email protected]>
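For illustration, a minimal sketch of the change described in this commit, assuming a kube-state-metrics ServiceMonitor in a `monitoring` namespace (the names, labels, and port here are placeholders, not the exact manifests in this PR):

```sh
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  endpoints:
    - port: http-metrics
      # honorLabels keeps the pod/namespace labels carried by the scraped
      # metrics themselves, instead of overwriting them with the scrape
      # target's labels, so recording-rule joins against pg-{region}-* pods
      # line up correctly.
      honorLabels: true
EOF
```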
after rebasing, update test scripts to use centralized logic that was introduced in cloudnative-pg#53

Signed-off-by: Jeremy Schneider <[email protected]>
The test script was previously doing a basic HTTP check on the grafana port to determine that it was alive. This upgrades that test to explicitly confirm not only that grafana is alive, but that the CNPG dashboard was created successfully.

Signed-off-by: Jeremy Schneider <[email protected]>
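A sketch of the stronger check, assuming Grafana is reachable on localhost:3000 with default admin credentials and a dashboard title containing "CloudNativePG" (all of these are assumptions, not the exact test code):

```sh
# Old check: any HTTP response means "alive".
curl -fsS http://localhost:3000/api/health >/dev/null

# New check: the dashboard must actually exist. Grafana's search API returns
# a JSON array of matching dashboards; an empty array means it wasn't created.
if curl -fsS -u admin:admin 'http://localhost:3000/api/search?query=CloudNativePG' \
     | grep -q '"uid"'; then
  echo "CNPG dashboard found"
else
  echo "CNPG dashboard missing" >&2
  exit 1
fi
```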
Force-pushed from dc3bead to 014ddda
Rebased again; tests fail without the latest commit on main.
update monitoring/teardown.sh with centralized logic and add a 10s pause before testing that db pods are successfully terminated

Signed-off-by: Jeremy Schneider <[email protected]>
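A sketch of the pause-then-verify pattern this commit describes (the label selector is an assumption; CNPG does label its pods with `cnpg.io/cluster`):

```sh
./monitoring/teardown.sh

# Pod deletion is asynchronous; give the cluster a moment before asserting.
sleep 10

# Fail if any CNPG-managed pods are still present.
if kubectl get pods --all-namespaces -l cnpg.io/cluster --no-headers 2>/dev/null \
     | grep -q .; then
  echo "pods still present after teardown" >&2
  exit 1
fi
```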
Successful E2E test: https://github.com/ardentperf/cnpg-playground/actions/runs/21019845258 (the test branch also includes #45 and #57, since those are required for the E2E tests to succeed).

This is a followup to #38. It adds dependencies of the CNPG dashboard (kube-state-metrics, node-exporter, and prometheus default recording rules), adds a teardown script, and adds tests, including GitHub Actions that successfully run the tests on pushes and PRs. The tests are not as comprehensive as they could be, but they cover the basics of ensuring the setup and teardown scripts work. For teardown, the test confirms that we can successfully re-run setup afterwards.
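The basic cycle the teardown test exercises looks roughly like this (script paths assumed from this PR's description):

```sh
./monitoring/setup.sh      # initial setup must succeed
./monitoring/teardown.sh   # teardown must leave nothing behind ...
./monitoring/setup.sh      # ... so that a re-run of setup succeeds cleanly
```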
This PR pins the versions of the CNPG dashboard dependencies which it adds to the playground, and it also switches to pinning the versions of grafana and prometheus. Stability is important for the playground, since it's intended for newcomers to CNPG and we don't want them to have the frustrating experience of things being broken the first time they try out CNPG. This exact problem was encountered in the middle of testing this PR, due to a CRD validation bug upstream in grafana v5.21.4: it took some time to realize that v5.21.3 had originally been used, and that the stack had silently switched to the broken version v5.21.4 in the middle of the day.
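As an illustration of the kind of pinning involved (the install mechanism here is an assumption, not necessarily how the playground installs it), the grafana-operator chart can be fixed to the known-good version:

```sh
# Pin the operator to v5.21.3; v5.21.4 shipped the CRD validation bug
# mentioned above. (OCI chart location per the grafana-operator docs.)
helm upgrade --install grafana-operator \
  oci://ghcr.io/grafana/helm-charts/grafana-operator \
  --namespace monitoring --create-namespace \
  --version v5.21.3
```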
The GitHub Action CI/CD test intentionally uses the playground nix devshell to get dependencies like kind (which determines the version of kubernetes used for testing). This ensures that it's easy for anyone to use the same versions of kind, kubernetes, etc. that we are testing with. For now, the end-to-end test only runs the default setup, without parameters to `setup.sh`, and involves both `eu` and `us` clusters. Testing with at least two clusters ensures that minio and networking work correctly, since backups are used to create the replica cluster. The end-to-end tests also currently run with `LEGACY=true` for `demo/setup.sh`, because the setup script currently isn't working with the backup plugin on a clean setup with dependency versions pinned in nix.

Once `demo/setup.sh` supports custom regions (#40), we can parameterize these tests and set up GitHub Action jobs in multiple configurations like single-region and custom-region. Once non-legacy mode works with the playground, we can switch to non-legacy mode for testing (this controls whether backup plugins are used).

An example successful execution of the GitHub Action is available at https://github.com/ardentperf/cnpg-playground/actions/runs/20766915570
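To reproduce the CI environment locally, the same devshell can drive the run (a sketch; the exact entry point used by the action lives in the workflow file):

```sh
# kind, kubectl, etc. come from the flake, matching the versions CI tests with.
nix develop --command bash -c 'LEGACY=true ./demo/setup.sh'
```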
This PR should be merged after PRs #43, #45, and #47. Testing was done on a temporary branch with all of these commits cherry-picked ahead of this PR.