
Conversation

@ardentperf
Contributor

There appears to be a timing-related bug or race condition where the
first kubectl apply for PodMonitors during the setup script execution
reports success (HTTP 201 Created from the API server) but the resource
doesn't persist. The exact root cause is unclear, but moving PodMonitor
creation to the end and giving the cluster a few seconds to stabilize
before creating the PodMonitors seems to work around the issue.

Closes #46

Signed-off-by: Jeremy Schneider <[email protected]>
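
For reference, a minimal sketch of the reordered flow (the manifest path and apply command below are illustrative, not the actual script contents; check_crd_existence is the helper the script already uses):

# Sketch of the workaround: apply the PodMonitors last, and only once the
# PodMonitor CRD exists, with a short settle delay first.
# (the manifest path below is hypothetical, not the real one in the repo)
if check_crd_existence podmonitors.monitoring.coreos.com
then
  sleep 5   # give the cluster a few seconds to stabilize
  kubectl apply -f demo/yaml/podmonitor.yaml
fi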

Comment on lines +120 to +123
# cf. https://github.com/cloudnative-pg/cnpg-playground/issues/46
if check_crd_existence podmonitors.monitoring.coreos.com
then
sleep 5
Member


Probably, to make sure that the PodMonitors are going to be created, all that's required is to check that the CRDs exist? Why would waiting 5 seconds here fix an issue that isn't related to the CNPG cluster? The PodMonitor object will be created by the Prometheus operator, so it's weird that you need to wait here.
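
For illustration, a check along the lines this comment suggests (not part of the PR) could block on the CRD becoming established instead of sleeping for a fixed interval:

# Illustrative alternative: wait for the PodMonitor CRD to be established
# rather than sleeping; the CRD name is taken from the snippet above.
kubectl wait --for condition=established --timeout=60s \
  crd/podmonitors.monitoring.coreos.com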

Contributor Author

@ardentperf Jan 7, 2026


fwiw, this cleanly reproduces on GH runners with the CI/CD test framework from #48, which is very close to being a repro on main (#48 doesn't change relevant existing code; it just adds new code and tests): https://github.com/ardentperf/cnpg-playground/actions/runs/20797196550/job/59733660471

Note that the only difference between PR #48 (which has a successful run of the test) and this failed run is that the branch repro-issue-46 doesn't include the commit from this PR.

From lines 682 and 687 in the GH Action test output, we can see that the PodMonitors disappear after CNPG finishes creating the clusters.

Yes, it's weird that the 5-second sleep and the ordering change work as a remediation. If you have cycles to debug, that would be great; I wasn't able to figure it out yet, and this PR seems harmless enough until someone gets an RCA and the underlying issue is fixed. I wonder if there's a race condition where the CNPG operator itself is removing the PodMonitor.
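
In case it helps with the RCA, a few hypothetical diagnostics (the resource names, and the enablePodMonitor / owner-reference details, are assumptions about how the operator manages PodMonitors, not something verified in this thread):

# Hypothetical debugging steps, not part of this PR; names are illustrative.

# Does the Cluster ask the CNPG operator to manage its own PodMonitor?
kubectl get clusters.postgresql.cnpg.io pg-eu \
  -o jsonpath='{.spec.monitoring.enablePodMonitor}'; echo

# Is the existing PodMonitor owned by the Cluster (i.e. operator-managed)?
kubectl get podmonitors.monitoring.coreos.com pg-eu \
  -o jsonpath='{.metadata.ownerReferences}'; echo

# Watch whether the object disappears while the clusters are being created
kubectl get podmonitors.monitoring.coreos.com -A --watch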


