
Conversation

@ardentperf
Contributor

There appears to be a timing-related bug or race condition where the
first kubectl apply for PodMonitors during the setup script execution
reports success (HTTP 201 Created from the API server) but the resource
doesn't persist. The exact root cause is unclear, but moving PodMonitor
creation to the end and giving the cluster a few seconds to stabilize
before creating the PodMonitors seems to work around the issue.

Closes #46

Signed-off-by: Jeremy Schneider <[email protected]>
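
For reference, a minimal sketch of the reordered flow (the manifest path and apply command below are illustrative, not the actual script contents; check_crd_existence is the helper the script already uses):

# Sketch of the workaround: apply the PodMonitors last, and only once the
# PodMonitor CRD exists, with a short settle delay first.
# (the manifest path below is hypothetical, not the real one in the repo)
if check_crd_existence podmonitors.monitoring.coreos.com
then
  sleep 5   # give the cluster a few seconds to stabilize
  kubectl apply -f demo/yaml/podmonitor.yaml
fi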

Comment on lines +120 to +123
# cf. https://github.com/cloudnative-pg/cnpg-playground/issues/46
if check_crd_existence podmonitors.monitoring.coreos.com
then
sleep 5
Member


Probably, to make sure that the PodMonitors are going to be created, all that's required is to check that the CRDs exist? Why would waiting 5 seconds here fix an issue that isn't related to the CNPG cluster? The PodMonitor object will be created by the Prometheus operator, so it's weird that you need to wait here.
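
For illustration, a check along the lines this comment suggests (not part of the PR) could block on the CRD becoming established instead of sleeping for a fixed interval:

# Illustrative alternative: wait for the PodMonitor CRD to be established
# rather than sleeping; the CRD name is taken from the snippet above.
kubectl wait --for condition=established --timeout=60s \
  crd/podmonitors.monitoring.coreos.com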

Contributor Author

@ardentperf Jan 7, 2026


fwiw, this cleanly reproduces on GH runners with the CI/CD test framework from #48, which is very close to being a repro on main (#48 doesn't change relevant existing code; it just adds new code and tests): https://github.com/ardentperf/cnpg-playground/actions/runs/20797196550/job/59733660471

Note that the only difference between PR #48 (which has a successful run of the test) and this failed run is that the branch repro-issue-46 doesn't include the commit from this PR.

From lines 682 and 687 in the GH Action test output, we can see that the PodMonitors disappear after CNPG finishes creating the clusters.

Yes, it's weird that the 5-second sleep and the ordering change work as a remediation. If you have cycles to debug, that would be great; I wasn't able to figure it out yet, and this PR seems harmless enough until someone gets an RCA and the underlying issue is fixed. I wonder if there's a race condition where the CNPG operator itself is removing the PodMonitor.
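
In case it helps with the RCA, a few hypothetical diagnostics (the resource names, and the enablePodMonitor / owner-reference details, are assumptions about how the operator manages PodMonitors, not something verified in this thread):

# Hypothetical debugging steps, not part of this PR; names are illustrative.

# Does the Cluster ask the CNPG operator to manage its own PodMonitor?
kubectl get clusters.postgresql.cnpg.io pg-eu \
  -o jsonpath='{.spec.monitoring.enablePodMonitor}'; echo

# Is the existing PodMonitor owned by the Cluster (i.e. operator-managed)?
kubectl get podmonitors.monitoring.coreos.com pg-eu \
  -o jsonpath='{.metadata.ownerReferences}'; echo

# Watch whether the object disappears while the clusters are being created
kubectl get podmonitors.monitoring.coreos.com -A --watch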


