[DNM] Run validate-network service one stage later #595
base: main
|
Skipping CI for Draft Pull Request. |
evallesp
left a comment
LGTM!
Due to a race condition, the DNS service sometimes does not return the FQDN for compute nodes when the validate-network pods start. This patch starts the validate-network service one stage later, in the hope that this avoids the intermittent failure.
ed5a80f to 033c623
|
New changes are detected. LGTM label has been removed. |
Due to a race condition, the FQDN check can run before dnsmasq returns the FQDN for compute nodes. To avoid that, the check should retry for a couple of minutes until dnsmasq is ready and returns the FQDN. Ref: openstack-k8s-operators/architecture#595
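For illustration, the retry behaviour described in that commit message could look roughly like the sketch below. It is not the actual check from the referenced change. The function name `wait_for_fqdn`, the 120-second timeout, the 5-second interval, and the use of `socket.getfqdn` as the default lookup are all assumptions made for this example.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_fqdn(
    host: str,
    timeout_s: float = 120.0,   # assumed "couple of minutes"
    interval_s: float = 5.0,    # assumed polling interval
    resolver: Callable[[str], str] = socket.getfqdn,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll DNS until `host` resolves to a dotted (fully qualified) name."""
    deadline = clock() + timeout_s
    last_error: Optional[Exception] = None
    while True:
        try:
            name = resolver(host)
            # Treat a name containing a dot as an FQDN; a bare short name
            # means the DNS server is not answering with the domain yet.
            if "." in name.rstrip("."):
                return name
            last_error = ValueError(f"short name only: {name!r}")
        except OSError as exc:
            # Custom resolvers may raise socket errors while DNS is not ready.
            last_error = exc
        if clock() >= deadline:
            raise TimeoutError(f"no FQDN for {host} after {timeout_s}s") from last_error
        sleep(interval_s)
```

A retry like this only hides the symptom. It does not explain why dnsmasq is late.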
|
-1 If we really think validate-network should be run later, then we should change it in EDPM itself and update our documentation. @slagle what do you think? |
We don't have a root cause of the issue that indicates such a change is needed. I agree that changing the service order here isn't right either. |
@evallesp WDYT about this? It is true that we don't have a root cause. The issue happens randomly, and we are trying to avoid it by changing the order of services. What should the next step be, given that reordering services without a root cause is discouraged? |
|
@fultonj @slagle I agree with you, so I looked into the playbook that is called through the configure-network service. The playbook led me to the edpm_network_config role, which seems to be the last stage in the configure-network service. IIUC, the role finishes all the configuration while dnsmasq is still loading. The validate-network service then starts, and sometimes dnsmasq is not ready yet, which causes the issue. @sdatko Maybe this should be handled by networking experts rather than the ci-framework folks. @karelyatin Maybe you could help? |
We need to know why dnsmasq isn't ready. It's deployed as part of the OpenStackControlPlane resource, so when that reports ready, dnsmasq should be ready as well. Are we waiting for OpenStackControlPlane to be Ready=True before starting the dataplane deployment? We document this as a requirement in the docs, but don't show an example of using the oc wait command to check, which we should probably add. The command is what is done by the openstack_wait_deploy target in install_yamls. |
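As a rough sketch of the readiness check described above, the snippet below wraps `oc wait` for an OpenStackControlPlane resource. The resource name `controlplane`, the `openstack` namespace, the 30-minute timeout, and both helper function names are assumptions for this example. It is not copied from the `openstack_wait_deploy` target, which may invoke the command differently.

```python
import subprocess
from typing import Callable, List


def oc_wait_controlplane_cmd(
    name: str,
    namespace: str = "openstack",  # assumed namespace
    timeout: str = "30m",          # assumed timeout
) -> List[str]:
    """Build an `oc wait` command that blocks until Ready=True."""
    return [
        "oc", "wait", f"openstackcontrolplane/{name}",
        "--namespace", namespace,
        "--for=condition=Ready",
        f"--timeout={timeout}",
    ]


def wait_for_controlplane(
    name: str,
    namespace: str = "openstack",
    timeout: str = "30m",
    run: Callable = subprocess.run,
) -> bool:
    """Return True if the control plane reported Ready before the timeout."""
    result = run(oc_wait_controlplane_cmd(name, namespace, timeout), check=False)
    return result.returncode == 0
```

A job would call `wait_for_controlplane("controlplane")` and start the dataplane deployment only if it returns `True`.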
|
<< @karelyatin Maybe you could help? Do we have logs from a job where it failed? |
|
Unfortunately, the logs have been rotated, and since the failure occurs rarely, it is difficult to reproduce.
|
|
<< Unfortunately, logs have been rotated. And since it's occurring rarely, it |
|
recheck |
|
James wrote:
We do in HCI for example: https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/hci.yaml#L35 I see a similar call in these jobs: |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: amartyasinha The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |