@amartyasinha

Due to a race condition, the DNS service sometimes does not return the FQDN for compute nodes when the validate-network pods start. This patch starts the validate-network service one stage later, in the hope that this avoids the intermittent failure.

@openshift-ci

openshift-ci bot commented Aug 4, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

evallesp previously approved these changes Aug 6, 2025
Contributor

@evallesp evallesp left a comment

LGTM!

@openshift-ci openshift-ci bot added the lgtm label Aug 6, 2025
@amartyasinha amartyasinha changed the title from "[DNM] Run validate-network service one stage later" to "Run validate-network service one stage later" Aug 6, 2025
@openshift-ci

openshift-ci bot commented Aug 6, 2025

New changes are detected. LGTM label has been removed.

@amartyasinha amartyasinha marked this pull request as ready for review August 6, 2025 09:01
@openshift-ci openshift-ci bot requested review from abays and fultonj August 6, 2025 09:01
amartyasinha added a commit to amartyasinha/edpm-ansible that referenced this pull request Aug 6, 2025
Due to a race condition, the FQDN check sometimes runs before dnsmasq returns the FQDN for compute nodes. To avoid that, the check should retry for a couple of minutes until dnsmasq is ready and returns the FQDN.

Ref: openstack-k8s-operators/architecture#595
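
For illustration, the retry described in the commit message above could look roughly like the sketch below. This is not the actual edpm-ansible change. The function name `wait_for_fqdn`, the `RESOLVER` variable, the use of `getent hosts` as the lookup, and the default retry count and delay are all assumptions made for this example.

```shell
# Hypothetical sketch: retry a name lookup until it succeeds or attempts run out.
# RESOLVER is an assumption; any command that exits 0 on a successful lookup works.
RESOLVER="${RESOLVER:-getent hosts}"

wait_for_fqdn() {
    fqdn="$1"            # name to resolve
    attempts="${2:-24}"  # number of tries (assumed default)
    delay="${3:-5}"      # seconds between tries (assumed; 24 tries x 5s is about 2 minutes)
    i=1
    while [ "$i" -le "$attempts" ]; do
        # $RESOLVER is intentionally unquoted so "getent hosts" splits into words
        if $RESOLVER "$fqdn" >/dev/null 2>&1; then
            echo "resolved ${fqdn} on attempt ${i}"
            return 0
        fi
        i=$((i + 1))
        # Only sleep when another attempt will follow
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    echo "gave up resolving ${fqdn} after ${attempts} attempts" >&2
    return 1
}
```

A caller would run something like `wait_for_fqdn <compute-node-fqdn> 24 5` before the FQDN check and fail the task if it returns non-zero. Retrying only hides the race; it does not explain why dnsmasq is late.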
@fultonj
Contributor

fultonj commented Aug 11, 2025

-1

If we really think validate-network should be run later, then we should change it in EDPM itself and update our documentation.

@slagle what do you think?

@slagle
Contributor

slagle commented Aug 11, 2025

> -1
>
> If we really think validate-network should be run later, then we should change it in EDPM itself and update our documentation.
>
> @slagle what do you think?

We don't have a root cause of the issue that indicates such a change is needed. I agree that changing the service order here isn't right either.

@amartyasinha
Author

amartyasinha commented Aug 12, 2025

> If we really think validate-network should be run later, then we should change it in EDPM itself and update our documentation.

> We don't have a root cause of the issue that indicates such a change is needed. I agree that changing the service order here isn't right either.

@evallesp WDYT about this? It is true that we don't have a root cause. The issue happens randomly, and we were trying to avoid it by changing the order of services. What should the next step be, given that reordering services without a root cause is discouraged?

@amartyasinha
Author

amartyasinha commented Aug 12, 2025

@fultonj @slagle I agree with you, so I looked into the playbook, which is called through the configure-network service.

The playbook led me to the edpm_network_config role, which seems to be the last stage of the configure-network service. IIUC, the role completes all of its configuration while dnsmasq is still loading. The validate-network service then starts, and sometimes dnsmasq is not ready yet, which causes the issue.

@sdatko Maybe this should be handled by networking experts rather than the ci-framework folks.

@karelyatin Maybe you could help?

@amartyasinha amartyasinha added the do-not-merge label Aug 12, 2025
@slagle
Contributor

slagle commented Aug 13, 2025

> @fultonj @slagle I agree with you, so I looked into the playbook, which is called through the configure-network service.
>
> The playbook led me to this edpm_network_config role, and it seems to be the last stage in the configure-network service. IIUC, the role is doing all the configuration, but the loading of dnsmasq is still in progress. Then, the validate-network service is initiated, and sometimes dnsmasq is not ready, causing the issue.

We need to know why dnsmasq isn't ready. It's deployed as part of the OpenStackControlPlane resource, so when that reports ready, dnsmasq should be ready as well.

Are we waiting for OpenStackControlPlane to be Ready=True before starting the dataplane deployment? We document this as a requirement, but the docs don't show an example of using the oc wait command to check it, which we should probably add. The openstack_wait_deploy target in install_yamls runs that same command.
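
A documentation example along these lines might look like the sketch below. The resource kind, the `controlplane` resource name, the `openstack` namespace, and `--for condition=Ready` match the `oc wait` calls used in the architecture repository's automation var files. The wrapper function name and the 30-minute timeout are assumptions, not values taken from the docs or from install_yamls.

```shell
# Hypothetical helper: block until the OpenStackControlPlane reports Ready.
wait_for_controlplane() {
    namespace="${1:-openstack}"   # namespace used by the CI jobs
    name="${2:-controlplane}"     # resource name; adjust to your deployment
    timeout="${3:-30m}"           # assumed value; tune for your environment
    oc -n "$namespace" wait openstackcontrolplane "$name" \
        --for condition=Ready --timeout="$timeout"
}

# Usage (hypothetical): start the dataplane deployment only after this succeeds.
# wait_for_controlplane openstack controlplane 30m && echo "control plane is ready"
```

`oc wait` exits non-zero if the condition is not met before the timeout, so a deployment script can stop at this point instead of starting the dataplane against a control plane that is not ready.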

@karelyatin
Contributor

<< @karelyatin Maybe you could help?

Do we have logs from a job where it failed?

@amartyasinha
Author

amartyasinha commented Aug 13, 2025 via email

@karelyatin
Contributor

<< Unfortunately, logs have been rotated. And since it's occurring rarely, it is difficult to reproduce it.

Ack, then let's hold this and the edpm-ansible PR. As slagle pointed out, we should check the issue once we have logs and/or a reproducer.

@evallesp evallesp changed the title from "Run validate-network service one stage later" to "[DNM] Run validate-network service one stage later" Aug 13, 2025
@evallesp
Contributor

recheck

@fultonj
Contributor

fultonj commented Aug 15, 2025

James wrote:

> Are we waiting for OpenStackControlPlane to be Ready=True before starting the dataplane deployment?

We do in HCI, for example:

https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/hci.yaml#L35

I see a similar call in these jobs:

[zuul@controller-0 vars]$ grep "openstackcontrolplane" *.yaml
bgp-l3-xl.yaml:            oc -n openstack wait openstackcontrolplane
bgp_dt01.yaml:            oc -n openstack wait openstackcontrolplane
bmo01.yaml:            oc -n openstack wait openstackcontrolplane
dz-storage.yaml:            oc -n openstack wait openstackcontrolplane
osasinfra-ipv6.yaml:            oc -n openstack wait openstackcontrolplane
osasinfra.yaml:            oc -n openstack wait openstackcontrolplane
uni01alpha.yaml:            oc -n openstack wait openstackcontrolplane
uni02beta.yaml:            oc -n openstack wait openstackcontrolplane
uni05epsilon.yaml:            oc -n openstack wait openstackcontrolplane
uni06zeta.yaml:            oc -n openstack wait openstackcontrolplane
uni07eta.yaml:            oc -n openstack wait openstackcontrolplane
[zuul@controller-0 vars]$
[zuul@controller-0 vars]$ grep controlplane *.yaml
bgp-l3-xl.yaml:            oc -n openstack wait openstackcontrolplane
bgp-l3-xl.yaml:            controlplane
bgp-l3-xl.yaml:          - name: Create BGPConfiguration after controlplane is deployed
bgp_dt01.yaml:            oc -n openstack wait openstackcontrolplane
bgp_dt01.yaml:            controlplane
bmo01.yaml:            oc -n openstack wait openstackcontrolplane
bmo01.yaml:            controlplane
dcn.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
dz-storage.yaml:            oc -n openstack wait openstackcontrolplane
dz-storage.yaml:            controlplane
dz-storage.yaml:          - name: Create BGPConfiguration after controlplane is deployed
hci.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
multi-namespace.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
multi-namespace.yaml:            oc -n openstack2 wait osctlplane controlplane --for condition=Ready
nfv-ovs-dpdk-sriov-hci.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
nova-three-cells.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
nova01alpha.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
nvidia-mdev.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
osasinfra-ipv6.yaml:            oc -n openstack wait openstackcontrolplane
osasinfra-ipv6.yaml:            controlplane
osasinfra.yaml:            oc -n openstack wait openstackcontrolplane
osasinfra.yaml:            controlplane
ovs-dpdk-sriov-2nodesets.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
ovs-dpdk-sriov-networker.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
ovs-dpdk-sriov.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
ovs-dpdk.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
pidone.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
sriov.yaml:            oc -n openstack wait osctlplane controlplane --for condition=Ready
uni01alpha.yaml:            oc -n openstack wait openstackcontrolplane
uni01alpha.yaml:            controlplane
uni02beta.yaml:            oc -n openstack wait openstackcontrolplane
uni02beta.yaml:            controlplane
uni04delta-ipv6.yaml:            oc -n openstack wait osctlplane controlplane
uni04delta.yaml:            oc -n openstack wait osctlplane controlplane
uni05epsilon.yaml:            oc -n openstack wait openstackcontrolplane
uni05epsilon.yaml:            controlplane
uni06zeta.yaml:            oc -n openstack wait openstackcontrolplane
uni06zeta.yaml:            controlplane
uni07eta.yaml:            oc -n openstack wait openstackcontrolplane
uni07eta.yaml:            controlplane
[zuul@controller-0 vars]$ 

@openshift-ci

openshift-ci bot commented Oct 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: amartyasinha
Once this PR has been reviewed and has the lgtm label, please assign fultonj for approval. For more information see the Code Review Process.

