
@pperiyasamy
Member

Tests the stale SNAT and LRP next-hop fixes when an EIP failover and an ovnkube-node restart happen at around the same time.

cc @jechen0648

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit e42cf17c6f4e6a0b902781b0fbacbc3de2a9c01a)
Previously, when syncing EgressIP objects, the podAssignment cache was
initialized with empty egressStatuses for assigned pods. This commit
ensures that the pod state is updated with the correct EIP status during
the sync process.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit 2c8dc6f3fdd6e8cac9085ad063181594ededd767)
When an ovnkube-controller restart and an EgressIP (EIP) failover occur at the same
time, a race condition can leave the informer cache empty for the
EgressIP watcher. As a result, only the Add event for the EIP with the
newly assigned node is triggered, and the controller fails to update
the SNAT and LRP configuration for the previously assigned node.

This commit leverages the egressStatuses stored in the podAssignment
cache to reconcile and update those stale entries correctly.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit a69e7ceb47baafe47c82264936afab53ccb3f92b)
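For readers following along, the reconciliation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ovn-kubernetes code: the `egressStatus` type, the `staleStatuses` helper, and the node names and IPs are all hypothetical. The only idea taken from the commit messages is that the statuses remembered in the podAssignment cache are compared with the EgressIP's current statuses to find entries whose SNAT and LRP configuration still points at the previously assigned node.

```go
package main

import "fmt"

// egressStatus is a hypothetical stand-in for one EgressIP assignment:
// which node currently hosts which egress IP.
type egressStatus struct {
	Node     string
	EgressIP string
}

// staleStatuses returns the cached statuses that no longer appear in the
// current statuses of the EgressIP object. In this sketch, those are the
// entries whose SNAT and LRP configuration would need to be cleaned up.
func staleStatuses(cached, current []egressStatus) []egressStatus {
	live := make(map[egressStatus]struct{}, len(current))
	for _, s := range current {
		live[s] = struct{}{}
	}
	var stale []egressStatus
	for _, s := range cached {
		if _, ok := live[s]; !ok {
			stale = append(stale, s)
		}
	}
	return stale
}

func main() {
	// The cache still remembers node-a, but only an Add event for node-b
	// was delivered after the restart (hypothetical scenario).
	cached := []egressStatus{{Node: "node-a", EgressIP: "192.0.2.10"}}
	current := []egressStatus{{Node: "node-b", EgressIP: "192.0.2.10"}}
	fmt.Println(staleStatuses(cached, current))
}
```

The comparison only works if the cache was populated with real statuses during sync, which is why the earlier commit that stops initializing it with empty egressStatuses matters here.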
@openshift-ci openshift-ci bot requested review from kyrtapz and tssurya October 8, 2025 07:46
@openshift-ci
Contributor

openshift-ci bot commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pperiyasamy
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pperiyasamy pperiyasamy force-pushed the simultaneous_eip_failover_ovnkubenode_restart branch from f611325 to 847333f Compare October 10, 2025 11:19
@pperiyasamy
Member Author

/assign @huiran0826

…odes

Previously, we were checking whether the next hop IP is valid
for the current set of nodes, but
every EgressIP is assigned to only a subset of the total nodes.

Stale LRPs could occur if a node hosted EIP pods,
its ovnkube-controller was down, and the EIP moved
to a new node while said controller was down.

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit dad551464861df3906db09fb45ee14a12b9ce755)
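A rough sketch of the check described above, assuming a simplified model: next hops are validated against the nodes the EgressIP is actually assigned to, not against every node in the cluster. The function `splitNextHops`, the `nodeNextHop` map layout, and the addresses are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// splitNextHops partitions the next hops configured on a policy into those
// that belong to a node the EgressIP is currently assigned to (valid) and
// everything else (stale). nodeNextHop maps a node name to its next hop IP.
func splitNextHops(configured, assignedNodes []string, nodeNextHop map[string]string) (valid, stale []string) {
	allowed := make(map[string]struct{}, len(assignedNodes))
	for _, node := range assignedNodes {
		if ip, ok := nodeNextHop[node]; ok {
			allowed[ip] = struct{}{}
		}
	}
	for _, hop := range configured {
		if _, ok := allowed[hop]; ok {
			valid = append(valid, hop)
		} else {
			stale = append(stale, hop)
		}
	}
	return valid, stale
}

func main() {
	nodeNextHop := map[string]string{
		"node-a": "198.51.100.1",
		"node-b": "198.51.100.2",
	}
	// The EIP moved from node-a to node-b while the controller was down,
	// so the policy still carries node-a's next hop.
	configured := []string{"198.51.100.1", "198.51.100.2"}
	valid, stale := splitNextHops(configured, []string{"node-b"}, nodeNextHop)
	fmt.Println("valid:", valid, "stale:", stale)
}
```

In this model, checking against all nodes would accept node-a's next hop because node-a still exists, which is how a stale LRP next hop could survive a sync.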
For IC mode, there is no expectation that we can fetch
a remote node's LSP. By skipping (continue) when that
lookup fails, we also skip generating valid next
hops for the remote node.

Later, when syncing LRPs, a valid next hop is inspected,
is not found in the valid set, and is removed.

Handlers re-add it shortly afterwards.

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit 33dcfdca2b000370791112d76b2ef47e580e5ca5)
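To illustrate the control-flow problem described above, here is a minimal sketch under assumed names. The `node` type, its `RemoteNextHop` field, and the `nextHops` function are hypothetical, and how a remote node's next hop is really derived is not stated in the commit message. The point is only that a failed LSP lookup should not `continue` past remote nodes, because their next hops are still valid.

```go
package main

import "fmt"

// node is a hypothetical view of a cluster node as seen by one zone's
// controller in interconnect (IC) mode.
type node struct {
	Name          string
	Local         bool   // true when the node belongs to this controller's zone
	RemoteNextHop string // assumed to be known for remote nodes without an LSP lookup
}

// nextHops builds the set of valid next hops per node. localLSPs maps a
// local node name to the address read from its LSP; remote nodes are not
// expected to appear in it.
func nextHops(nodes []node, localLSPs map[string]string) map[string]string {
	hops := make(map[string]string)
	for _, n := range nodes {
		if !n.Local {
			// Do not attempt an LSP lookup for remote nodes; skipping them
			// here would drop a valid next hop from the result.
			if n.RemoteNextHop != "" {
				hops[n.Name] = n.RemoteNextHop
			}
			continue
		}
		addr, ok := localLSPs[n.Name]
		if !ok {
			continue // local node without an LSP: nothing to derive
		}
		hops[n.Name] = addr
	}
	return hops
}

func main() {
	nodes := []node{
		{Name: "local-1", Local: true},
		{Name: "remote-1", RemoteNextHop: "198.51.100.7"},
	}
	fmt.Println(nextHops(nodes, map[string]string{"local-1": "192.0.2.1"}))
}
```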
Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit bc61618e6047fa336fccbb747c968cfc7e054c2c)
Prior to this change, we did not emit an error log
for stale next hops.

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit 62e67e77b24b7ec1d24cd5cd2ed994239d6051a8)
@openshift-ci
Contributor

openshift-ci bot commented Oct 14, 2025

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade | 8cdbaae | link | true | /test 4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration | 3bfbb89 | link | true | /test e2e-aws-ovn-shared-to-local-gateway-mode-migration |
| ci/prow/qe-perfscale-payload-control-plane-6nodes | 3bfbb89 | link | true | /test qe-perfscale-payload-control-plane-6nodes |
| ci/prow/e2e-gcp-ovn-techpreview | 3bfbb89 | link | true | /test e2e-gcp-ovn-techpreview |
| ci/prow/e2e-azure-ovn-upgrade | 3bfbb89 | link | true | /test e2e-azure-ovn-upgrade |
| ci/prow/4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade-ipsec | 3bfbb89 | link | false | /test 4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade-ipsec |
| ci/prow/e2e-aws-ovn-upgrade-local-gateway | 3bfbb89 | link | true | /test e2e-aws-ovn-upgrade-local-gateway |
| ci/prow/security | 3bfbb89 | link | false | /test security |
| ci/prow/lint | 3bfbb89 | link | true | /test lint |
| ci/prow/e2e-aws-ovn-edge-zones | 3bfbb89 | link | true | /test e2e-aws-ovn-edge-zones |
| ci/prow/e2e-aws-ovn-upgrade | 3bfbb89 | link | true | /test e2e-aws-ovn-upgrade |

