[DNM] Simultaneous eip failover ovnkube-node restart #2788

pperiyasamy · 2025-10-08T07:43:45Z

Tests the stale SNAT and LRP next hop fixes when an EIP failover and an ovnkube-node restart happen at around the same time.

Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit e42cf17c6f4e6a0b902781b0fbacbc3de2a9c01a)

Previously, when syncing EgressIP objects, the podAssignment cache was initialized with empty egressStatuses for assigned pods. This commit ensures that the pod state is updated with the correct EIP status during the sync process. Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit 2c8dc6f3fdd6e8cac9085ad063181594ededd767)

When an ovnkube-controller restart and an EgressIP (EIP) failover occur at the same time, a race condition can leave the informer cache empty for the EgressIP watcher. As a result, only the Add event for the EIP with the newly assigned node is triggered, and the controller fails to update the SNAT and LRP configuration for the previously assigned node. This commit leverages the egressStatuses stored in the podAssignment cache to reconcile and update those stale entries correctly. Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit a69e7ceb47baafe47c82264936afab53ccb3f92b)
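
The reconciliation idea described above can be illustrated with a minimal sketch. It is not the code from this PR: the `egressStatus` type and the `staleStatuses` helper are hypothetical names, and the sketch only assumes that the per-pod cache remembers which (node, EgressIP) pairs were programmed, so that anything cached but absent from the current EgressIP status can be treated as stale.

```go
package main

import (
	"fmt"
	"sort"
)

// egressStatus is a hypothetical stand-in for one (node, EgressIP) assignment.
type egressStatus struct {
	Node     string
	EgressIP string
}

// staleStatuses returns the cached statuses that are missing from the
// current EgressIP status, sorted so the result is deterministic.
func staleStatuses(cached map[egressStatus]struct{}, current []egressStatus) []egressStatus {
	live := make(map[egressStatus]struct{}, len(current))
	for _, s := range current {
		live[s] = struct{}{}
	}
	var stale []egressStatus
	for s := range cached {
		if _, ok := live[s]; !ok {
			stale = append(stale, s)
		}
	}
	sort.Slice(stale, func(i, j int) bool {
		if stale[i].Node != stale[j].Node {
			return stale[i].Node < stale[j].Node
		}
		return stale[i].EgressIP < stale[j].EgressIP
	})
	return stale
}

func main() {
	// The cache still remembers the assignment to node1 from before the failover.
	cached := map[egressStatus]struct{}{
		{Node: "node1", EgressIP: "192.0.2.10"}: {},
	}
	// After the failover, the EgressIP object reports node2 only.
	current := []egressStatus{{Node: "node2", EgressIP: "192.0.2.10"}}

	for _, s := range staleStatuses(cached, current) {
		fmt.Printf("clean up SNAT and LRP next hop for %s on %s\n", s.EgressIP, s.Node)
	}
}
```

With only an Add event for the new node, the controller has no Update or Delete event to tell it about node1; in this sketch the cached status plays that role.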

openshift-ci · 2025-10-08T07:46:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pperiyasamy
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Periyasamy Palanisamy <[email protected]>

pperiyasamy · 2025-10-10T13:55:18Z

/assign @huiran0826

…odes Previously, we checked whether a next hop IP was valid against the current set of all nodes, but every EgressIP is assigned to only a subset of those nodes. Stale LRPs could occur if a node hosted EIP pods, its ovnkube-controller went down, and the EIP moved to a new node while that controller was down. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit dad551464861df3906db09fb45ee14a12b9ce755)
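
A rough sketch of the narrower check described in this commit message follows. The function `splitNextHops`, the node names, and the IP addresses are all hypothetical and chosen for illustration; the point is only that the valid set is built from the nodes the EgressIP is assigned to, not from every node in the cluster.

```go
package main

import "fmt"

// splitNextHops partitions the next hops found on an LRP into valid and
// stale ones. validForEIP must contain only next hop IPs that belong to
// nodes this particular EgressIP is currently assigned to.
func splitNextHops(nextHops []string, validForEIP map[string]bool) (valid, stale []string) {
	for _, hop := range nextHops {
		if validForEIP[hop] {
			valid = append(valid, hop)
		} else {
			stale = append(stale, hop)
		}
	}
	return valid, stale
}

func main() {
	// Hypothetical next hop IP per node in the cluster.
	nodeNextHop := map[string]string{
		"node1": "100.64.0.2",
		"node2": "100.64.0.3",
		"node3": "100.64.0.4",
	}
	// The EgressIP moved from node1 to node2 while the controller was down.
	assignedNodes := []string{"node2"}

	validForEIP := make(map[string]bool)
	for _, n := range assignedNodes {
		validForEIP[nodeNextHop[n]] = true
	}

	// The LRP still carries the old next hop for node1.
	valid, stale := splitNextHops([]string{"100.64.0.2", "100.64.0.3"}, validForEIP)
	fmt.Println("valid:", valid)
	fmt.Println("stale:", stale)
}
```

Had the valid set been built from all three nodes, the node1 next hop would have passed the check even though the EgressIP no longer lives there.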

In IC mode, there is no expectation that we can fetch a remote node's LSP. Skipping the node (continue) when the LSP lookup fails therefore also skips generating valid next hops for that remote node. Later, when syncing LRPs, a valid next hop is inspected, judged invalid, and removed. Handlers re-add it shortly afterwards. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 33dcfdca2b000370791112d76b2ef47e580e5ca5)
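
The control-flow issue can be sketched as below. This is an illustration under assumptions, not the actual change: `nodeInfo`, `validNextHops`, the `hasLSP` callback, and the precomputed `NextHop` field are hypothetical. The sketch only shows the LSP lookup gating local-zone nodes, so that a failed lookup for a remote node no longer drops that node's next hop from the valid set.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeInfo is a hypothetical view of a node as seen by one zone's controller.
type nodeInfo struct {
	Name    string
	IsLocal bool   // true when the node belongs to this controller's zone
	NextHop string // hypothetical precomputed next hop IP for the node
}

var errNoLSP = errors.New("logical switch port not found")

// validNextHops builds the set of next hops treated as valid during sync.
// The LSP lookup is required only for local-zone nodes; remote nodes are
// not skipped when their LSP is unavailable.
func validNextHops(nodes []nodeInfo, hasLSP func(name string) error) map[string]bool {
	valid := make(map[string]bool)
	for _, n := range nodes {
		if n.IsLocal {
			if err := hasLSP(n.Name); err != nil {
				continue // a local node without an LSP cannot serve as a next hop
			}
		}
		valid[n.NextHop] = true
	}
	return valid
}

func main() {
	// Only local-zone LSPs are present in this zone's database.
	localLSPs := map[string]bool{"node1": true}
	lookup := func(name string) error {
		if !localLSPs[name] {
			return errNoLSP
		}
		return nil
	}

	nodes := []nodeInfo{
		{Name: "node1", IsLocal: true, NextHop: "100.64.0.2"},
		{Name: "node2", IsLocal: false, NextHop: "100.88.0.3"},
	}
	valid := validNextHops(nodes, lookup)
	fmt.Println("node1 next hop valid:", valid["100.64.0.2"])
	fmt.Println("node2 next hop valid:", valid["100.88.0.3"])
}
```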

Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit bc61618e6047fa336fccbb747c968cfc7e054c2c)

Prior to this change, we did not emit an error log for stale next hops. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 62e67e77b24b7ec1d24cd5cd2ed994239d6051a8)

openshift-ci · 2025-10-14T19:58:13Z

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade | `8cdbaae` | link | true | `/test 4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade` |
| ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration | `3bfbb89` | link | true | `/test e2e-aws-ovn-shared-to-local-gateway-mode-migration` |
| ci/prow/qe-perfscale-payload-control-plane-6nodes | `3bfbb89` | link | true | `/test qe-perfscale-payload-control-plane-6nodes` |
| ci/prow/e2e-gcp-ovn-techpreview | `3bfbb89` | link | true | `/test e2e-gcp-ovn-techpreview` |
| ci/prow/e2e-azure-ovn-upgrade | `3bfbb89` | link | true | `/test e2e-azure-ovn-upgrade` |
| ci/prow/4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade-ipsec | `3bfbb89` | link | false | `/test 4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade-ipsec` |
| ci/prow/e2e-aws-ovn-upgrade-local-gateway | `3bfbb89` | link | true | `/test e2e-aws-ovn-upgrade-local-gateway` |
| ci/prow/security | `3bfbb89` | link | false | `/test security` |
| ci/prow/lint | `3bfbb89` | link | true | `/test lint` |
| ci/prow/e2e-aws-ovn-edge-zones | `3bfbb89` | link | true | `/test e2e-aws-ovn-edge-zones` |
| ci/prow/e2e-aws-ovn-upgrade | `3bfbb89` | link | true | `/test e2e-aws-ovn-upgrade` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pperiyasamy added 3 commits October 8, 2025 09:40

Add test for simultaneous EIP failover and ovnkube-controller restart

f592dc2

Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit e42cf17c6f4e6a0b902781b0fbacbc3de2a9c01a)

openshift-ci bot requested review from kyrtapz and tssurya October 8, 2025 07:46

Remove Stale SNAT and LRP nexthops for remote zone pods

847333f

Signed-off-by: Periyasamy Palanisamy <[email protected]>

pperiyasamy force-pushed the simultaneous_eip_failover_ovnkubenode_restart branch from f611325 to 847333f Compare October 10, 2025 11:19

openshift-ci bot assigned huiran0826 Oct 10, 2025

martinkennelly added 4 commits October 14, 2025 16:50

OVN EIP: add UT for syncing next hops for v4/v6

ca42580

Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit bc61618e6047fa336fccbb747c968cfc7e054c2c)

OVN EIP: fix printing of stale next hops value

3bfbb89

Prior to this change, we did not emit an error log for stale next hops. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 62e67e77b24b7ec1d24cd5cd2ed994239d6051a8)
