[DNM] Simultaneous eip failover ovnkube-node restart #2788
base: master
Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit e42cf17c6f4e6a0b902781b0fbacbc3de2a9c01a)
Previously, when syncing EgressIP objects, the podAssignment cache was initialized with empty egressStatuses for assigned pods. This commit ensures that the pod state is updated with the correct EIP status during the sync process. Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit 2c8dc6f3fdd6e8cac9085ad063181594ededd767)
When an ovnkube-controller restart and an EgressIP (EIP) failover occur at the same time, a race condition can leave the informer cache empty for the EgressIP watcher. As a result, only the Add event for the EIP with the newly assigned node is triggered, and the controller fails to update the SNAT and LRP configuration for the previously assigned node. This commit leverages the egressStatuses stored in the podAssignment cache to reconcile and update those stale entries correctly. Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit a69e7ceb47baafe47c82264936afab53ccb3f92b)
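The reconciliation idea in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: the `egressStatus` type and the `staleStatuses` helper are hypothetical names, and the sketch only assumes that the statuses remembered in the pod assignment cache can be compared against the statuses currently reported on the EgressIP object. Entries that are cached but no longer current point at the previously assigned node, whose SNAT and LRP configuration would need cleaning up.

```go
package main

import "fmt"

// egressStatus is a hypothetical stand-in for one EgressIP status item:
// the node hosting the EIP and the EIP address itself.
type egressStatus struct {
	Node     string
	EgressIP string
}

// staleStatuses returns the cached entries that are absent from the
// current statuses. These are the candidates for SNAT/LRP cleanup.
func staleStatuses(cached, current []egressStatus) []egressStatus {
	live := make(map[egressStatus]struct{}, len(current))
	for _, s := range current {
		live[s] = struct{}{}
	}
	var stale []egressStatus
	for _, s := range cached {
		if _, ok := live[s]; !ok {
			stale = append(stale, s)
		}
	}
	return stale
}

func main() {
	// The cache still remembers node-a, but the EIP has failed over to node-b.
	cached := []egressStatus{{Node: "node-a", EgressIP: "192.0.2.10"}}
	current := []egressStatus{{Node: "node-b", EgressIP: "192.0.2.10"}}
	for _, s := range staleStatuses(cached, current) {
		fmt.Printf("clean up SNAT/LRP next hop for %s on %s\n", s.EgressIP, s.Node)
	}
}
```

Without a cached copy of the old statuses, a lone Add event carries no record of the previous node, which is why the sketch takes the cache as an explicit input.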
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pperiyasamy
Signed-off-by: Periyasamy Palanisamy <[email protected]>
f611325 to 847333f
/assign @huiran0826
…odes Previously, we checked whether the next hop IP was valid for the current set of nodes, but every EgressIP is assigned to only a subset of the total nodes. Stale LRPs could occur if a node hosted EIP pods, ovnkube-controller was down, and the EIP moved to a new node while said controller was down. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit dad551464861df3906db09fb45ee14a12b9ce755)
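A minimal sketch of the narrower check described here, under stated assumptions: `splitNextHops` and the `nodeNextHop` map (node name to next hop IP) are hypothetical and are not taken from this PR. The point it illustrates is that a next hop is judged against the nodes the specific EgressIP is assigned to, so a next hop belonging to a healthy node that no longer hosts the EIP is still reported as stale.

```go
package main

import "fmt"

// splitNextHops partitions an LRP's next hops into valid and stale,
// judged against the nodes this EgressIP is assigned to rather than
// against every node in the cluster.
func splitNextHops(lrpNextHops, assignedNodes []string, nodeNextHop map[string]string) (valid, stale []string) {
	allowed := make(map[string]struct{}, len(assignedNodes))
	for _, node := range assignedNodes {
		if ip, ok := nodeNextHop[node]; ok {
			allowed[ip] = struct{}{}
		}
	}
	for _, hop := range lrpNextHops {
		if _, ok := allowed[hop]; ok {
			valid = append(valid, hop)
		} else {
			stale = append(stale, hop)
		}
	}
	return valid, stale
}

func main() {
	// Both nodes exist, but the EIP is now assigned only to node-b.
	nodeNextHop := map[string]string{"node-a": "100.64.0.2", "node-b": "100.64.0.3"}
	valid, stale := splitNextHops(
		[]string{"100.64.0.2", "100.64.0.3"},
		[]string{"node-b"},
		nodeNextHop,
	)
	fmt.Println("valid:", valid)
	fmt.Println("stale:", stale)
}
```

Checking against all cluster nodes would have accepted node-a's next hop in this example, which is the stale LRP case the commit message describes.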
For IC mode, there is no expectation that we can fetch a remote node's LSP. Skipping (continue) in that case therefore causes us to skip generating valid next hops for the remote node. Later, when syncing LRPs, a valid next hop is inspected, is not found in the valid set, and is removed. Handlers re-add it shortly afterwards. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 33dcfdca2b000370791112d76b2ef47e580e5ca5)
Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit bc61618e6047fa336fccbb747c968cfc7e054c2c)
Prior to this change, we did not emit an error log for stale next hops. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 62e67e77b24b7ec1d24cd5cd2ed994239d6051a8)
@pperiyasamy: The following tests failed, say
To test the stale SNAT and LRP next hop fixes when an EIP failover and an ovnkube-node restart happen around the same time.
cc @jechen0648