martinkennelly
Contributor

/hold

Depends on #2775

Conflicts documented in each commit.
Requires Surya ack + QE premerge ack.

Currently, the trap forces an exit before the background processes
can end. The container is removed and the orphaned processes end
early, which leaves our config in an unknown state because we
don't shut down in an orderly manner.

Wait until the pid file for ovnkube controller with node is removed,
which shows the process has completed.

Conflict (from previous commit -> this commit):

'check_ovn_daemonset_version "1.1.0"' -> 'check_ovn_daemonset_version "1.0.0"'

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit 8b29419)
(cherry picked from commit d65ec5c)
(cherry picked from commit d3ae338)
(cherry picked from commit 7057948)

Prevent ovn-controller from sending stale GARPs by adding drop flows
on the external bridge patch ports until ovnkube-controller
synchronizes the southbound database. These flows are referred to
below as "drop flows".

This addresses race conditions where ovn-controller processes outdated
SB DB state before ovnkube-controller updates it, particularly affecting
EIP SNAT configurations attached to logical router ports.
Fixes: https://issues.redhat.com/browse/FDP-1537

ovnkube-controller controls the lifecycle of the drop flows.
ovs / ovn-controller must be running to configure the external bridge.
Downstream, the external bridge may be precreated, and ovn-controller
will use it.

This fix considers three primary scenarios: node, container and pod restart.

On node restart, the OVS flows installed prior to the reboot are
cleared but the external bridge still exists. Add the flows before
ovnkube controller with node starts. The reason to add them here is
that our gateway code depends on ovn-controller being started and
running.
There is now a race here between ovn-controller starting (and
GARPing) and us setting the flows. I think the risk is low, but it
needs serious testing. I did not add the drop flows before
ovn-controller starts because I have no way to detect whether it is
a node reboot or a pod restart, and I don't want to inject drop flows
for a simple ovn-controller container restart, which could disrupt
traffic.
When ovnkube-controller starts, we create a new gateway and apply the
same flows in order to ensure we always drop GARPs while
ovnkube-controller hasn't synced.
Remove the flows when ovnkube-controller has synced. There is also a
race here between ovnkube-controller removing the flows and
ovn-controller GARPing with stale SB DB info. There is no easy way to
detect what SB DB data ovn-controller has consumed.

On pod restart, we add the drop flows before exit.
ovnkube-controller-with-node will also add them before it starts the
Go code.

Container restart:
- ovnkube-controller: adds flows upon start and exit
- ovn-controller: no changes

While the drop flows are set, OVN may not be able to resolve IPs
it doesn't know about in its Logical Router pipelines generation. Following
removal of the drop flows, OVN may resolve the IPs using GARP requests.

OVN-Controller always sends out GARPs with op code 1
on startup.
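
For illustration, a drop flow of the kind described above could be rendered as an ovs-ofctl style flow string. This is a sketch under assumptions, not the flow this PR installs: the cookie, the priority, the port number, and the helper name garpDropFlow are hypothetical. Only the match on ARP op code 1 arriving on a patch port and the drop action come from the description.

```go
package main

import "fmt"

// garpDropFlow builds an ovs-ofctl style flow that drops ARP packets with
// op code 1 (request) received on the given OpenFlow port.
// Hypothetical helper: cookie and priority are illustrative values only.
func garpDropFlow(cookie uint64, priority int, ofPortPatch string) string {
	return fmt.Sprintf(
		"cookie=0x%x,priority=%d,in_port=%s,arp,arp_op=1,actions=drop",
		cookie, priority, ofPortPatch,
	)
}

func main() {
	// Assumed values: cookie 0x305, priority 499, patch port number 5.
	flow := garpDropFlow(0x305, 499, "5")
	fmt.Println(flow)
}
```

Tagging the flow with a dedicated cookie makes it possible to find and delete exactly these flows later, without touching the rest of the bridge's flow table.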

Conflicts:
A lot of conflicts due to code movement but little
actual code change.

From nodetypes to Node pkg:
"GarpCookie" moved to a new file and was renamed.
It no longer needs to be exported, so it now
starts with a lower-case g.

From bridgeflows.go to gateway_shared_intf.go,
I moved several pieces of code.
I also needed to change the netConfig public methods
OfPortPatch and PatchPort to the private methods
ofPortPatch and patchPort.

Within gateway_shared_intf.go func
flowsForDefaultBridge, I needed to
call the method "dropGarp" attached to the fn
arg "bridge".

I also needed to cast defaultNodeNetworkController,
which is of the BaseNetworkController interface type, to
DefaultNodeNetworkController to access the
Gateway interface. defaultNodeNetworkController
is set only in initDefaultNodeNetworkController
and is always of type DefaultNodeNetworkController.
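
The cast described above is a Go type assertion from an interface to its concrete type. A minimal sketch follows, using simplified stand-in types: the type names come from this commit message, while the method sets, the Gateway field, and the helper gatewayOf are assumptions made for the example.

```go
package main

import "fmt"

// Simplified stand-ins for the real types; their methods are assumptions.
type BaseNetworkController interface {
	Start() error
}

type Gateway interface {
	Reconcile() error
}

type DefaultNodeNetworkController struct {
	Gateway Gateway
}

func (c *DefaultNodeNetworkController) Start() error { return nil }

type noopGateway struct{}

func (noopGateway) Reconcile() error { return nil }

// otherController exists only to show the failure path of the assertion.
type otherController struct{}

func (otherController) Start() error { return nil }

// gatewayOf asserts the interface value down to the concrete controller type
// and returns its gateway. The comma-ok form avoids a panic on a mismatch.
func gatewayOf(c BaseNetworkController) (Gateway, error) {
	dnnc, ok := c.(*DefaultNodeNetworkController)
	if !ok {
		return nil, fmt.Errorf("unexpected controller type %T", c)
	}
	return dnnc.Gateway, nil
}

func main() {
	var c BaseNetworkController = &DefaultNodeNetworkController{Gateway: noopGateway{}}
	gw, err := gatewayOf(c)
	fmt.Println(gw != nil, err)

	_, err = gatewayOf(otherController{})
	fmt.Println(err)
}
```

Even when the value is only ever set to one concrete type, the comma-ok form keeps a future refactor from turning into a runtime panic.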

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit 82fc3bf)
(cherry picked from commit 50a94e1)
(cherry picked from commit 37dd4e2)
(cherry picked from commit 5b53803)

PR 5373 (drop GARP flows) didn't consider that we set the
default network controller first and only set the gateway
obj later. In between, ovnkube node may receive a stop
signal, and we do not guard against accessing the gateway
if it's not yet set.

OVNKube controller may have synced before the gateway
obj is set.

There is nothing to reconcile if the gateway is not set.
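
The guard described above amounts to a nil check before touching the gateway. A minimal sketch, assuming a hypothetical nodeController type with a gw field that stays nil until gateway initialization completes; none of these names are taken from the real code.

```go
package main

import "fmt"

// gateway is a hypothetical stand-in for the real Gateway interface.
type gateway interface {
	Reconcile() error
}

// nodeController is a hypothetical stand-in; gw is nil until it is set.
type nodeController struct {
	gw gateway
}

// reconcileGateway reports whether a reconcile was attempted. When the
// gateway is not yet set there is nothing to reconcile, so it returns
// (false, nil) instead of dereferencing a nil gateway.
func (c *nodeController) reconcileGateway() (bool, error) {
	if c.gw == nil {
		return false, nil
	}
	return true, c.gw.Reconcile()
}

// fakeGateway counts calls so the example can show when it was reached.
type fakeGateway struct{ calls int }

func (g *fakeGateway) Reconcile() error {
	g.calls++
	return nil
}

func main() {
	c := &nodeController{}

	// Stop signal or sync arrives before the gateway is set: a safe no-op.
	done, err := c.reconcileGateway()
	fmt.Println(done, err)

	// After the gateway is set, the reconcile goes through.
	g := &fakeGateway{}
	c.gw = g
	done, err = c.reconcileGateway()
	fmt.Println(done, err, g.calls)
}
```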

Conflict:
Needed to cast defaultNodeNetworkController from the interface
BaseNetworkController to the concrete type DefaultNodeNetworkController.
We can be sure it's only set to this type because
it's set in one location, in func initDefaultNodeNetworkController.

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit e60220a)
(cherry picked from commit a7869b2)
(cherry picked from commit 2ac68e4)
(cherry picked from commit 3b039fe)

Ensure ovn-controller has processed the SB DB updates before
removing the GARP drop flows by utilizing the hv_cfg field
in NB_Global [1].

OVNKube controller increments the nb_cfg value post sync, which is copied
to SB DB by northd. OVN-Controllers copy this nb_cfg value from SB DB
and write it to the nb_cfg field of their Chassis_Private record after
they have processed the SB DB changes. Northd will then look
at the nb_cfg value of all the Chassis_Private records and set the
NB DB's NB_Global hv_cfg value to the minimum integer found.

Since IC currently only supports one node per zone, we
can be sure ovn-controller is running locally, and therefore
it's OK to block removing the drop GARP flows.
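
The handshake described above can be modeled in a few lines. This is a toy model of the arithmetic only, with no OVSDB access; the function names hvCfg and safeToRemoveDropFlows and the sample values are assumptions made for the example.

```go
package main

import "fmt"

// hvCfg models what northd is described as doing: take the minimum nb_cfg
// reported across all Chassis_Private records. The second return value is
// false when there are no chassis to report.
func hvCfg(chassisNbCfg []int64) (int64, bool) {
	if len(chassisNbCfg) == 0 {
		return 0, false
	}
	lowest := chassisNbCfg[0]
	for _, v := range chassisNbCfg[1:] {
		if v < lowest {
			lowest = v
		}
	}
	return lowest, true
}

// safeToRemoveDropFlows reports whether the observed hv_cfg has caught up
// with the nb_cfg value written after the controller synced.
func safeToRemoveDropFlows(targetNbCfg, observedHvCfg int64) bool {
	return observedHvCfg >= targetNbCfg
}

func main() {
	// The controller bumps nb_cfg to 6 after sync (assumed value).
	const target int64 = 6

	// The local ovn-controller has only processed up to 5 so far.
	observed, _ := hvCfg([]int64{5})
	fmt.Println(safeToRemoveDropFlows(target, observed))

	// Once it writes 6 back, hv_cfg catches up and the flows can go.
	observed, _ = hvCfg([]int64{6})
	fmt.Println(safeToRemoveDropFlows(target, observed))
}
```

With one node per zone the minimum is taken over a single chassis, which is why waiting on hv_cfg is equivalent to waiting on the local ovn-controller.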

[1] https://man7.org/linux/man-pages/man5/ovn-nb.5.html

Signed-off-by: Martin Kennelly <[email protected]>
(cherry picked from commit 3b5da01)
(cherry picked from commit a4776fb)
(cherry picked from commit f7c67b7)
(cherry picked from commit 2396130)
@openshift-ci-robot added the following labels on Oct 15, 2025:

  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.

@openshift-ci-robot
Contributor

@martinkennelly: This pull request references Jira Issue OCPBUGS-63154, which is invalid:

  • expected dependent Jira Issue OCPBUGS-62671 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is New instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/hold

Depends on #2775

Conflicts documented in each commit.
Requires Surya ack + QE premerge ack.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot added the do-not-merge/hold label on Oct 15, 2025. The label indicates that a PR should not merge because someone has issued a /hold command.
@openshift-ci bot requested review from jcaamano and tssurya on October 15, 2025 15:46

@openshift-ci
Contributor

openshift-ci bot commented Oct 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: martinkennelly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Oct 16, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 4 job(s) of type blocking for the ci release of OCP 4.17

  • periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9edb31f0-aa7b-11f0-9c89-1e6f781e075b-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 16, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 8 job(s) of type blocking for the nightly release of OCP 4.17

  • periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b42b47c0-aa7b-11f0-800b-f0a8e74790a8-0

@martinkennelly
Contributor Author

/test e2e-aws-ovn

Building the image failed; it seemed to be stuck pulling an image:

[1/2] STEP 1/5: FROM quay-proxy.ci.openshift.org/openshift/ci@sha256:fa4be59cf5b0506d06ad2dbd690b0066f9cb0ef0d1ea3bb3addf0104701bf332 AS builder
Trying to pull quay-proxy.ci.openshift.org/openshift/ci@sha256:fa4be59cf5b0506d06ad2dbd690b0066f9cb0ef0d1ea3bb3addf0104701bf332...
INFO[2025-10-15T17:13:09Z] Build ovn-kubernetes-amd64 succeeded after 28m14s 
INFO[2025-10-15T17:13:09Z] Retrieving digests of member images          
INFO[2025-10-15T17:13:18Z] Image ci-op-gbvy5vzg/pipeline:ovn-kubernetes created  digest=sha256:a2f3f02cd99487f728e3dd1ad55e22a6459594bedec1f23077dc0a9f0f1155f9 for-build=ovn-kubernetes
INFO[2025-10-15T17:13:18Z] Tagging ovn-kubernetes into stable           
INFO[2025-10-15T17:13:18Z] Ran for 1h27m58s                             
ERRO[2025-10-15T17:13:18Z] Some steps failed:                           
ERRO[2025-10-15T17:13:18Z] 

@martinkennelly
Contributor Author

/retest

All the tests failed the same way as described in the previous comment. They seem to be stuck pulling an image, although no error actually says so. The build timed out because it hit the limit, so I am just guessing.

@martinkennelly
Contributor Author

/test e2e-aws-ovn-hypershift

Known issue today with installing.


: hosted cluster version rollout succeeds	0s
{hosted cluster version rollout never completed  
      
error: hosted cluster version rollout never completed, dumping relevant hosted cluster condition messages
Degraded: The hosted cluster is not degraded
ClusterVersionSucceeding: Cluster operators console, dns, image-registry, ingress, insights, kube-storage-version-migrator, monitoring, node-tuning, openshift-samples, service-ca, storage are not available

@openshift-ci
Contributor

openshift-ci bot commented Oct 16, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 8 job(s) of type blocking for the nightly release of OCP 4.17

  • periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0098b1c0-aaa8-11f0-8214-eeb08d09becf-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 16, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 4 job(s) of type blocking for the ci release of OCP 4.17

  • periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/415df670-aaa8-11f0-9746-b80293902535-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a3d6070-ab37-11f0-9e94-db54b8692762-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ec788e00-ab37-11f0-9d91-c753c0c5439c-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f6b4cb40-ab37-11f0-8bc3-e0a4a264d89b-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/fe626e60-ab37-11f0-940f-5851f796847f-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/04b08180-ab38-11f0-804b-dbfc4243f7b3-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2025

@martinkennelly: This PR was included in a payload test run from openshift/machine-config-operator#5358
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0b473660-ab38-11f0-867c-fad12314f178-0

@martinkennelly
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label on Oct 17, 2025. The label indicates that a referenced Jira bug is valid for the branch this PR is targeting.
@openshift-ci-robot removed the jira/invalid-bug label on Oct 17, 2025. The label indicates that a referenced Jira bug is invalid for the branch this PR is targeting.

@openshift-ci-robot
Contributor

@martinkennelly: This pull request references Jira Issue OCPBUGS-63154, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.z) matches configured target version for branch (4.17.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-62671 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-62671 targets the "4.18.z" version, which is one of the valid target versions: 4.18.0, 4.18.z
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

@jechen0648
Contributor

/verified by 'pre-merge testing'

@openshift-ci-robot added the verified label on Oct 17, 2025. The label signifies that the PR passed pre-merge verification criteria.

@openshift-ci-robot
Contributor

@jechen0648: This PR has been marked as verified by 'pre-merge testing'.

In response to this:

/verified by 'pre-merge testing'

@jechen0648
Contributor

/test e2e-aws-ovn-hypershift

@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2025

@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                       Commit   Details  Required  Rerun command
ci/prow/security                f22003c  link     false     /test security
ci/prow/e2e-aws-ovn-hypershift  f22003c  link     true      /test e2e-aws-ovn-hypershift

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • verified: Signifies that the PR passed pre-merge verification criteria
