OCPBUGS-61865, OCPBUGS-62636, OCPBUGS-59552: DownStream Merge [10-19-2025] #2817
Conversation
It had not been doing anything since 2020. Signed-off-by: Ihar Hrachyshka <[email protected]>
Signed-off-by: Ihar Hrachyshka <[email protected]>
Signed-off-by: Ihar Hrachyshka <[email protected]> Assisted-By: Claude Code; claude-sonnet-4-20250514
Replace master with base branch to make it work on release branches. Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: zhaozhanqi <[email protected]>
…tack cluster Signed-off-by: zhaozhanqi <[email protected]>
Addresses incorrect DNAT rules with <proto>/0 target port when using services with externalTrafficPolicy: Local and named ports. The issue occurred when allocateLoadBalancerNodePorts was false and services referenced pod named ports. The previous implementation used svcPort.TargetPort.IntValue(), which returns 0 for named ports, causing invalid DNAT rules.

This refactoring introduces and uses structured endpoint types that properly handle port mapping from endpoint slices, ensuring the actual pod port numbers are used instead of attempting to convert named ports to integers.

This change unifies endpoint processing logic by having both the services controller and nodePortWatcher use the same GetEndpointsForService function. This ensures consistent endpoint resolution and port-mapping behavior across all service-related components, preventing divergence in logic and similar unnoticed port-handling issues in the future.

Signed-off-by: Andreas Karis <[email protected]>
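The core of the bug and fix can be sketched with a simplified model. The `IntOrString` type and `resolveTargetPort` helper below are illustrative stand-ins (not the actual ovn-kubernetes or `k8s.io/apimachinery` code): a named target port has no numeric value of its own, so it must be resolved through the endpoint-slice port entries.

```go
package main

import "fmt"

// IntOrString is a simplified stand-in for Kubernetes' intstr.IntOrString:
// a service targetPort is either a number or a name.
type IntOrString struct {
	IntVal int32
	StrVal string
	IsInt  bool
}

// IntValue mirrors the behavior that caused the bug: for a named
// (string) port it returns 0, which produced the bad "tcp/0" DNAT targets.
func (p IntOrString) IntValue() int {
	if p.IsInt {
		return int(p.IntVal)
	}
	return 0 // named port: no numeric value available here
}

// endpointPort models an EndpointSlice port entry, which carries the
// resolved pod port number for a named port.
type endpointPort struct {
	Name string
	Port int32
}

// resolveTargetPort returns the numeric pod port for a service target
// port, consulting the endpoint slice entries when the port is named.
func resolveTargetPort(target IntOrString, eps []endpointPort) (int32, bool) {
	if target.IsInt {
		return target.IntVal, true
	}
	for _, ep := range eps {
		if ep.Name == target.StrVal {
			return ep.Port, true
		}
	}
	return 0, false // unresolved named port: caller must not emit a DNAT rule
}

func main() {
	named := IntOrString{StrVal: "http"}
	eps := []endpointPort{{Name: "http", Port: 8080}}

	fmt.Println(named.IntValue()) // 0: the buggy path
	port, ok := resolveTargetPort(named, eps)
	fmt.Println(port, ok) // 8080 true: the fixed path
}
```

Routing both the services controller and nodePortWatcher through one resolution function of this shape is what prevents the two code paths from diverging again.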
Adds tests for loadBalancer services with named ports and AllocateLoadBalancerNodePorts=False. Add new test cases in Test_getEndpointsForService. Signed-off-by: Andreas Karis <[email protected]>
Signed-off-by: Andreas Karis <[email protected]>
E2E test "Allow connection to an external IP using a source port that is equal to a node port" might flake if a service is already created with the same nodePort number. Give it a chance to recover by selecting a different port. Signed-off-by: Andreas Karis <[email protected]>
Node taints for too-small MTU were removed in #3004. Taints for NoSchedule were removed in openshift#2459. In general, it is not the CNI plugin's responsibility to set node taints; that is for the kubelet/container runtime to figure out. Therefore, it is safe to remove this unused code, since it won't be required in the future. Signed-off-by: Dave Tucker <[email protected]>
While trying to reproduce flakes with these tests, this is the thing I could reproduce easily. In the tests we add 20 target IPs to each gateway, then we ping them to make sure they go to each gateway and get resolved. However, for the TCP/UDP tests, we only run a listener on one of the target IPs. Then we would attempt to contact the listener from a source pod 20 times and check that both gateways were hit. In my testing, I can easily run these tests in a loop and see them fail, due to all 20 attempts hashing to the same gateway and never hitting the other gateway. I bumped the attempt count to 50, ran it all night, and no longer see the issue. Not sure if this fixes all of the flakes we see with these tests, as the logs have gone stale for other runs, but I will consider this closed for now and reopen if we see more flakes. Closes: #4432 Signed-off-by: Tim Rozet <[email protected]>
Unskip skipped cases, as the bug is verified
When OVS runs as a system service on the node, /run/openvswitch/ovs-vswitchd.pid is locked by ovs-vswitchd with its PID in the host process ID namespace:

```
$ lslocks | grep ovs-vswitchd.pid
COMMAND PID TYPE SIZE MODE M START END PATH
ovs-vswitchd 1615 POSIX 5B WRITE 0 0 0 /run/openvswitch/ovs-vswitchd.pid
$ stat -Lc '%d:%i %n' /run/openvswitch/ovs-vswitchd.pid
25:5398 /run/openvswitch/ovs-vswitchd.pid
```

In the ovnkube-node Pod, if hostPID is false, ovs-vswitchd's PID is not visible inside the Pod's process ID namespace, so the file lock becomes invisible as well, which causes ovs-appctl to fail:

```
$ ovs-appctl fdb/show br-int
2025-10-14T19:18:36Z|00001|daemon_unix|WARN|/var/run/openvswitch/ovs-vswitchd.pid: stale pidfile for pid 1615 being deleted by pid 0
ovs-appctl: cannot read pidfile "/var/run/openvswitch/ovs-vswitchd.pid" (No such process)
command terminated with exit code 1
$ stat -Lc '%d:%i %n' /run/openvswitch/ovs-vswitchd.pid
25:5398 /run/openvswitch/ovs-vswitchd.pid
```

This change replaces RunOVSAppctl() with RunOvsVswitchdAppCtl(), which uses the `-t /var/run/openvswitch/ovs-vswitchd.1234.ctl` option to skip reading the pid file.

Signed-off-by: Lei Huang <[email protected]>
The external gateway tests use default BFD timers, which in OVN means a send frequency of every 1 second with a maximum of 3 failures, or 3 seconds total. The tests would remove an external gateway, wait 3 seconds, and then send a packet from a pod client. We notice in upstream CI that this sometimes flakes on the first attempt and causes the test case to fail. I cannot reproduce this locally, but we can see that the math is wrong here: if the external gateway was deleted at the same time that a heartbeat was sent and acked by OVN, then it would require almost 4 seconds to detect 3 more failures and transition BFD down. Therefore, make the timeout a constant and bump it to 4 seconds. Signed-off-by: Tim Rozet <[email protected]>
Get the latest changes from [1]. There are some improvements, but it is supposed to work the same (if not better). [1] ovn-kubernetes/kubernetes-traffic-flow-tests@ce924ee Signed-off-by: Thomas Haller <[email protected]>
The test validates LoadBalancer services with:
- Named targetPorts (http/udp) instead of numeric ports
- AllocateLoadBalancerNodePorts=false configuration
- ExternalTrafficPolicy=Local behavior

Signed-off-by: Andreas Karis <[email protected]>
[th/tft-update] traffic-flow-tests: update to latest version of k8s-tft
RunOVSAppctl() doesn't work when OVS runs on the host and hostPID is false
External Gateway E2E: Increase single target attempts
fix: --logfile-maxsize is in megabytes, not bytes
chore: Remove SetTaintOnNode
chore: Remove --pod-ip option
I accidentally removed the check in a recent PR [1], which could have performance consequences, as checking against other pods has a cost. Reintroduce the check with a hopefully useful comment to prevent it from happening again. [1] ovn-kubernetes/ovn-kubernetes#5626 Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Enable ovn-ci workflow on release branches
OCPBUGS-59552: Referencing pod named ports within a service results in bad DNAT rules containing tcp/0 target port
fix: list allowed values for --platform-type option
When processing pods during an EgressIP status update, the controller used to stop iterating as soon as it encountered a pod in the Pending state (in my case, pod IPs are not found when the pod is Pending with a ContainerCreating status). This caused any subsequent Running pods to be skipped, leaving their SNAT entries unprogrammed on the egress node. With this change, only Pending pods are skipped, while iteration continues for the rest. This ensures that Running pods are properly processed and their SNAT entries are programmed. This change also skips pods that are unscheduled or use host networking. Signed-off-by: Periyasamy Palanisamy <[email protected]>
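The change in loop behavior can be sketched as follows (the pod type and field names are illustrative, not the controller's actual code): skippable pods get a `continue` instead of aborting the whole iteration.

```go
package main

import "fmt"

// pod is a simplified stand-in for the controller's pod view.
type pod struct {
	Name        string
	Phase       string // "Pending", "Running", ...
	HostNetwork bool
	NodeName    string // empty if unscheduled
}

// processEgressIPPods illustrates the fix: Pending, unscheduled, and
// host-network pods are skipped with "continue" so that later Running
// pods still get their SNAT entries programmed. The previous behavior
// effectively stopped iterating at the first Pending pod.
func processEgressIPPods(pods []pod) []string {
	var programmed []string
	for _, p := range pods {
		if p.Phase == "Pending" || p.NodeName == "" || p.HostNetwork {
			continue // skip this pod only; keep processing the rest
		}
		programmed = append(programmed, p.Name)
	}
	return programmed
}

func main() {
	pods := []pod{
		{Name: "a", Phase: "Pending"}, // previously this stopped the loop
		{Name: "b", Phase: "Running", NodeName: "node1"},
		{Name: "c", Phase: "Running", NodeName: "node1", HostNetwork: true},
	}
	fmt.Println(processEgressIPPods(pods)) // only "b" is programmed
}
```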
Signed-off-by: Nadia Pinaeva <[email protected]>
[okep: layer2 router topology] Add clarification for joinIP routes.
@jluhrsen: trigger 5 job(s) of type blocking for the ci release of OCP 4.20
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/82955dc0-aea1-11f0-807e-8628fab62aec-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.20
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/82955dc0-aea1-11f0-807e-8628fab62aec-1
/retest

/retest
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b35ce7c0-aefb-11f0-88e8-89ab45ab2f04-0
/verified by @Meina-rh

@Meina-rh: This PR has been marked as verified.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw broken due to https://issues.redhat.com/browse/OCPBUGS-63027

/override ci/prow/lint
@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/lint
/retest
/retest-required
4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade fails with the below error. Running it again.
Same issue on retry. Will retest one more time: /retest

This time the e2e got off the ground, but something weird with OAuth failed. Assuming it's not related to us, so, sigh, will retry again: /retest

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcaamano, pperiyasamy

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/override ci/prow/lint

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw, ci/prow/lint
/override ci/prow/lint

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/lint
@pperiyasamy: The following tests failed, say
/override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw
Merged 7dd6e74 into openshift:master
@pperiyasamy: Jira Issue Verification Checks:
- Jira Issue OCPBUGS-61865 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-62636 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-59552 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
No description provided.