@rfredette
Contributor

No description provided.

@openshift-ci openshift-ci bot requested review from Miciah and Thealisyed March 24, 2025 18:44
@candita
Contributor

candita commented Mar 26, 2025

/assign @alebedev87

@rfredette
Contributor Author

test failures appear unrelated.
/retest

@rfredette rfredette force-pushed the ocpbugs-53424-multiple-installplans branch 3 times, most recently from ede8707 to 1bdc5af Compare March 28, 2025 15:24
@rfredette
Contributor Author

The gateway api suite passed, but I'm going to run it a few more times to check for flakes.
/test e2e-aws-gatewayapi

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2025
@rfredette rfredette force-pushed the ocpbugs-53424-multiple-installplans branch from 1bdc5af to 928c8ed Compare March 31, 2025 18:55
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2025
Contributor

@alebedev87 alebedev87 left a comment

I was more in favor of approving all install plans of the expected subscription. But I guess approving the latest one (knowing that we watch for installplans) may be enough. Let's see more e2e-operator-techpreview runs.

// - If the Istio and Subscription CRs are deleted, they are recreated
// automatically.
func testGatewayAPIIstioInstallation(t *testing.T) {
gatewayClassName := "openshift-test"
Contributor

This will be the third subtest that creates a gatewayclass with the OpenShift Gateway API controller. I think it's redundant in the current setup, where a single test runs them all, but I can understand the idea of being able to run each test separately. In that context, I'm wondering: why not call this gatewayclass openshift-default, as we do in the 2 other subtests?


t.Run("testGatewayAPIResources", testGatewayAPIResources)
if gatewayAPIControllerEnabled {
t.Run("testGatewayAPIIstioInstallation", testGatewayAPIIstioInstallation)
Contributor

@alebedev87 alebedev87 Mar 31, 2025

I just realized that testGatewayAPIIstioInstallation deletes the subscription at the end (to see it recreated by the operator). This means that all the checks (OSSM operator, istiod control plane, etc.), which could be useful for detecting problems in the next step, testGatewayAPIObjects (where we wait for the acceptance of the GatewayClass), won't help, because they are made against the previous subscription.

Contributor Author

True, but I think it's still useful to run testGatewayAPIIstioInstallation first. That way, hopefully any errors that would stop testGatewayAPIObjects will also appear in testGatewayAPIIstioInstallation, so even though you'll still get the "waiting for gateway class to be accepted" message, the first failure in the suite will hopefully be the more detailed one.

Contributor

@alebedev87 alebedev87 Apr 2, 2025

> That way, hopefully any errors that would stop testGatewayAPIObjects will also appear in testGatewayAPIIstioInstallation, so even though you'll still get the "waiting for gateway class to be accepted" message, the first failure in the suite will hopefully be the more detailed one.

Well, yes, "hopefully". However, that may not be the case, and the newly re-created Subscription may result in different errors. Also, if testGatewayAPIIstioInstallation does not delete the subscription, the GatewayClass acceptance check in testGatewayAPIObjects will have more time to succeed and will show any timeout problems. We can either a) move the subscription recreation to a dedicated subtest that runs somewhere after testGatewayAPIIstioInstallation and testGatewayAPIObjects, or b) split the reordering of the subtests into a different PR and discuss it further (the CSV check can be kept in this PR).


func assertClusterServiceVersionSucceeded(t *testing.T) error {
t.Helper()
expectedVersion, err := expectedOSSMVersion(t)
Contributor

Would the starting CSV from the subscription do the same thing? We can return the starting CSV from assertSubscription function to avoid going down to the ingress operator's deployment.

Contributor Author

I used the value from the ingress operator deployment because it's the ultimate source of truth for what the operator is looking for, but you are right, the starting CSV reflects the same value, so it's equally valid.

@rfredette rfredette changed the title If multiple install plans exist for the same OSSM version, approve the most recently created one. OCPBUGS-53424: If multiple install plans exist for the same OSSM version, approve the most recently created one. Apr 2, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 2, 2025
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-53424, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@rfredette
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 2, 2025
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-53424, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

@rfredette rfredette force-pushed the ocpbugs-53424-multiple-installplans branch from 928c8ed to 6217df9 Compare April 2, 2025 15:06
@rfredette
Contributor Author

/retest

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

@rhamini3
Contributor

rhamini3 commented Apr 2, 2025

Pre-merge-verified on 4.19.0-0.ci.test-2025-04-02-194605-ci-ln-3sn8dbb-latest

  1. get install plans
NAMESPACE             NAME            CSV                           APPROVAL   APPROVED
openshift-operators   install-8jvkv   servicemeshoperator3.v3.0.0   Manual     true
openshift-operators   install-cxsz6   servicemeshoperator3.v3.0.0   Manual     true

  kind: InstallPlan
  metadata:
    creationTimestamp: "2025-04-02T21:05:09Z"
    generateName: install-
    generation: 2
    name: install-8jvkv
    namespace: openshift-operators

  kind: InstallPlan
  metadata:
    creationTimestamp: "2025-04-02T21:05:10Z"
    generateName: install-
    generation: 2
    labels:
      operators.coreos.com/servicemeshoperator3.openshift-operators: ""
    name: install-cxsz6
    namespace: openshift-operators
    ownerReferences:

  2. check subscription YAML
    installedCSV: servicemeshoperator3.v3.0.0
    installplan:
      apiVersion: operators.coreos.com/v1alpha1
      kind: InstallPlan
      name: install-cxsz6
      uuid: ff0026fe-0a80-4565-9771-45d4493434c1
    lastUpdated: "2025-04-02T21:05:23Z"
    state: AtLatestKnown

The Subscription uses the latest InstallPlan, hence marking as verified.
/label qe-approved
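
The manual check above (compare the creation timestamps of the install plans, then confirm that the subscription status references the newest one) could be sketched in Go roughly as follows. This is only an illustration: `planMeta`, `newestInstallPlan`, and `subscriptionUsesNewest` are hypothetical names, and `planMeta` is a simplified stand-in for the metadata shown in the YAML, not the real OLM InstallPlan type.

```go
package main

import (
	"fmt"
	"time"
)

// planMeta is a simplified, hypothetical stand-in for the install plan
// metadata shown in the YAML above; it is not the real OLM type.
type planMeta struct {
	Name              string
	CreationTimestamp string // RFC 3339, as printed in metadata.creationTimestamp
}

// newestInstallPlan returns the name of the install plan with the latest
// creation timestamp. It returns an error if the list is empty or if a
// timestamp cannot be parsed.
func newestInstallPlan(plans []planMeta) (string, error) {
	if len(plans) == 0 {
		return "", fmt.Errorf("no install plans given")
	}
	var newestName string
	var newestTime time.Time
	for i, p := range plans {
		ts, err := time.Parse(time.RFC3339, p.CreationTimestamp)
		if err != nil {
			return "", fmt.Errorf("install plan %q: %w", p.Name, err)
		}
		if i == 0 || ts.After(newestTime) {
			newestName, newestTime = p.Name, ts
		}
	}
	return newestName, nil
}

// subscriptionUsesNewest reports whether the install plan name referenced in
// the subscription status is the most recently created install plan.
func subscriptionUsesNewest(referenced string, plans []planMeta) (bool, error) {
	newest, err := newestInstallPlan(plans)
	if err != nil {
		return false, err
	}
	return referenced == newest, nil
}
```

With the two install plans listed above, `subscriptionUsesNewest("install-cxsz6", plans)` reports true, because install-cxsz6 was created one second after install-8jvkv.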

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 2, 2025
@alebedev87
Contributor

The must-gather of the failed e2e-aws-gatewayapi job shows that all 5(!) installplans were approved (the fix is working). However, the CSV didn't come up due to an error (which can be seen in the subscription):

                    {
                        "message": "error ensuring InstallPlan: Operation cannot be fulfilled on installplans.operators.coreos.com \"install-29m2r\": the object has been modified; please apply your changes to the latest version and try again",
                        "reason": "EnsureInstallPlanFailed",
                        "status": "True",
                        "type": "BundleUnpackFailed"
                    }

@alebedev87
Contributor

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

@lihongan
Contributor

lihongan commented Apr 2, 2025

hit the issue in origin e2e test: openshift/origin#29638 (comment)

@alebedev87
Contributor

alebedev87 commented Apr 3, 2025

e2e-aws-operator-techpreview failed in TestAll/serial/TestGatewayAPI/testGatewayAPIObjects because DNSRecord for test-gateway Gateway didn't come up (apparently the LB host was missing for the gateway service):

=== RUN   TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
    gateway_api_test.go:191: Creating namespace "test-e2e-gwapi-z2xb2"...
    gateway_api_test.go:191: Waiting for ServiceAccount test-e2e-gwapi-z2xb2/default to be provisioned...
    gateway_api_test.go:191: Waiting for RoleBinding test-e2e-gwapi-z2xb2/system:image-pullers to be created...
    gateway_api_test.go:507: Making sure the gatewayclass is created and accepted...
    gateway_api_test.go:508: Observed that gatewayclass openshift-default has been accepted: {Conditions:[{Type:Accepted Status:True ObservedGeneration:1 LastTransitionTime:2025-04-03 00:22:31 +0000 UTC Reason:Accepted Message:Handled by Istio controller}] SupportedFeatures:[]}
    gateway_api_test.go:513: Making sure the gateway is created and accepted...
    util_gatewayapi_test.go:584: Found gateway openshift-ingress/test-gateway as Accepted
    gateway_api_test.go:514: Observed that gateway openshift-ingress/test-gateway has been accepted: {Addresses:[] Conditions:[{Type:Accepted Status:True ObservedGeneration:1 LastTransitionTime:2025-04-03 00:23:25 +0000 UTC Reason:Accepted Message:Resource accepted} {Type:Programmed Status:False ObservedGeneration:1 LastTransitionTime:2025-04-03 00:23:25 +0000 UTC Reason:AddressNotUsable Message:Failed to assign to any requested addresses: hostname "test-gateway-openshift-default.openshift-ingress.svc.cluster.local" not found}] Listeners:[{Name:http SupportedKinds:[{Group:0xc0012084e0 Kind:HTTPRoute} {Group:0xc0012084f0 Kind:GRPCRoute}] AttachedRoutes:1 Conditions:[{Type:Accepted Status:True ObservedGeneration:1 LastTransitionTime:2025-04-03 00:23:25 +0000 UTC Reason:Accepted Message:No errors found} {Type:Conflicted Status:False ObservedGeneration:1 LastTransitionTime:2025-04-03 00:23:25 +0000 UTC Reason:NoConflicts Message:No errors found} {Type:Programmed Status:True ObservedGeneration:1 LastTransitionTime:2025-04-03 00:23:25 +0000 UTC Reason:Programmed Message:No errors found} {Type:ResolvedRefs Status:True ObservedGeneration:1 LastTransitionTime:2025-04-03 00:23:25 +0000 UTC Reason:ResolvedRefs Message:No errors found}]}]}
    
. . .

    util_gatewayapi_test.go:906: Failed to get DNSRecord openshift-ingress/test-gateway-5c66bf66d8-wildcard: dnsrecords.ingress.operator.openshift.io "test-gateway-5c66bf66d8-wildcard" not found; retrying...
    util_gatewayapi_test.go:906: Failed to get DNSRecord openshift-ingress/test-gateway-5c66bf66d8-wildcard: client rate limiter Wait returned an error: context deadline exceeded; retrying...
    gateway_api_test.go:201: failed to observe successful status of one or more gateway object/s: context deadline exceeded

I could not find a reason for the missing load balancer status for gateway's service in must-gather.

I think that we can verify the Programmed condition of the Gateway resource as an additional checkpoint in the assertGatewaySuccessful function.
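
A minimal sketch of such a check, assuming a simplified `condition` struct in place of the real `metav1.Condition` entries in the Gateway status; `requireConditionTrue` is a hypothetical helper, not an existing function in the test suite:

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for metav1.Condition,
// keeping only the fields that appear in the log output above.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// requireConditionTrue returns nil if the named condition is present with
// status "True". Otherwise it returns an error that includes the reason and
// message, so that a failing test log explains why.
func requireConditionTrue(conds []condition, condType string) error {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status == "True" {
			return nil
		}
		return fmt.Errorf("condition %s is %s: reason=%s message=%q", condType, c.Status, c.Reason, c.Message)
	}
	return fmt.Errorf("condition %s not found", condType)
}
```

Applied to the gateway status logged above, the Accepted condition would pass, while Programmed (status False, reason AddressNotUsable) would fail early with a descriptive error instead of a later DNSRecord timeout.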

Also, I'm still wondering whether it's a good idea to remove Istio and the Subscription (expecting their recreation) as part of testGatewayAPIIstioInstallation, which now runs before testGatewayAPIObjects. From what I saw on my cluster, the removed servicemeshoperator3 subscription gets recreated so fast that the GatewayClass status doesn't change, so I doubt that the OSSM operator and Istiod get removed along with the subscription.

@rfredette rfredette force-pushed the ocpbugs-53424-multiple-installplans branch 2 times, most recently from 8799e18 to 19ae0a4 Compare April 4, 2025 20:13
@alebedev87
Contributor

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

Contributor

@alebedev87 alebedev87 left a comment

The fix looks good to me, and it passed the e2e tests twice with no problems. Just one last remark about the commit message: I think that we need to highlight the fact that we approve only the install plans which require approval, because I think that was the main workaround, not the fact that we approve the most recent one.
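
The two rules discussed here (consider only install plans that require approval, and among those pick the most recently created one for the expected CSV) can be sketched as follows. The `installPlan` struct, `planToApprove`, and the helpers are hypothetical and simplified; the actual code in this PR may be structured differently. The phase string "RequiresApproval" is the value OLM uses for install plans awaiting manual approval.

```go
package main

import "time"

// phaseRequiresApproval is the OLM install plan phase for plans that are
// waiting for manual approval.
const phaseRequiresApproval = "RequiresApproval"

// installPlan is a hypothetical, simplified stand-in for the OLM InstallPlan type.
type installPlan struct {
	Name     string
	CSVNames []string  // corresponds to spec.clusterServiceVersionNames
	Phase    string    // corresponds to status.phase
	Created  time.Time // corresponds to metadata.creationTimestamp
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input; it
// exists only to keep the example short.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// contains reports whether want is an element of items.
func contains(items []string, want string) bool {
	for _, it := range items {
		if it == want {
			return true
		}
	}
	return false
}

// planToApprove returns the most recently created install plan that targets
// expectedCSV and is in the RequiresApproval phase. The boolean result is
// false if no install plan qualifies.
func planToApprove(plans []installPlan, expectedCSV string) (installPlan, bool) {
	var best installPlan
	found := false
	for _, p := range plans {
		if p.Phase != phaseRequiresApproval || !contains(p.CSVNames, expectedCSV) {
			continue
		}
		if !found || p.Created.After(best.Created) {
			best, found = p, true
		}
	}
	return best, found
}
```

Filtering on the phase first means that plans that are already installing or complete are never touched, which is the point emphasized in the remark above.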

@alebedev87
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2025
@alebedev87
Contributor

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

Never enough of e2e runs ;)

…approving them.

Waiting until that point guarantees that there won't be simultaneous
updates to the install plan, which can cause the install process to
fail.

Also, if multiple install plans exist for the same OSSM version, approve
the most recently created one.

This is part of the fix for OCPBUGS-53424
@rfredette rfredette force-pushed the ocpbugs-53424-multiple-installplans branch from 19ae0a4 to b9d2420 Compare April 7, 2025 14:15
@rfredette
Contributor Author

> I think that we need to highlight the fact that we approve only the install plans which require approval

Agreed. I've updated the commit message.

@rfredette
Contributor Author

The last aws-operator-techpreview run timed out on contacting its test pod:


    util_gatewayapi_test.go:721: GET test-hostname-dj59x.gws.ci-op-lmilz29f-9e7c5.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:721: GET test-hostname-dj59x.gws.ci-op-lmilz29f-9e7c5.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:729: Response headers for most recent request: map[Content-Length:[19] Content-Type:[text/plain] Date:[Mon, 07 Apr 2025 13:28:26 GMT]]
    util_gatewayapi_test.go:730: Reponse body for most recent request: no healthy upstream
    util_gatewayapi_test.go:732: Error connecting to test-hostname-dj59x.gws.ci-op-lmilz29f-9e7c5.origin-ci-int-aws.dev.rhcloud.com: context deadline exceeded 

Is that the issue #1212 is trying to address?

@alebedev87
Contributor

> Is that the issue #1212 is trying to address?

No, #1212 focuses on augmenting the logs and improving the e2e test to detect the missing DNSRecord; the last job failed after the DNSRecord was created.

@alebedev87
Contributor

Thank you for changing the commit message!

/retitle OCPBUGS-53424: Wait for install plans to enter the "Requires Approval" phase before approving them
/lgtm

@openshift-ci openshift-ci bot changed the title OCPBUGS-53424: If multiple install plans exist for the same OSSM version, approve the most recently created one. OCPBUGS-53424: Wait for install plans to enter the "Requires Approval" phase before approving them Apr 7, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 7, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1410fe0 and 2 for PR HEAD b9d2420 in total

@rfredette
Contributor Author

ovn-serial timed out. Otherwise, there were no specific test failures.

/retest

@rfredette
Contributor Author

/test e2e-aws-ovn-serial

@melvinjoseph86

/test e2e-aws-ovn-serial

@Miciah
Contributor

Miciah commented Apr 8, 2025

e2e-aws-ovn-serial failed because of an infrastructure issue:

level=fatal msg=failed to fetch Cluster Infrastructure Variables: failed to fetch dependency of "Cluster Infrastructure Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "origin-ci-int-aws.dev.rhcloud.com": the zone already has record sets for the domain of the cluster: [api.ci-op-hsyq78km-52591.origin-ci-int-aws.dev.rhcloud.com. (A), \052.apps.ci-op-hsyq78km-52591.origin-ci-int-aws.dev.rhcloud.com. (A)] 

I think that CI sometimes leaks a DNS record and then runs into this error when it re-uses the base domain. Shane saw something similar on #1206. He worked around the issue by amending a commit and pushing the change, which caused CI to use a different base domain for the cluster. (He changed a code comment, but even just changing the commit message should suffice.)

Let's see whether just re-running the test solves the issue.

/test e2e-aws-ovn-serial

@Miciah
Contributor

Miciah commented Apr 8, 2025

(If you do need to push an artificial change, my suggestion is to reword the commit message to keep the title line within 50 characters.)

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2025

@rfredette: all tests passed!

@openshift-merge-bot openshift-merge-bot bot merged commit b3bff60 into openshift:master Apr 8, 2025
19 checks passed
@openshift-ci-robot
Contributor

@rfredette: Jira Issue OCPBUGS-53424: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-53424 has been moved to the MODIFIED state.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.20.0-202504081610.p0.gb3bff60.assembly.stream.el9.
All builds following this will include this PR.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

10 participants