
Conversation

@razo7
Member

@razo7 razo7 commented Jan 30, 2025

Why we need this PR

  • Fix the flaky E2E test that fails because the succeeded maintenance event is missing.
  • Extend the verification timeout for existing/non-existing events from 120 sec to 180 sec, since the success event might be emitted only after 2 minutes.

Changes made

  • Extend the events timeout from 120 sec to 180 sec (see the sketch below).
  • Small fixes, including replacing deprecated modules.
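
A minimal sketch of the timeout change, assuming the constants stay where the E2E test code already declares them (the package name and the comment are illustrative, not the actual file):

```go
package e2e

import "time"

const (
	eventInterval = time.Second * 10
	// extended from 120 to 180 seconds: the success event can arrive after
	// the old deadline when pod eviction is slow
	timeout        = time.Second * 180
	testDeployment = "test-deployment"
)
```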

Which issue(s) this PR fixes

Test plan

@openshift-ci
Contributor

openshift-ci bot commented Jan 30, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.17-openshift-e2e

@razo7 razo7 changed the title from "Check for recent event on e2e tests" to "Check for Recent Event on E2E Tests" Jan 30, 2025
@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.17-openshift-e2e

@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.18-openshift-e2e

eventInterval = time.Second * 10
timeout = time.Second * 120
testDeployment = "test-deployment"
// sometimes an event is emitted so quickly that it races with the current time which is
Member


I doubt that quick events are the reason. There should be enough time between creating a CR, and the controller reconciling it and emitting the event. I think it's more likely that the time on the controller pod and the test pod are out of sync...?
Also: what about adding the fix to the waitForEvent() function only, instead of everywhere it's called? When we have it at one place only, we also don't need this const with that weird name ;)

Member Author


I think it's more likely that the time on the controller pod and the test pod are out of sync?

I have checked it locally with the following commands and on an AWS cluster bot cluster, and there is no significant clock skew between them.

➜ oc exec -it node-maintenance-operator-controller-manager-5bcf79f46-wbpn4 -n openshift-workload-availability -- date -u
Thu Jan 30 15:55:08 UTC 2025
➜ oc exec -it test-deployment-856597466f-j45cs -n node-maintenance-test -- date -u
Thu Jan 30 15:55:12 UTC 2025

Member Author


Also: what about adding the fix to the waitForEvent() function only, instead of everywhere it's called? When we have it at one place only, we also don't need this const with that weird name ;)

I tried it, and it also worked without having this subtle delay.
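
For illustration, a minimal sketch of what folding the recency check into a single waitForEvent() helper could look like (the signature, client wiring, and polling values are assumptions, not the operator's actual code):

```go
package e2e

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForEvent polls until an event matching the identifier and reason exists
// and was emitted after beginTime, so stale events from earlier maintenance
// windows are ignored. Interval and timeout values are illustrative only.
func waitForEvent(ctx context.Context, c kubernetes.Interface, namespace, eventIdentifier, eventReason string, beginTime time.Time) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 180*time.Second, true,
		func(ctx context.Context) (bool, error) {
			events, err := c.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, event := range events.Items {
				if strings.Contains(event.Name, eventIdentifier) &&
					event.Reason == eventReason &&
					event.LastTimestamp.Time.After(beginTime) {
					return true, nil
				}
			}
			return false, nil
		})
}
```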

@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

for _, event := range events.Items {
	if strings.Contains(event.Name, eventIdentifier) && event.Reason == eventReason {
		// new: only accept events emitted after the test began
		if event.LastTimestamp.Time.After(beginTime) {
			return true, nil
		}
	}
}
Member


oh, somehow I missed that startTime / beginTime (why 2 different names?) is completely new... so... I'm wondering: we add a new check to the conditions, and it will work better than without that check? 🤔

Member


Looking at the events of a failed job, there really is no 2nd succeeded event 🤷🏼‍♂️
Look for "reportingComponent": "NodeMaintenance"
Heads up, big file 😉

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/medik8s_node-maintenance-operator/143/pull-ci-medik8s-node-maintenance-operator-main-4.17-openshift-e2e/1884533178106908672/artifacts/openshift-e2e/gather-extra/artifacts/events.json

Also, see NHC logs here, no logs for sending the event for the 2nd maintenance, looks like pod eviction doesn't finish in time: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/medik8s_node-maintenance-operator/143/pull-ci-medik8s-node-maintenance-operator-main-4.17-openshift-e2e/1884533178106908672/artifacts/openshift-e2e/gather-extra/artifacts/pods/nmo-install_node-maintenance-operator-controller-manager-6cc47455d7-szf4l_manager.log

So probably something I usually try to avoid might help here: increase timeout
(and you can remove that starttime check...)

Member Author

@razo7 razo7 Feb 10, 2025


startTime / beginTime (why 2 different names) is completely new

No reason :)

Looking at the events of a failed job, there really is no 2nd succeeded event 🤷🏼‍♂️
Look for "reportingComponent": "NodeMaintenance"

Yes, but on the last succeeded test there are two succeeded events (one for test-1st-control-plane- and one for test-maintenance), and you may find them when looking for SucceedMaintenance. So in practice, it did help somehow.

see NHC logs here

NMO, I was confused for a second 😅

we add new check to the conditions, and it will work better than without that check?

Looking at this change, it is a bit odd that it would make a difference, as each E2E test verifies the events of a single NM CR, so the events of one CR won't collide with the events of another CR 🤔
But still, it did help.
I lean towards increasing the timeout, as it leaves us with much more confidence about what is going on, rather than the check on the event's time, which does help but raises doubts.

Member


Yes, but on the last succeeded test there are two succeeded events

I was referring to the failed tests

so in practice, it did help somehow

Why do you think so? IMHO that's impossible

@razo7 razo7 force-pushed the check-recent-event-e2e branch 3 times, most recently from fc6495c to d37110d on February 10, 2025 13:46
@razo7
Member Author

razo7 commented Feb 10, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

Trying again with the event's time check.
Test failed :(

@razo7 razo7 force-pushed the check-recent-event-e2e branch from d37110d to 47abfac on February 10, 2025 16:34
@razo7
Member Author

razo7 commented Feb 10, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

Now trying without the event's time check, only increasing the timeout instead.

k8s.io/utils/pointer is deprecated and replaced with k8s.io/utils/ptr
Increase the timeout for verifying events. Waiting for the success event, and evicting all the pods, may take more than 120 seconds
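A minimal before/after sketch of the pointer-to-ptr migration mentioned in the first commit above (the value is illustrative, not an actual call site):

```go
package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	// Before, with the deprecated package:
	//   replicas := pointer.Int32(1)
	// After, using the generic helper from k8s.io/utils/ptr:
	replicas := ptr.To(int32(1))
	fmt.Println(*replicas)
}
```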
@razo7 razo7 force-pushed the check-recent-event-e2e branch from 47abfac to 09aabfd on February 11, 2025 06:16
@razo7
Member Author

razo7 commented Feb 11, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

Oops, I forgot to include the timeout-increase commit. Now it is included.

@slintes
Member

slintes commented Feb 11, 2025

/lgtm
/test all

@openshift-ci openshift-ci bot added the lgtm label Feb 11, 2025
@razo7 razo7 marked this pull request as ready for review February 11, 2025 11:37
@openshift-ci openshift-ci bot requested review from beekhof and clobrano February 11, 2025 11:37
@openshift-merge-bot openshift-merge-bot bot merged commit 096feda into medik8s:main Feb 11, 2025
23 checks passed
@razo7 razo7 changed the title from "Check for Recent Event on E2E Tests" to "Extend Timeout for Events Verification on E2E Tests" Feb 11, 2025
@razo7 razo7 mentioned this pull request Feb 12, 2025
@razo7
Member Author

razo7 commented Feb 17, 2025

/cherry-pick release-0.18

@openshift-cherrypick-robot

@razo7: new pull request created: #146


In response to this:

/cherry-pick release-0.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
