
Conversation

@razo7
Member

@razo7 razo7 commented Jan 30, 2025

Why we need this PR

  • Fix the flaky E2E test that fails because the succeeded maintenance event is missing.
  • Extend the verification timeout for existing/non-existing events from 120 sec to 180 sec, since the success event might be emitted only after 2 minutes.

Changes made

  • Extend the events timeout from 120 sec to 180 sec (see the sketch below).
  • Small fixes, including replacing deprecated modules.
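
A minimal sketch of the timeout change, assuming the constants stay where the E2E test code already declares them (the package name and the comment are illustrative, not the actual file):

```go
package e2e

import "time"

const (
	eventInterval = time.Second * 10
	// extended from 120 to 180 seconds: the success event can arrive after
	// the old deadline when pod eviction is slow
	timeout        = time.Second * 180
	testDeployment = "test-deployment"
)
```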

Which issue(s) this PR fixes

Test plan

@openshift-ci
Contributor

openshift-ci bot commented Jan 30, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.17-openshift-e2e

@razo7 razo7 changed the title from "Check for recent event on e2e tests" to "Check for Recent Event on E2E Tests" Jan 30, 2025
@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.17-openshift-e2e

@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.18-openshift-e2e

eventInterval = time.Second * 10
timeout = time.Second * 120
testDeployment = "test-deployment"
// sometimes an event is emitted so quickly that it races with the current time which is
Member


I doubt that quick events are the reason. There should be enough time between creating a CR, and the controller reconciling it and emitting the event. I think it's more likely that the time on the controller pod and the test pod are out of sync...?
Also: what about adding the fix to the waitForEvent() function only, instead of everywhere it's called? When we have it at one place only, we also don't need this const with that weird name ;)

Member Author


I think it's more likely that the time on the controller pod and the test pod are out of sync?

I have checked it locally with the following commands and on an AWS cluster bot cluster, and there is no significant clock skew between them.

➜ oc exec -it node-maintenance-operator-controller-manager-5bcf79f46-wbpn4 -n openshift-workload-availability -- date -u
Thu Jan 30 15:55:08 UTC 2025
➜ oc exec -it test-deployment-856597466f-j45cs -n node-maintenance-test -- date -u
Thu Jan 30 15:55:12 UTC 2025

Member Author


Also: what about adding the fix to the waitForEvent() function only, instead of everywhere it's called? When we have it at one place only, we also don't need this const with that weird name ;)

I tried it, and it also worked without having this subtle delay.
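
For illustration, a minimal sketch of what folding the recency check into a single waitForEvent() helper could look like (the signature, client wiring, and polling values are assumptions, not the operator's actual code):

```go
package e2e

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForEvent polls until an event matching the identifier and reason exists
// and was emitted after beginTime, so stale events from earlier maintenance
// windows are ignored. Interval and timeout values are illustrative only.
func waitForEvent(ctx context.Context, c kubernetes.Interface, namespace, eventIdentifier, eventReason string, beginTime time.Time) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 180*time.Second, true,
		func(ctx context.Context) (bool, error) {
			events, err := c.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, event := range events.Items {
				if strings.Contains(event.Name, eventIdentifier) &&
					event.Reason == eventReason &&
					event.LastTimestamp.Time.After(beginTime) {
					return true, nil
				}
			}
			return false, nil
		})
}
```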

@razo7
Member Author

razo7 commented Jan 30, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

for _, event := range events.Items {
	if strings.Contains(event.Name, eventIdentifier) && event.Reason == eventReason {
		// new: only accept events emitted after the test began
		if event.LastTimestamp.Time.After(beginTime) {
			return true, nil
		}
	}
}
Member


oh, somehow I missed that startTime / beginTime (why 2 different names?) is completely new... so... I'm wondering: we add a new check to the conditions, and it will work better than without that check? 🤔

Member


Looking at the events of a failed job, there really is no 2nd succeeded event 🤷🏼‍♂️
Look for "reportingComponent": "NodeMaintenance"
Heads up, big file 😉

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/medik8s_node-maintenance-operator/143/pull-ci-medik8s-node-maintenance-operator-main-4.17-openshift-e2e/1884533178106908672/artifacts/openshift-e2e/gather-extra/artifacts/events.json

Also, see NHC logs here, no logs for sending the event for the 2nd maintenance, looks like pod eviction doesn't finish in time: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/medik8s_node-maintenance-operator/143/pull-ci-medik8s-node-maintenance-operator-main-4.17-openshift-e2e/1884533178106908672/artifacts/openshift-e2e/gather-extra/artifacts/pods/nmo-install_node-maintenance-operator-controller-manager-6cc47455d7-szf4l_manager.log

So probably something I usually try to avoid might help here: increase timeout
(and you can remove that starttime check...)

Member Author

@razo7 razo7 Feb 10, 2025


startTime / beginTime (why 2 different names) is completely new

No reason :)

Looking at the events of a failed job, there really is no 2nd succeeded event 🤷🏼‍♂️
Look for "reportingComponent": "NodeMaintenance"

Yes, but on the last succeeded test there are two succeeded events (one for test-1st-control-plane- and one for test-maintenance), and you may find them when looking for SucceedMaintenance. So in practice, it did help somehow.

see NHC logs here

NMO, I was confused for a second 😅

we add new check to the conditions, and it will work better than without that check?

Looking at this change, it is a bit odd that it would make a difference, as each E2E test verifies the events of a single NM CR, so the events of one CR won't collide with the events of another CR 🤔
But still, it did help.
I lean towards increasing the timeout, as it leaves us with much more confidence about what is going on, rather than the check on the event's time, which does help but raises doubts.

Member


Yes, but on the last succeeded test there are two succeeded events

I was referring to the failed tests

so in practice, it did help somehow

Why do you think so? IMHO that's impossible

@razo7 razo7 force-pushed the check-recent-event-e2e branch 3 times, most recently from fc6495c to d37110d on February 10, 2025 13:46
@razo7
Member Author

razo7 commented Feb 10, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

Trying again with the event's time check.
Test failed :(

@razo7 razo7 force-pushed the check-recent-event-e2e branch from d37110d to 47abfac on February 10, 2025 16:34
@razo7
Member Author

razo7 commented Feb 10, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

Now trying without the event's time check, only increasing the timeout instead.

k8s.io/utils/pointer is deprecated and replaced with k8s.io/utils/ptr
Increase the timeout for verifying events. Waiting for the success event, and evicting all the pods, may take more than 120 seconds
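A minimal before/after sketch of the pointer-to-ptr migration mentioned in the first commit above (the value is illustrative, not an actual call site):

```go
package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	// Before, with the deprecated package:
	//   replicas := pointer.Int32(1)
	// After, using the generic helper from k8s.io/utils/ptr:
	replicas := ptr.To(int32(1))
	fmt.Println(*replicas)
}
```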
@razo7 razo7 force-pushed the check-recent-event-e2e branch from 47abfac to 09aabfd on February 11, 2025 06:16
@razo7
Member Author

razo7 commented Feb 11, 2025

/test 4.17-openshift-e2e
/test 4.18-openshift-e2e

Oops, I forgot to include the timeout-increase commit. Now it is included.

@slintes
Member

slintes commented Feb 11, 2025

/lgtm
/test all

@openshift-ci openshift-ci bot added the lgtm label Feb 11, 2025
@razo7 razo7 marked this pull request as ready for review February 11, 2025 11:37
@openshift-ci openshift-ci bot requested review from beekhof and clobrano February 11, 2025 11:37
@openshift-merge-bot openshift-merge-bot bot merged commit 096feda into medik8s:main Feb 11, 2025
23 checks passed
@razo7 razo7 changed the title from "Check for Recent Event on E2E Tests" to "Extend Timeout for Events Verification on E2E Tests" Feb 11, 2025
@razo7 razo7 mentioned this pull request Feb 12, 2025
@razo7
Member Author

razo7 commented Feb 17, 2025

/cherry-pick release-0.18

@openshift-cherrypick-robot

@razo7: new pull request created: #146


In response to this:

/cherry-pick release-0.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
