Extend Timeout for Events Verification on E2E Tests #144
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing…
/test 4.17-openshift-e2e

/test 4.17-openshift-e2e

/test 4.18-openshift-e2e
test/e2e/node_maintenance_test.go (outdated)

```go
eventInterval  = time.Second * 10
timeout        = time.Second * 120
testDeployment = "test-deployment"
// sometimes an event is emitted so quickly that it races with the current time which is
```
I doubt that quick events are the reason. There should be enough time between creating a CR, and the controller reconciling it and emitting the event. I think it's more likely that the time on the controller pod and the test pod are out of sync...?
Also: what about adding the fix to the waitForEvent() function only, instead of everywhere it's called? When we have it at one place only, we also don't need this const with that weird name ;)
I think it's more likely that the time on the controller pod and the test pod are out of sync?
I have checked it locally with the following commands, and on an AWS cluster-bot cluster, and there is no significant clock skew between them.
➜ oc exec -it node-maintenance-operator-controller-manager-5bcf79f46-wbpn4 -n openshift-workload-availability -- date -u
Thu Jan 30 15:55:08 UTC 2025
➜ oc exec -it test-deployment-856597466f-j45cs -n node-maintenance-test -- date -u
Thu Jan 30 15:55:12 UTC 2025
Also: what about adding the fix to the waitForEvent() function only, instead of everywhere it's called? When we have it at one place only, we also don't need this const with that weird name ;)
I tried it, and it also worked without that subtle delay.
/test 4.17-openshift-e2e
test/e2e/node_maintenance_test.go (outdated)

```go
for _, event := range events.Items {
	if strings.Contains(event.Name, eventIdentifier) && event.Reason == eventReason {
		if event.LastTimestamp.Time.After(beginTime) {
			return true, nil
```
oh, somehow I missed that startTime / beginTime (why 2 different names?) is completely new... so I'm wondering: we add a new check to the conditions, and it will work better than without that check? 🤔
Looking at the events of a failed job, there really is no 2nd succeeded event 🤷🏼♂️
Look for "reportingComponent": "NodeMaintenance"
Heads up, big file 😉
Also, see NHC logs here, no logs for sending the event for the 2nd maintenance, looks like pod eviction doesn't finish in time: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/medik8s_node-maintenance-operator/143/pull-ci-medik8s-node-maintenance-operator-main-4.17-openshift-e2e/1884533178106908672/artifacts/openshift-e2e/gather-extra/artifacts/pods/nmo-install_node-maintenance-operator-controller-manager-6cc47455d7-szf4l_manager.log
So probably something I usually try to avoid might help here: increasing the timeout
(and you can remove that starttime check...)
startTime / beginTime (why 2 different names) is completely new
No reason :)
Looking at the events of a failed job, there really is no 2nd succeeded event 🤷🏼♂️
Look for "reportingComponent": "NodeMaintenance"
Yes, but on the last succeeded test there are two succeeded events (one for test-1st-control-plane- and one for test-maintenance), and you may find them by looking for SucceedMaintenance. So in practice, it did help somehow.
see NHC logs here
NMO, I was confused for a second 😅
we add a new check to the conditions, and it will work better than without that check?
Looking at this change, it is a bit odd that it would make a difference: each e2e test verifies the events of a singleton NM CR, so the events of one CR won't collide with the events of another CR 🤔
But still, it did help.
I lean towards increasing the timeout, as it leaves us with much more confidence about what is going on, whereas the check on the event's time does help but raises doubts.
Yes, but on the last succeeded test there are two succeeded events
I was referring to the failed tests
so in practice, it did help somehow
Why do you think so? IMHO that's impossible
Force-pushed fc6495c to d37110d.
/test 4.17-openshift-e2e

Trying again with the event's time check.
Force-pushed d37110d to 47abfac.
/test 4.17-openshift-e2e

Now trying without the event's time check, and with only the increased timeout.
k8s.io/utils/pointer is deprecated and replaced with k8s.io/utils/ptr
Increase timeout for verifying events. Waiting for the success event, and evicting all the pods, may take more than 120 seconds.
Force-pushed 47abfac to 09aabfd.
/test 4.17-openshift-e2e

Oops, I forgot to add the timeout-increase commit. Now it is included.
/lgtm
/cherry-pick release-0.18
@razo7: new pull request created: #146

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Why we need this PR
Changes made
Which issue(s) this PR fixes
Test plan