🌱 fix(e2e): wait for leader election #1676
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
- Coverage   67.50%   67.48%   -0.03%     
==========================================
  Files          57       57              
  Lines        4632     4632              
==========================================
- Hits         3127     3126       -1     
- Misses       1278     1279       +1     
  Partials      227      227              
t.Log("Wait for operator-controller deployment to be ready")
managerPod := waitForDeployment(t, ctx, "operator-controller-controller-manager")

t.Log("Start measuring leader election time")
I think we need to be careful about how we measure the timing here. What we are measuring right now is the amount of time between:
- the test detecting that the operator-controller deployment is finished, and
- watchPodLogsForSubstring(leaderElectionCtx, managerPod, "manager", leaderSubstrings...) returning
This may correlate with the time taken for leader election, but not necessarily. E.g. let's say I upgrade the deployments, go out for lunch for 1h, then come back and run the post-upgrade test: the election would already be long over, so the measured time would say little about it.
Maybe it would be better to extract the timestamps from the first log line and the leader election log line instead?
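The timestamp idea above could be sketched roughly as follows. This is only an illustration, not code from the PR: `leaderElectionDelay` is a hypothetical helper, the sample log lines are made up, and it assumes each log line starts with an RFC 3339 timestamp followed by a space (as when pod logs are fetched with timestamps enabled).

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// leaderElectionDelay (hypothetical helper) returns the time elapsed between
// the first log line and the first line containing marker.
// Assumption: every non-empty line begins with an RFC 3339 timestamp
// followed by a single space.
func leaderElectionDelay(logs, marker string) (time.Duration, error) {
	var start time.Time
	for _, line := range strings.Split(logs, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		stamp, _, ok := strings.Cut(line, " ")
		if !ok {
			return 0, fmt.Errorf("no timestamp in line %q", line)
		}
		ts, err := time.Parse(time.RFC3339Nano, stamp)
		if err != nil {
			return 0, fmt.Errorf("parse timestamp in %q: %w", line, err)
		}
		if start.IsZero() {
			start = ts // remember the timestamp of the first log line
		}
		if strings.Contains(line, marker) {
			return ts.Sub(start), nil
		}
	}
	return 0, fmt.Errorf("marker %q not found", marker)
}

func main() {
	// Made-up sample logs, for illustration only.
	logs := "2025-01-30T10:00:00Z starting manager\n" +
		"2025-01-30T10:00:05Z waiting for lease\n" +
		"2025-01-30T10:00:42Z successfully acquired lease\n"

	d, err := leaderElectionDelay(logs, "successfully acquired lease")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("leader election took", d)
}
```

Because both timestamps come from the pod itself, the result does not depend on when the test happens to run.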
Your comment makes 100% sense.
To keep things simple and focused on the goal of this PR, I’ve removed the measurement aspect. Whether we want to include it as info or debug output, or settle on a specific measurement approach, is a separate discussion. For now, let’s stay within the scope of this change: fixing the test flake and unblocking progress.
TestClusterExtensionAfterOLMUpgrade was failing due to increased leader election timeouts, causing reconciliation checks to run before leadership was acquired. This fix ensures the test explicitly waits for leader election logs (`"successfully acquired lease"`) before verifying reconciliation.
lgtm ^^
t.Log("Wait for acquired leader election")
// Average case is under 1 minute but in the worst case: (previous leader crashed)
// we could have LeaseDuration (137s) + RetryPeriod (26s) +/- 163s
leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
I am assuming 3 minutes is the worst case scenario. I am not familiar with context.WithTimeout; does it return if we acquire the lease before 163s?
context.WithTimeout just gives you a context that times out (gets cancelled) after the timeout period.
This means that the call to watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...) will return with an error after 3 minutes if it hasn't already returned by then.
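To make the timeout behaviour concrete, here is a minimal standalone sketch. `waitWithTimeout` is a hypothetical stand-in for a blocking call such as watchPodLogsForSubstring, not the helper from this PR: it returns as soon as the awaited event arrives and only fails once the deadline passes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitWithTimeout (hypothetical stand-in for a blocking watch call) waits
// until ready is closed or the timeout elapses, whichever comes first.
func waitWithTimeout(timeout time.Duration, ready <-chan struct{}) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // release the timer even when we return early

	select {
	case <-ready:
		return nil // event arrived before the deadline
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the timeout hits
	}
}

func main() {
	// Case 1: the event arrives long before the deadline, so the call
	// returns early instead of waiting out the full timeout.
	ready := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(ready)
	}()
	start := time.Now()
	err := waitWithTimeout(time.Second, ready)
	fmt.Println("early return:", err == nil, "finished before deadline:", time.Since(start) < time.Second)

	// Case 2: the event never arrives, so the call fails at the deadline.
	err = waitWithTimeout(50*time.Millisecond, make(chan struct{}))
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

In other words, the 3-minute value is only an upper bound; a function that honours the context returns immediately once its own condition is met.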
OK, looks like it is a straightforward timeout method.
defer leaderCancel()

leaderSubstrings := []string{"successfully acquired lease"}
leaderElected, err := watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...)
Scraping the logs seems brittle.
Would it be better to use a Watch on the leader election Lease? We could use Leases from CoordinationV1Client in "k8s.io/client-go/kubernetes/typed/coordination/v1".
I realize it's also longer and more code, but the upside is that it reacts right away, like watching the pod log, without depending on log strings that could change at some point and break our tests for reasons outside our control.
That's a good idea! If this work is blocking CI, I'd say merge it as it is, then follow up with the watch ^^
Yes, I agree that we could do something fancier.
But we check the logs in many places, including further below in this test.
We can see whether to improve this afterwards, but there is no reason for us to keep facing the pain of the flake.
TestClusterExtensionAfterOLMUpgrade was failing due to increased leader election timeouts, causing reconciliation checks to run before leadership was acquired.
This fix ensures the test explicitly waits for leader election logs ("successfully acquired lease") before verifying reconciliation.
Example: https://github.com/operator-framework/operator-controller/actions/runs/13047935813/job/36401741998
Logs from operator-controller: