-
Notifications
You must be signed in to change notification settings - Fork 6.1k
*: stabilize flaky unit tests #65674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. |
2a6f984 to
370a560
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR stabilizes flaky unit tests across three packages by replacing timing assumptions with proper synchronization primitives and making tests more deterministic.
Changes:
- Replaced fixed sleep calls with
require.Eventuallyfor asynchronous operations in task register and memory tuner tests - Added failpoint support for controlling retry intervals in task register to enable deterministic testing
- Changed TLS rotation tests to check server-side error logs instead of unstable client-side error messages
- Introduced per-test HTTP mux instead of global state to avoid test interference
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| br/pkg/utils/register.go | Added sleepRetryInterval helper with failpoint support and refactored error injection to prevent overwriting real errors |
| br/pkg/utils/register_test.go | Replaced fixed sleeps with require.Eventually and added retry interval failpoint for deterministic testing |
| pkg/util/gctuner/memory_limit_tuner_test.go | Replaced strict equality checks with require.Eventually and increased timeouts for slower environments |
| pkg/util/security_test.go | Introduced thread-safe error logging buffer and switched from client-side to server-side TLS error checking |
|
/check-issue-triage-complete |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #65674 +/- ##
================================================
- Coverage 77.8195% 77.5654% -0.2542%
================================================
Files 1989 1919 -70
Lines 543412 533539 -9873
================================================
- Hits 422881 413842 -9039
- Misses 118872 119593 +721
+ Partials 1659 104 -1555
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/test check-issue-triage-complete |
|
@wjhuang2016: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@wjhuang2016: The specified target(s) for Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test check-issue-triage-complete |
|
@wjhuang2016: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@wjhuang2016: The specified target(s) for Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test mysql-test |
|
@wjhuang2016: The specified target(s) for Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest |
|
@pantheon-bot please take a comprehensive bug hunt review |
Bug Hunt Review ResultsFound 3 P2 issues (test quality/reliability): P2-1: Unsafe
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
|
@wjhuang2016: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Split into per-case issues/PRs:\n- #65732 (br/pkg/utils task register tests) closes #65726\n- #65733 (pkg/util/gctuner TestIssue48741) closes #65727\n- #65734 (pkg/util TLS tests) closes #65728\n- #65735 (pkg/server/handler/optimizor plan replayer tests) closes #65729\n- #65736 (pkg/server/handler/optimizor stats handler test) closes #65730\n- #65737 (tests/realtikvtest/txntest staleness txn test) closes #65731 |
What problem does this PR solve?
Issue Number: close #65673, close #65679
Problem Summary:
Unit tests were flaky due to timing assumptions, async state propagation, and shared global state.
What changed and how does it work?
Case 1:
br/pkg/utils/register_test.go(task register keepalive)Why it was flaky:
time.Sleep(RegisterRetryInternal)while etcd lease keepalive/retry is async and timing-sensitive.GetImportTasksFromcan transiently fail during TTL query (e.g.requested lease not found) when the lease is revoked/expired.Fix:
taskRegister.sleepRetryIntervaland use it in retry paths, with a failpoint (brie-task-register-retry-interval) to shorten retry sleeps in tests.require.Eventuallychecks.requested lease not foundonClose.Case 2:
pkg/util/gctuner/memory_limit_tuner_test.go(TestIssue48741)Why it was flaky:
debug.SetMemoryLimitupdates and GC triggering/adjusting are not strictly synchronous; 500ms timeouts were too aggressive on slower hosts.Fix:
require.Eventuallytimeouts.require.Equalchecks ondebug.SetMemoryLimit(-1)withrequire.Eventually.Case 3:
pkg/util/security_test.go(TestVerifyCommonNameAndRotate)Why it was flaky:
http.DefaultServeMuxcan cause cross-test interference.Fix:
http.ServeMux.http.Server.ErrorLogand assert on the server-side error message (client certificate authentication failed).Case 4:
pkg/server/handler/optimizor/plan_replayer_test.go(plan replayer + bindings)Why it was flaky:
server.RunInGoTestChanis a global channel and is closed after the first server starts; subsequent server starts in the same test binary could race if the channel is not recreated.PLAN REPLAYER LOAD, the session default DB can change; unqualified queries may hit the wrong schema and miss the intended binding.@@last_plan_from_bindingchecks can be racy.LOADin the same server if not cleaned up.Fix:
server.RunInGoTestChanbefore server starts so<-RunInGoTestChanreliably waits for the current instance.USE planReplayerafterPLAN REPLAYER LOADand enable plan baselines.require.Eventuallyfor the binding effect check.Case 5:
pkg/server/handler/optimizor/statistics_handler_test.go(TestDumpStatsAPI)Why it was flaky:
RunInGoTestChancan lose the startup barrier and race on readingListenAddr/StatusListenerAddr.Fix:
server.RunInGoTestChanbefore starting the server.Case 6:
tests/realtikvtest/txntest/stale_read_test.go(TestValidateReadOnlyInStalenessTransaction)Why it was flaky:
AS OF TIMESTAMP NOW()can be too new under safeTS/scheduling lag; some environments need a small time buffer to make stale-read transactions stable.Fix:
NOW(3) - INTERVAL 1 SECOND.Check List
Tests
make failpoint-enable && go test ./br/pkg/utils -run '^TestTaskRegister' -count=10 -tags=intestmake failpoint-enable && go test ./pkg/util/gctuner -run '^TestIssue48741$' -count=5 -tags=intestgo test ./pkg/util -run '^(TestInvalidTLS|TestVerifyCommonNameAndRotate|TestTLSVersion|TestCA)$' -count=10 -tags=intestgo test ./pkg/server/handler/optimizor -run '^TestPlanReplayerWithMultiForeignKey$' -count=50 -tags=intest -timeout 30mgo test ./tests/realtikvtest/txntest -run '^TestValidateReadOnlyInStalenessTransaction$' -count=10 -tags=intest -timeout 60mgo test ./tests/realtikvtest/addindextest -run '^TestCreateNonUniqueIndex$' -count=30 -tags=intest -timeout 240mSide effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.