*: stabilize flaky unit tests #65674

wjhuang2016 · 2026-01-20T08:29:17Z

What problem does this PR solve?

Issue Number: close #65673, close #65679

Problem Summary:

Unit tests were flaky due to timing assumptions, async state propagation, and shared global state.

What changed and how does it work?

Case 1: `br/pkg/utils/register_test.go` (task register keepalive)

Why it was flaky:

- The tests used `time.Sleep(RegisterRetryInternal)` while etcd lease keepalive/retry is async and timing-sensitive.
- `GetImportTasksFrom` can transiently fail during TTL query (e.g. `requested lease not found`) when the lease is revoked/expired.

Fix:

- Add `taskRegister.sleepRetryInterval` and use it in retry paths, with a failpoint (`brie-task-register-retry-interval`) to shorten retry sleeps in tests.
- Replace fixed sleeps with `require.Eventually` checks.
- Make the tests tolerant to `requested lease not found` on `Close`.

Case 2: `pkg/util/gctuner/memory_limit_tuner_test.go` (`TestIssue48741`)

Why it was flaky:

`debug.SetMemoryLimit` updates and GC triggering/adjusting are not strictly synchronous; 500ms timeouts were too aggressive on slower hosts.

Fix:

- Relax `require.Eventually` timeouts.
- Replace strict instantaneous `require.Equal` checks on `debug.SetMemoryLimit(-1)` with `require.Eventually`.

Case 3: `pkg/util/security_test.go` (`TestVerifyCommonNameAndRotate`)

Why it was flaky:

- The client-side TLS error string is not stable across Go versions/handshake timing.
- Using the global `http.DefaultServeMux` can cause cross-test interference.

Fix:

- Use a per-test `http.ServeMux`.
- Capture server-side TLS handshake errors via `http.Server.ErrorLog` and assert on the server-side error message (`client certificate authentication failed`).
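
A minimal sketch of both fixes, using only the standard library. It is not the test from the PR: `syncBuffer` and `handshakeErrorLogged` are hypothetical names, `httptest` replaces the real server setup, and the check looks for the generic `TLS handshake error` prefix that `net/http` logs rather than the project-specific message quoted above.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// syncBuffer is a mutex-guarded buffer, so the server's goroutines can log
// into it while the test reads it.
type syncBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *syncBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *syncBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

// handshakeErrorLogged starts a TLS server that requires a client
// certificate, connects without one, and reports whether the server logged
// a handshake error.
func handshakeErrorLogged() bool {
	mux := http.NewServeMux() // per-test mux, nothing registered globally
	mux.HandleFunc("/ping", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "pong")
	})

	logs := &syncBuffer{}
	srv := httptest.NewUnstartedServer(mux)
	srv.Config.ErrorLog = log.New(logs, "", 0)
	srv.TLS = &tls.Config{ClientAuth: tls.RequireAndVerifyClientCert}
	srv.StartTLS()
	defer srv.Close()

	// srv.Client() trusts the server certificate but presents no client
	// certificate. The client-side error text is deliberately ignored.
	if resp, err := srv.Client().Get(srv.URL + "/ping"); err == nil {
		resp.Body.Close()
	}

	// The server logs asynchronously, so poll instead of checking once.
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if strings.Contains(logs.String(), "TLS handshake error") {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

func main() {
	fmt.Println("server logged handshake error:", handshakeErrorLogged())
}
```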

Case 4: `pkg/server/handler/optimizor/plan_replayer_test.go` (plan replayer + bindings)

Why it was flaky:

- `server.RunInGoTestChan` is a global channel and is closed after the first server starts; subsequent server starts in the same test binary could race if the channel is not recreated.
- After `PLAN REPLAYER LOAD`, the session default DB can change; unqualified queries may hit the wrong schema and miss the intended binding.
- Binding creation/reload and cache population is async; immediate `@@last_plan_from_binding` checks can be racy.
- Dump-created global bindings can interfere with later `LOAD` in the same server if not cleaned up.

Fix:

- Recreate `server.RunInGoTestChan` before server starts so `<-RunInGoTestChan` reliably waits for the current instance.
- Explicitly `USE planReplayer` after `PLAN REPLAYER LOAD` and enable plan baselines.
- Use `require.Eventually` for the binding effect check.
- Adjust binding cache quota during the test and restore it afterward.
- Drop the dump-created global binding to avoid later interference.

Case 5: `pkg/server/handler/optimizor/statistics_handler_test.go` (`TestDumpStatsAPI`)

Why it was flaky:

Same class as Case 4: waiting on a previously-closed global `RunInGoTestChan` can lose the startup barrier and race on reading `ListenAddr`/`StatusListenerAddr`.

Fix:

Recreate `server.RunInGoTestChan` before starting the server.
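
The lost startup barrier can be reproduced in miniature. The sketch below is a simplified model, not TiDB code: `runInTestChan`, `fakeServer`, and `start` are hypothetical names, and an empty address string stands in for the uninitialized listener.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runInTestChan is a hypothetical package-level "server is up" channel,
// closed by the first server that finishes starting.
var runInTestChan = make(chan struct{})

type fakeServer struct {
	mu   sync.Mutex
	addr string
	done chan struct{}
}

// run simulates slow listener setup, then signals readiness on the global
// channel unless an earlier server already closed it.
func (s *fakeServer) run() {
	defer close(s.done)
	time.Sleep(20 * time.Millisecond)
	s.mu.Lock()
	s.addr = "127.0.0.1:4000"
	s.mu.Unlock()
	select {
	case <-runInTestChan: // already closed
	default:
		close(runInTestChan)
	}
}

func (s *fakeServer) listenAddr() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.addr
}

// start launches a server, waits on the global channel, and returns the
// address visible at that moment. With recreate set, the channel is replaced
// first, so the wait applies to this server instance.
func start(recreate bool) string {
	if recreate {
		runInTestChan = make(chan struct{})
	}
	s := &fakeServer{done: make(chan struct{})}
	go s.run()
	<-runInTestChan
	addr := s.listenAddr()
	<-s.done // let the goroutine finish so consecutive starts do not overlap
	return addr
}

func main() {
	fmt.Printf("first start:           %q\n", start(false))
	// The channel is already closed here, so the wait returns at once and
	// the address is most likely still empty.
	fmt.Printf("second, stale channel: %q\n", start(false))
	fmt.Printf("third, recreated:      %q\n", start(true))
}
```

Recreating the channel is only safe when no earlier server goroutine is still about to close it, which is why the sketch waits on `done` between starts.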

Case 6: `tests/realtikvtest/txntest/stale_read_test.go` (`TestValidateReadOnlyInStalenessTransaction`)

Why it was flaky:

`AS OF TIMESTAMP NOW()` can be too new under safeTS/scheduling lag; some environments need a small time buffer to make stale-read transactions stable.

Fix:

Use a slightly older timestamp expression: `NOW(3) - INTERVAL 1 SECOND`.

Check List

Tests

Unit test
- make failpoint-enable && go test ./br/pkg/utils -run '^TestTaskRegister' -count=10 -tags=intest
- make failpoint-enable && go test ./pkg/util/gctuner -run '^TestIssue48741$' -count=5 -tags=intest
- go test ./pkg/util -run '^(TestInvalidTLS|TestVerifyCommonNameAndRotate|TestTLSVersion|TestCA)$' -count=10 -tags=intest
- go test ./pkg/server/handler/optimizor -run '^TestPlanReplayerWithMultiForeignKey$' -count=50 -tags=intest -timeout 30m
Integration test
- go test ./tests/realtikvtest/txntest -run '^TestValidateReadOnlyInStalenessTransaction$' -count=10 -tags=intest -timeout 60m
- go test ./tests/realtikvtest/addindextest -run '^TestCreateNonUniqueIndex$' -count=30 -tags=intest -timeout 240m
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2026-01-20T08:29:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign leavrth for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
br/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR stabilizes flaky unit tests across three packages by replacing timing assumptions with proper synchronization primitives and making tests more deterministic.

Changes:

Replaced fixed sleep calls with require.Eventually for asynchronous operations in task register and memory tuner tests
Added failpoint support for controlling retry intervals in task register to enable deterministic testing
Changed TLS rotation tests to check server-side error logs instead of unstable client-side error messages
Introduced per-test HTTP mux instead of global state to avoid test interference

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| br/pkg/utils/register.go | Added `sleepRetryInterval` helper with failpoint support and refactored error injection to prevent overwriting real errors |
| br/pkg/utils/register_test.go | Replaced fixed sleeps with `require.Eventually` and added retry interval failpoint for deterministic testing |
| pkg/util/gctuner/memory_limit_tuner_test.go | Replaced strict equality checks with `require.Eventually` and increased timeouts for slower environments |
| pkg/util/security_test.go | Introduced thread-safe error logging buffer and switched from client-side to server-side TLS error checking |

wjhuang2016 · 2026-01-20T08:39:46Z

/check-issue-triage-complete

codecov · 2026-01-20T08:59:39Z

Codecov Report

❌ Patch coverage is 76.31579% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.5654%. Comparing base (8e961f7) to head (4f383f2).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #65674        +/-   ##
================================================
- Coverage   77.8195%   77.5654%   -0.2542%     
================================================
  Files          1989       1919        -70     
  Lines        543412     533539      -9873     
================================================
- Hits         422881     413842      -9039     
- Misses       118872     119593       +721     
+ Partials       1659        104      -1555

Flag	Coverage Δ
integration	`41.5227% <0.0000%> (-6.6647%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`56.7974% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`48.7696% <76.3157%> (-12.2794%)`	⬇️


wjhuang2016 · 2026-01-20T11:36:26Z

/test check-issue-triage-complete

ti-chi-bot · 2026-01-20T11:36:31Z

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build

/test check-dev

/test check-dev2

/test mysql-test

/test pull-br-integration-test

/test pull-build-next-gen

/test pull-integration-ddl-test

/test pull-integration-e2e-test

/test pull-integration-realcluster-test-next-gen

/test pull-lightning-integration-test

/test pull-mysql-client-test

/test pull-mysql-client-test-next-gen

/test pull-unit-test-ddlv1

/test pull-unit-test-next-gen

/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test

/test pull-br-integration-test-next-gen

/test pull-check-deps

/test pull-common-test

/test pull-e2e-test

/test pull-error-log-review

/test pull-integration-common-test

/test pull-integration-copr-test

/test pull-integration-ddl-test-next-gen

/test pull-integration-e2e-test-next-gen

/test pull-integration-jdbc-test

/test pull-integration-mysql-test

/test pull-integration-nodejs-test

/test pull-integration-python-orm-test

/test pull-mysql-test-next-gen

/test pull-sqllogic-test

/test pull-tiflash-integration-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build

pingcap/tidb/ghpr_check

pingcap/tidb/ghpr_check2

pingcap/tidb/ghpr_mysql_test

pingcap/tidb/ghpr_unit_test

pingcap/tidb/pull_br_integration_test

pingcap/tidb/pull_build_next_gen

pingcap/tidb/pull_integration_ddl_test

pingcap/tidb/pull_integration_e2e_test

pingcap/tidb/pull_integration_realcluster_test_next_gen

pingcap/tidb/pull_mysql_client_test

pingcap/tidb/pull_mysql_client_test_next_gen

pingcap/tidb/pull_unit_test_next_gen

pull-check-deps

pull-error-log-review

Details

In response to this:

/test check-issue-triage-complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tiprow · 2026-01-20T11:36:33Z

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow

/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test check-issue-triage-complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wjhuang2016 · 2026-01-20T11:38:03Z

/test check-issue-triage-complete

ti-chi-bot · 2026-01-20T11:38:07Z

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build

/test check-dev

/test check-dev2

/test mysql-test

/test pull-br-integration-test

/test pull-build-next-gen

/test pull-integration-ddl-test

/test pull-integration-e2e-test

/test pull-integration-realcluster-test-next-gen

/test pull-lightning-integration-test

/test pull-mysql-client-test

/test pull-mysql-client-test-next-gen

/test pull-unit-test-ddlv1

/test pull-unit-test-next-gen

/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test

/test pull-br-integration-test-next-gen

/test pull-check-deps

/test pull-common-test

/test pull-e2e-test

/test pull-error-log-review

/test pull-integration-common-test

/test pull-integration-copr-test

/test pull-integration-ddl-test-next-gen

/test pull-integration-e2e-test-next-gen

/test pull-integration-jdbc-test

/test pull-integration-mysql-test

/test pull-integration-nodejs-test

/test pull-integration-python-orm-test

/test pull-mysql-test-next-gen

/test pull-sqllogic-test

/test pull-tiflash-integration-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build

pingcap/tidb/ghpr_check

pingcap/tidb/ghpr_check2

pingcap/tidb/ghpr_mysql_test

pingcap/tidb/ghpr_unit_test

pingcap/tidb/pull_br_integration_test

pingcap/tidb/pull_build_next_gen

pingcap/tidb/pull_integration_ddl_test

pingcap/tidb/pull_integration_e2e_test

pingcap/tidb/pull_integration_realcluster_test_next_gen

pingcap/tidb/pull_mysql_client_test

pingcap/tidb/pull_mysql_client_test_next_gen

pingcap/tidb/pull_unit_test_next_gen

pull-check-deps

pull-error-log-review

Details

In response to this:

/test check-issue-triage-complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tiprow · 2026-01-20T11:38:09Z

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow

/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test check-issue-triage-complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wjhuang2016 · 2026-01-20T11:42:30Z

/test mysql-test

tiprow · 2026-01-20T11:42:36Z

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow

/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test mysql-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wjhuang2016 · 2026-01-20T12:37:09Z

/retest

IANTHEREAL · 2026-01-22T11:02:28Z

@pantheon-bot please take a comprehensive bug hunt review

pantheon-ai · 2026-01-22T11:02:37Z

Bug Hunt Review Results

Found 3 P2 issues (test quality/reliability):

P2-1: Unsafe `require.Eventually` with `MustExec`/`MustQuery` calls

Problem: require.Eventually runs its condition in a goroutine. When tk.MustQuery/tk.MustExec are called inside the condition, transient SQL errors trigger FailNow() off the test goroutine, causing the condition goroutine to exit without returning a boolean. This results in misleading timeouts (5-10s) with "Condition never satisfied" instead of the actual SQL error.

Locations:

pkg/executor/test/splittest/split_table_test.go:55-56
pkg/testkit/testkit.go:235-236

Impact: Makes flaky tests harder to debug; test failures show generic timeout rather than root cause.

Recommendation: Use require.EventuallyWithT or assert.CollectT, or refactor conditions to pure boolean checks without Must* calls.

P2-2: `require.Eventually` conditions swallow unexpected etcd errors

Problem: In br/pkg/utils/register_test.go, the require.Eventually conditions only check for "requested lease not found" errors via string matching and return false for all other errors. This means unexpected etcd errors (e.g., network failures, auth errors) are silently discarded and manifest as generic timeouts instead of actionable failure messages.

Locations:

br/pkg/utils/register_test.go:114 (TestTaskRegisterFailedGrant)
br/pkg/utils/register_test.go:167 (TestTaskRegisterFailedReput)

Impact: Reduces test failure diagnostics quality; developers see timeout instead of actual error.

Recommendation: Capture and assert on the last error, or use EventuallyWithT/Eventuallyf; consider using typed error matching (rpctypes.ErrLeaseNotFound) instead of string matching.

P2-3: `statistics_handler_test.go` nil-deref panic risk with global `RunInGoTestChan`

Problem: server.RunInGoTestChan is a global channel closed on first (*server.Server).Run() in the test binary. Subsequent server starts in the same test process wait on an already-closed channel, so <-server.RunInGoTestChan returns immediately without synchronizing with the new server's listener initialization. The immediately-following server.ListenAddr()/server.StatusListenerAddr() calls can dereference nil s.listener/s.statusListener and panic if the goroutine hasn't reached initTiDBListener() yet.

Locations:

pkg/server/handler/optimizor/statistics_handler_test.go:66-69
pkg/server/handler/optimizor/statistics_handler_test.go:318-321
pkg/server/handler/optimizor/statistics_handler_test.go:379-382

Impact: Test process crash/flake when multiple servers are started in the same test binary. The PR's fix in plan_replayer_test.go documents this exact hazard, but statistics_handler_test.go still has the unsafe pattern.

Recommendation: Recreate server.RunInGoTestChan = make(chan struct{}) before each server start, as done in plan_replayer_test.go:96-109.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

ti-chi-bot · 2026-01-22T13:06:07Z

@wjhuang2016: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
idc-jenkins-ci-tidb/mysql-test	`4f383f2`	link	true	`/test mysql-test`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wjhuang2016 · 2026-01-22T13:07:38Z

Split into per-case issues/PRs:
- #65732 (br/pkg/utils task register tests) closes #65726
- #65733 (pkg/util/gctuner TestIssue48741) closes #65727
- #65734 (pkg/util TLS tests) closes #65728
- #65735 (pkg/server/handler/optimizor plan replayer tests) closes #65729
- #65736 (pkg/server/handler/optimizor stats handler test) closes #65730
- #65737 (tests/realtikvtest/txntest staleness txn test) closes #65731

Copilot AI review requested due to automatic review settings January 20, 2026 08:29

ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 20, 2026

Copilot started reviewing on behalf of wjhuang2016 January 20, 2026 08:29 View session

test: fix flaky unit tests

370a560

wjhuang2016 force-pushed the test/stabilize-flaky-ut branch from 2a6f984 to 370a560 Compare January 20, 2026 08:35

Copilot AI reviewed Jan 20, 2026

View reviewed changes

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Jan 20, 2026

pkg/server, tests/realtikvtest: stabilize flaky tests

88f2806

ti-chi-bot bot added the do-not-merge/needs-triage-completed label Jan 20, 2026

wjhuang2016 added the skip-issue-check Indicates that a PR no need to check linked issue. label Jan 20, 2026

test: reset RunInGoTestChan for stats handler test

4f383f2

Copilot AI review requested due to automatic review settings January 22, 2026 12:41

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Jan 22, 2026

Copilot started reviewing on behalf of wjhuang2016 January 22, 2026 12:42 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

wjhuang2016 closed this Jan 22, 2026
