@wjhuang2016
Member

@wjhuang2016 wjhuang2016 commented Jan 20, 2026

What problem does this PR solve?

Issue Number: close #65673, close #65679

Problem Summary:

Unit tests were flaky due to timing assumptions, async state propagation, and shared global state.

What changed and how does it work?

Case 1: br/pkg/utils/register_test.go (task register keepalive)

Why it was flaky:

  • The tests used time.Sleep(RegisterRetryInternal) while etcd lease keepalive/retry is async and timing-sensitive.
  • GetImportTasksFrom can transiently fail during TTL query (e.g. requested lease not found) when the lease is revoked/expired.

Fix:

  • Add taskRegister.sleepRetryInterval and use it in retry paths, with a failpoint (brie-task-register-retry-interval) to shorten retry sleeps in tests.
  • Replace fixed sleeps with require.Eventually checks.
  • Make the tests tolerant to requested lease not found on Close.

Case 2: pkg/util/gctuner/memory_limit_tuner_test.go (TestIssue48741)

Why it was flaky:

  • debug.SetMemoryLimit updates and GC triggering/adjusting are not strictly synchronous; 500ms timeouts were too aggressive on slower hosts.

Fix:

  • Relax require.Eventually timeouts.
  • Replace strict instantaneous require.Equal checks on debug.SetMemoryLimit(-1) with require.Eventually.

Case 3: pkg/util/security_test.go (TestVerifyCommonNameAndRotate)

Why it was flaky:

  • The client-side TLS error string is not stable across Go versions/handshake timing.
  • Using the global http.DefaultServeMux can cause cross-test interference.

Fix:

  • Use a per-test http.ServeMux.
  • Capture server-side TLS handshake errors via http.Server.ErrorLog and assert on the server-side error message (client certificate authentication failed).
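The pattern can be sketched with `net/http/httptest`: a per-test mux, a goroutine-safe buffer behind `http.Server.ErrorLog`, and a polled assertion on the server's log. This is an assumption-laden sketch rather than the PR's test: `lockedBuffer` and `handshakeErrorLogged` are hypothetical names, the failure here comes from a client that presents no certificate, and it matches the generic `TLS handshake error` prefix that `net/http` logs rather than the message asserted in the PR.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// lockedBuffer is a goroutine-safe sink for the server's error log.
type lockedBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *lockedBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *lockedBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

// handshakeErrorLogged starts a TLS server that requires a client
// certificate, connects without one, and reports whether the server
// logged the handshake failure.
func handshakeErrorLogged() bool {
	mux := http.NewServeMux() // per-test mux, not http.DefaultServeMux
	mux.HandleFunc("/status", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})

	logs := &lockedBuffer{}
	srv := httptest.NewUnstartedServer(mux)
	srv.Config.ErrorLog = log.New(logs, "", 0)
	srv.TLS = &tls.Config{ClientAuth: tls.RequireAndVerifyClientCert}
	srv.StartTLS()
	defer srv.Close()

	// The client trusts the test server but presents no certificate.
	resp, err := srv.Client().Get(srv.URL + "/status")
	if err == nil {
		resp.Body.Close()
		return false
	}

	// The server logs from its own goroutine, so poll rather than
	// asserting immediately after the client call returns.
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		if strings.Contains(logs.String(), "TLS handshake error") {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

func main() {
	fmt.Println("server logged handshake failure:", handshakeErrorLogged())
}
```

The client-side error is only checked for being non-nil; its text is deliberately ignored, since that is the part described above as unstable.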

Case 4: pkg/server/handler/optimizor/plan_replayer_test.go (plan replayer + bindings)

Why it was flaky:

  • server.RunInGoTestChan is a global channel and is closed after the first server starts; subsequent server starts in the same test binary could race if the channel is not recreated.
  • After PLAN REPLAYER LOAD, the session default DB can change; unqualified queries may hit the wrong schema and miss the intended binding.
  • Binding creation/reload and cache population are async; immediate @@last_plan_from_binding checks can be racy.
  • Dump-created global bindings can interfere with later LOAD in the same server if not cleaned up.

Fix:

  • Recreate server.RunInGoTestChan before server starts so <-RunInGoTestChan reliably waits for the current instance.
  • Explicitly USE planReplayer after PLAN REPLAYER LOAD and enable plan baselines.
  • Use require.Eventually for the binding effect check.
  • Adjust binding cache quota during the test and restore it afterward.
  • Drop the dump-created global binding to avoid later interference.
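The channel hazard behind the first bullet is easy to reproduce in isolation: a receive on a closed channel never blocks, so a barrier that was closed by an earlier start synchronizes nothing. The sketch below uses hypothetical stand-ins (`readyCh`, `fakeServer`, `startServer`, a made-up address) and does not reflect the real server's internals.

```go
package main

import (
	"fmt"
	"time"
)

// readyCh is a hypothetical stand-in for a global "server started" channel.
var readyCh = make(chan struct{})

// fakeServer models a server whose address is only set partway through run.
type fakeServer struct{ addr string }

func (s *fakeServer) run(ready chan struct{}) {
	time.Sleep(20 * time.Millisecond) // simulated listener setup
	s.addr = "127.0.0.1:4000"
	close(ready) // publishes addr to whoever is waiting on ready
}

// startServer recreates the barrier before every start, then waits on it.
func startServer() string {
	readyCh = make(chan struct{})
	s := &fakeServer{}
	go s.run(readyCh)
	<-readyCh
	return s.addr
}

// receiveReturnsImmediately reports whether a receive on ch would not block.
func receiveReturnsImmediately(ch chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(startServer())                      // first start
	fmt.Println(receiveReturnsImmediately(readyCh)) // closed barrier no longer blocks
	fmt.Println(startServer())                      // second start still synchronizes
}
```

Without the `make` call at the top of `startServer`, the second start would read `s.addr` before `run` had set it.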

Case 5: pkg/server/handler/optimizor/statistics_handler_test.go (TestDumpStatsAPI)

Why it was flaky:

  • Same class as Case 4: waiting on a previously-closed global RunInGoTestChan can lose the startup barrier and race on reading ListenAddr/StatusListenerAddr.

Fix:

  • Recreate server.RunInGoTestChan before starting the server.

Case 6: tests/realtikvtest/txntest/stale_read_test.go (TestValidateReadOnlyInStalenessTransaction)

Why it was flaky:

  • AS OF TIMESTAMP NOW() can resolve to a timestamp that is too new when safeTS advancement or scheduling lags; some environments need a small time buffer for stale-read transactions to be stable.

Fix:

  • Use a slightly older timestamp expression: NOW(3) - INTERVAL 1 SECOND.

Check List

Tests

  • Unit test
    • make failpoint-enable && go test ./br/pkg/utils -run '^TestTaskRegister' -count=10 -tags=intest
    • make failpoint-enable && go test ./pkg/util/gctuner -run '^TestIssue48741$' -count=5 -tags=intest
    • go test ./pkg/util -run '^(TestInvalidTLS|TestVerifyCommonNameAndRotate|TestTLSVersion|TestCA)$' -count=10 -tags=intest
    • go test ./pkg/server/handler/optimizor -run '^TestPlanReplayerWithMultiForeignKey$' -count=50 -tags=intest -timeout 30m
  • Integration test
    • go test ./tests/realtikvtest/txntest -run '^TestValidateReadOnlyInStalenessTransaction$' -count=10 -tags=intest -timeout 60m
    • go test ./tests/realtikvtest/addindextest -run '^TestCreateNonUniqueIndex$' -count=30 -tags=intest -timeout 240m
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Copilot AI review requested due to automatic review settings January 20, 2026 08:29
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 20, 2026
@ti-chi-bot

ti-chi-bot bot commented Jan 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign leavrth for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wjhuang2016 wjhuang2016 force-pushed the test/stabilize-flaky-ut branch from 2a6f984 to 370a560 Compare January 20, 2026 08:35

Copilot AI left a comment

Pull request overview

This PR stabilizes flaky unit tests across three packages by replacing timing assumptions with proper synchronization primitives and making tests more deterministic.

Changes:

  • Replaced fixed sleep calls with require.Eventually for asynchronous operations in task register and memory tuner tests
  • Added failpoint support for controlling retry intervals in task register to enable deterministic testing
  • Changed TLS rotation tests to check server-side error logs instead of unstable client-side error messages
  • Introduced per-test HTTP mux instead of global state to avoid test interference

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File: Description

  • br/pkg/utils/register.go: Added sleepRetryInterval helper with failpoint support and refactored error injection to prevent overwriting real errors
  • br/pkg/utils/register_test.go: Replaced fixed sleeps with require.Eventually and added retry interval failpoint for deterministic testing
  • pkg/util/gctuner/memory_limit_tuner_test.go: Replaced strict equality checks with require.Eventually and increased timeouts for slower environments
  • pkg/util/security_test.go: Introduced thread-safe error logging buffer and switched from client-side to server-side TLS error checking

@wjhuang2016
Member Author

/check-issue-triage-complete

@codecov

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 76.31579% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.5654%. Comparing base (8e961f7) to head (4f383f2).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65674        +/-   ##
================================================
- Coverage   77.8195%   77.5654%   -0.2542%     
================================================
  Files          1989       1919        -70     
  Lines        543412     533539      -9873     
================================================
- Hits         422881     413842      -9039     
- Misses       118872     119593       +721     
+ Partials       1659        104      -1555     
Flag Coverage Δ
integration 41.5227% <0.0000%> (-6.6647%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7696% <76.3157%> (-12.2794%) ⬇️

@wjhuang2016
Member Author

/test check-issue-triage-complete

@ti-chi-bot

ti-chi-bot bot commented Jan 20, 2026

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build
/test check-dev
/test check-dev2
/test mysql-test
/test pull-br-integration-test
/test pull-build-next-gen
/test pull-integration-ddl-test
/test pull-integration-e2e-test
/test pull-integration-realcluster-test-next-gen
/test pull-lightning-integration-test
/test pull-mysql-client-test
/test pull-mysql-client-test-next-gen
/test pull-unit-test-ddlv1
/test pull-unit-test-next-gen
/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test
/test pull-br-integration-test-next-gen
/test pull-check-deps
/test pull-common-test
/test pull-e2e-test
/test pull-error-log-review
/test pull-integration-common-test
/test pull-integration-copr-test
/test pull-integration-ddl-test-next-gen
/test pull-integration-e2e-test-next-gen
/test pull-integration-jdbc-test
/test pull-integration-mysql-test
/test pull-integration-nodejs-test
/test pull-integration-python-orm-test
/test pull-mysql-test-next-gen
/test pull-sqllogic-test
/test pull-tiflash-integration-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build
pingcap/tidb/ghpr_check
pingcap/tidb/ghpr_check2
pingcap/tidb/ghpr_mysql_test
pingcap/tidb/ghpr_unit_test
pingcap/tidb/pull_br_integration_test
pingcap/tidb/pull_build_next_gen
pingcap/tidb/pull_integration_ddl_test
pingcap/tidb/pull_integration_e2e_test
pingcap/tidb/pull_integration_realcluster_test_next_gen
pingcap/tidb/pull_mysql_client_test
pingcap/tidb/pull_mysql_client_test_next_gen
pingcap/tidb/pull_unit_test_next_gen
pull-check-deps
pull-error-log-review
Details

In response to this:

/test check-issue-triage-complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tiprow

tiprow bot commented Jan 20, 2026

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow
/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test check-issue-triage-complete

@wjhuang2016
Member Author

/test mysql-test

@tiprow

tiprow bot commented Jan 20, 2026

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow
/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test mysql-test

@wjhuang2016
Member Author

/retest

@wjhuang2016 wjhuang2016 added the skip-issue-check Indicates that a PR no need to check linked issue. label Jan 20, 2026
@IANTHEREAL
Contributor

@pantheon-bot please take a comprehensive bug hunt review

@pantheon-ai

pantheon-ai bot commented Jan 22, 2026

Bug Hunt Review Results

Found 3 P2 issues (test quality/reliability):

P2-1: Unsafe require.Eventually with MustExec/MustQuery calls

Problem: require.Eventually runs its condition in a goroutine. When tk.MustQuery/tk.MustExec are called inside the condition, transient SQL errors trigger FailNow() off the test goroutine, causing the condition goroutine to exit without returning a boolean. This results in misleading timeouts (5-10s) with "Condition never satisfied" instead of the actual SQL error.

Locations:

  • pkg/executor/test/splittest/split_table_test.go:55-56
  • pkg/testkit/testkit.go:235-236

Impact: Makes flaky tests harder to debug; test failures show generic timeout rather than root cause.

Recommendation: Use require.EventuallyWithT or assert.CollectT, or refactor conditions to pure boolean checks without Must* calls.


P2-2: require.Eventually conditions swallow unexpected etcd errors

Problem: In br/pkg/utils/register_test.go, the require.Eventually conditions only check for "requested lease not found" errors via string matching and return false for all other errors. This means unexpected etcd errors (e.g., network failures, auth errors) are silently discarded and manifest as generic timeouts instead of actionable failure messages.

Locations:

  • br/pkg/utils/register_test.go:114 (TestTaskRegisterFailedGrant)
  • br/pkg/utils/register_test.go:167 (TestTaskRegisterFailedReput)

Impact: Reduces test failure diagnostics quality; developers see timeout instead of actual error.

Recommendation: Capture and assert on the last error, or use EventuallyWithT/Eventuallyf; consider using typed error matching (rpctypes.ErrLeaseNotFound) instead of string matching.


P2-3: statistics_handler_test.go nil-deref panic risk with global RunInGoTestChan

Problem: server.RunInGoTestChan is a global channel closed on first (*server.Server).Run() in the test binary. Subsequent server starts in the same test process wait on an already-closed channel, so <-server.RunInGoTestChan returns immediately without synchronizing with the new server's listener initialization. The immediately-following server.ListenAddr()/server.StatusListenerAddr() calls can dereference nil s.listener/s.statusListener and panic if the goroutine hasn't reached initTiDBListener() yet.

Locations:

  • pkg/server/handler/optimizor/statistics_handler_test.go:66-69
  • pkg/server/handler/optimizor/statistics_handler_test.go:318-321
  • pkg/server/handler/optimizor/statistics_handler_test.go:379-382

Impact: Test process crash/flake when multiple servers are started in the same test binary. The PR's fix in plan_replayer_test.go documents this exact hazard, but statistics_handler_test.go still has the unsafe pattern.

Recommendation: Recreate server.RunInGoTestChan = make(chan struct{}) before each server start, as done in plan_replayer_test.go:96-109.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@ti-chi-bot

ti-chi-bot bot commented Jan 22, 2026

@wjhuang2016: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • Test name: idc-jenkins-ci-tidb/mysql-test
  • Commit: 4f383f2
  • Details: link
  • Required: true
  • Rerun command: /test mysql-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@wjhuang2016
Member Author

Split into per-case issues/PRs:

  • #65732 (br/pkg/utils task register tests) closes #65726
  • #65733 (pkg/util/gctuner TestIssue48741) closes #65727
  • #65734 (pkg/util TLS tests) closes #65728
  • #65735 (pkg/server/handler/optimizor plan replayer tests) closes #65729
  • #65736 (pkg/server/handler/optimizor stats handler test) closes #65730
  • #65737 (tests/realtikvtest/txntest staleness txn test) closes #65731

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. skip-issue-check Indicates that a PR no need to check linked issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Flaky tests: plan replayer binding check & staleness txn timestamp
  • Flaky unit tests: stabilize task register, TLS, and memory limit tuner

2 participants