*: speed up unit tests and fix flakes #65601

wjhuang2016 · 2026-01-16T04:40:19Z

What problem does this PR solve?

Issue Number: close #65600

Problem Summary:

TiDB UT in CI can take hours; local profiling shows that a small number of very slow testcases (plus a few flake amplifiers) dominate the wall time.
Baseline (local full run, tools/bin/ut JUnit timing):
- Nextgen (NEXT_GEN=1): 51 testcases >10s, 3 testcases hit 10m timeout; run stage ~16m06s.
- Legacy: 21 testcases >10s; run stage ~7m10s.
After this PR (local full run):
- Nextgen: 0 testcases >10s (and 0 timeouts); run stage 5m27s; longest testcase 9.392s.
- Legacy: 0 testcases >10s; run stage 7m03s; longest testcase 9.610s.
Optimized slow tests: 61 unique >10s testcases (union of nextgen+legacy baselines) eliminated (now 0).
Flake amplifiers fixed: 16 high-impact flaky tests stabilized (timeouts, background leaks, races).
Bug fixes included: GET_LOCK() timeout behavior (MySQL compatibility) and @@global.tidb_slow_log_max_per_sec read stability.
Latest CI timing (Jenkins, 2026-01-16):
- pull_unit_test_next_gen#8431: total 31m26s; Test stage 29m57s; Bazel 28m55s (coverage merge 10m52s, non-coverage 18m03s); executed 187/507 tests.
- ghpr_unit_test#48886: total 44m21s; Test stage 42m51s; Bazel 41m51s (coverage merge 10m41s, non-coverage 31m10s); executed 195/507 tests.
- Note: Bazel coverage merge is a fixed ~10-11m cost (5796 tracefiles) even when most tests are remote-cached; older “~3h” next-gen runs were dominated by Kubernetes pod pending rather than test execution.
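
For reference, the ">10s" counts above are derived from per-testcase JUnit timing. The following is a minimal sketch of that kind of ranking, assuming the report uses the common `<testsuites>/<testsuite>/<testcase time="...">` layout; the sample report, the `slowCases` helper, and the struct names are hypothetical and are not the script used for this PR.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"sort"
	"strings"
)

// junitSuites mirrors the common JUnit XML layout (an assumption about the
// report shape): <testsuites><testsuite><testcase .../></testsuite></testsuites>.
type junitSuites struct {
	Suites []struct {
		Cases []junitCase `xml:"testcase"`
	} `xml:"testsuite"`
}

type junitCase struct {
	Class   string  `xml:"classname,attr"`
	Name    string  `xml:"name,attr"`
	Seconds float64 `xml:"time,attr"`
}

// slowCases returns the testcases slower than threshold seconds, slowest first.
func slowCases(report string, threshold float64) ([]junitCase, error) {
	var doc junitSuites
	if err := xml.NewDecoder(strings.NewReader(report)).Decode(&doc); err != nil {
		return nil, err
	}
	var slow []junitCase
	for _, s := range doc.Suites {
		for _, c := range s.Cases {
			if c.Seconds > threshold {
				slow = append(slow, c)
			}
		}
	}
	sort.Slice(slow, func(i, j int) bool { return slow[i].Seconds > slow[j].Seconds })
	return slow, nil
}

// sampleReport is made-up data for illustration only.
const sampleReport = `<testsuites>
  <testsuite name="pkg/a">
    <testcase classname="pkg/a" name="TestFast" time="0.42"/>
    <testcase classname="pkg/a" name="TestSlow" time="151.3"/>
  </testsuite>
  <testsuite name="pkg/b">
    <testcase classname="pkg/b" name="TestMedium" time="12.5"/>
  </testsuite>
</testsuites>`

func main() {
	slow, err := slowCases(sampleReport, 10)
	if err != nil {
		panic(err)
	}
	for _, c := range slow {
		fmt.Printf("%8.3fs  %s.%s\n", c.Seconds, c.Class, c.Name)
	}
}
```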

What changed and how does it work?

Scope / methodology

Profile with local full runs via tools/bin/ut (JUnit XML) and rank by per-testcase time.
Remove “single very slow testcase” patterns: oversized loops/data, exponential enumeration, fixed sleeps.
Keep determinism and stability: replace sleeps with bounded waits (require.Eventually), cap concurrency in tests, gate stress loops behind testflag.Long().
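
To illustrate the "bounded wait instead of fixed sleep" pattern, here is a stdlib-only sketch in the spirit of `require.Eventually`. The `waitUntil` helper is hypothetical and only approximates testify's behavior (for example, it checks the condition once before the first tick); the tests in this PR use `require.Eventually` itself.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitUntil polls cond every tick until it returns true or timeout elapses.
// It returns as soon as the condition holds, so the common case is fast,
// while the timeout still bounds the worst case.
func waitUntil(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

func main() {
	var done atomic.Bool

	// Simulated background work that finishes at an unpredictable time.
	go func() {
		time.Sleep(20 * time.Millisecond)
		done.Store(true)
	}()

	// Instead of time.Sleep(2 * time.Second), wait only as long as needed.
	ok := waitUntil(done.Load, 2*time.Second, 5*time.Millisecond)
	fmt.Println("background work finished:", ok)
}
```

A fixed sleep always pays its full duration and still fails when the machine is slower than expected; a bounded wait pays only the time actually needed and fails only when the upper bound is exceeded.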

Key >10s offenders eliminated (local)

pkg/planner/core/tests/extractor: avoid exponential predicate cartesian-product enumeration (CI ~190s → local ~2–3s).
pkg/ttl/ttlworker.TestCancelWhileScan: batch insert instead of 10k single-row inserts (CI ~151s → local ~2s).
pkg/executor/test/admintest check-table locate tests: reduce rows/iterations and batch KV ops (CI ~130s → local ~2s).
The remaining >10s testcases from phase 2 were further trimmed to <10s:
- pkg/server/tests/commontest.TestTopSQLCPUProfile: remove redundant cases and skip normalized-plan decoding unless needed.
- pkg/ddl/ingest.TestAddGlobalIndexInIngest: shrink partitions/rows and cap DML injections.
- pkg/dxf/importinto.TestSubmitTaskNextgen: reduce etcd integration cluster size.
- pkg/executor/test/executor.TestQueryWithKill: bound runtime in normal UT.
- pkg/table/tables.TestCacheTableBasicReadAndWrite: make cache-table lease test deterministic and remove fixed sleeps.
- pkg/session/test/bootstraptest upgrade tests: remove redundant re-bootstrap matrices.
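
For the extractor tests, the saving comes from not enumerating the full cartesian product of predicates. The sketch below shows one possible reading of "single-column coverage plus pairwise combinations" and compares its case count to the full product. The function names and sample predicates are hypothetical, and the actual test code may choose its pairs differently.

```go
package main

import "fmt"

// singleAndPairwise builds predicate sets for testing: every predicate on its
// own, plus every pair of predicates taken from two different columns.
func singleAndPairwise(predsByColumn [][]string) [][]string {
	var cases [][]string
	for i, preds := range predsByColumn {
		for _, p := range preds {
			cases = append(cases, []string{p})
			for _, others := range predsByColumn[i+1:] {
				for _, q := range others {
					cases = append(cases, []string{p, q})
				}
			}
		}
	}
	return cases
}

// fullProductSize counts the non-empty cases of the full enumeration, where
// each column contributes either no predicate or exactly one of its predicates.
func fullProductSize(predsByColumn [][]string) int {
	n := 1
	for _, preds := range predsByColumn {
		n *= len(preds) + 1
	}
	return n - 1 // drop the combination with no predicate at all
}

func main() {
	// Made-up predicates for illustration only.
	cols := [][]string{
		{"a = 1", "a > 1", "a in (1, 2)"},
		{"b = 'x'", "b like 'x%'", "b is null"},
		{"c < 10", "c between 1 and 5", "c != 3"},
		{"d = 0", "d >= 0", "d is not null"},
	}
	fmt.Println("single + pairwise cases:", len(singleAndPairwise(cols)))
	fmt.Println("full product cases:     ", fullProductSize(cols))
}
```

With 4 columns of 3 predicates each this yields 66 cases instead of 255, and the gap widens quickly because the pairwise count grows quadratically in the number of columns while the full product grows exponentially.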

Flake amplifiers fixed (16)

pkg/executor TestFlashbackClusterWithManyDBs
pkg/planner/core/casetest/instanceplancache TestInstancePlanCacheConcurrencySysbench
pkg/ddl/tests/partition TestMultiSchemaPartitionByGlobalIndex
pkg/objstore/objectio TestNewCompressReader
pkg/statistics/handle/cache/internal/lfu TestMemoryControlWithUpdate
pkg/store/gcworker TestLeaderTick
pkg/ddl TestBackfillingSchedulerGlobalSortMode
pkg/server/handler/optimizor TestPlanReplayerWithMultiForeignKey
pkg/executor/test/analyzetest TestKillAutoAnalyze
pkg/util/memory TestRelease
pkg/util/gctuner TestIssue48741
pkg/store/gcworker TestGCWithPendingTxn
pkg/expression/integration_test TestGetLock
pkg/store TestStoreSwitchPeer
pkg/util TestVerifyCommonNameAndRotate
br/pkg/utils TestTaskRegisterFailedGrant (and TestTaskRegisterFailedReput)

Bug fixes (non-test behavior)

pkg/expression/builtin_miscellaneous.go: GET_LOCK() returns 0 (not error) on tikverr.ErrLockWaitTimeout, aligning with MySQL behavior.
pkg/sessionctx/variable/sysvar.go: @@global.tidb_slow_log_max_per_sec returns 0 when the limiter is unlimited (rate.Inf).

Change details (grouped by directory)

tools/check:
- tools/check/ut.go: add ut run --timeout <duration> to override per-test -test.timeout; improve cover profile merge error handling and always clean up temp dirs.
pkg/bindinfo:
- pkg/bindinfo/tests/bind_usage_info_test.go: reduce fixed sleeps and restore global knobs/sysvars after the test.
pkg/ddl:
- pkg/ddl/ingest/integration_test.go: shrink ingest DDL scenarios; cap concurrent DML injections; gate heavier loops behind testflag.Long().
- pkg/ddl/backfilling.go, pkg/ddl/job_worker.go: make delay-related failpoints accept configurable values to avoid hard-coded sleeps.
- pkg/ddl/tests/*: reduce polling rounds in non-long UT; add bounded waits for async delete-range completion; tighten teardown order.
pkg/domain:
- pkg/domain/crossks/cross_ks_test.go: reduce etcd integration cluster size and avoid expensive shared-client patterns.
pkg/dxf:
- pkg/dxf/importinto/job_testkit_test.go: reduce etcd integration cluster size for nextgen keyspace tests.
- pkg/dxf/framework/integrationtests/*: reduce randomized case counts / tighten waits.
pkg/executor:
- Analyze/checksum stability: finalize pending jobs on error, wait for sys process IDs before killing, bound waits.
- Heavy-loop tests: reduce data/iterations; gate stress loops behind testflag.Long().
- Kill/cancel tests: bound runtime (e.g. TestQueryWithKill).
- Spilling/interrupt/failpoint tests: reduce input size and repetition; replace fixed sleeps with bounded waits.
pkg/expression:
- GET_LOCK() timeout behavior fix (MySQL compatibility).
pkg/infoschema:
- Cluster table privilege tests use EXPLAIN to avoid executing heavy cluster queries.
pkg/objstore:
- pkg/objstore/gcs.go: allow unauthenticated GCS client for localhost endpoints in intest (avoid default-cred dependency in tests).
- pkg/objstore/gcs_test.go: implement expected token endpoints in httptest server to keep the test bounded.
- pkg/objstore/azblob_test.go: close writer and reuse checksum to avoid resource leak and reduce repeated work.
- pkg/objstore/objectio/writer_test.go: close resources and remove fragile goroutine-count assertions.
pkg/planner:
- Extractor tests: replace exponential enumeration with “single-column full coverage + pairwise combinations”.
- MPP/hint/plan-cache concurrency cases: cap case count and make concurrency tests retryable / bounded.
pkg/server:
- Reduce sleeps/data in max-exec-time tests.
- Make audit-plugin retry test bounded (cap concurrency & DB pool).
- TopSQL tests: reduce redundant cases; shorten reporting interval; use random ports and fail fast in proxy-protocol tests.
- Stabilize binding reload checks via eventual waits.
pkg/session:
- Bootstrap/upgrade suites: remove redundant re-bootstrap matrices; batch dist-task state coverage.
pkg/sessiontxn:
- Simplify fragile timing assertions in RC/TSO optimize tests.
pkg/statistics:
- Stabilize async cache writes and stats GC/ddl-event waits under -cover.
pkg/store:
- Reduce coprocessor backoff in tests via configurable failpoints; add bounded retries for occasional TiFlash slow paths.
- GC worker tests: stop TTL job manager in bootstrap and bound GC completion waits.
pkg/table:
- Cache table tests: add failpoint-controlled write-lease duration and replace fixed sleeps with require.Eventually.
br/pkg/utils:
- Make task register tests deterministic and fast: avoid side effects under failpoints, replace fixed sleeps with require.Eventually, and add failpoint-controlled retry interval.
pkg/util/gctuner:
- Stabilize memory limit tests by waiting for the fallback memory-limit update to take effect before asserting.
pkg/timer, pkg/ttl, pkg/util:
- Replace fixed sleeps with eventual waits; reduce loop counts; remove exact assertions that are fragile across allocators/ABIs; shorten integration-style UT workers.
- pkg/util/security_test.go: stabilize TestVerifyCommonNameAndRotate by validating the server-side handshake error log instead of brittle client-side error strings.
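
Several entries above gate heavier loops behind `testflag.Long()`. As a rough sketch of that idea, the example below picks a small loop count by default and the full count only when a long-run switch is set. The `-long` flag and the `rounds` helper are hypothetical stand-ins, not the `testflag` package's actual implementation.

```go
package main

import (
	"flag"
	"fmt"
)

// longRun is a hypothetical stand-in for the switch behind testflag.Long().
var longRun = flag.Bool("long", false, "run the full stress workload")

// rounds picks the loop count: the quick count for normal UT runs and the
// full count only when long-running tests were explicitly requested.
func rounds(long bool, quick, full int) int {
	if long {
		return full
	}
	return quick
}

func main() {
	flag.Parse()

	n := rounds(*longRun, 20, 10000)
	sum := 0
	for i := 0; i < n; i++ {
		sum += i // placeholder for one stress iteration
	}
	fmt.Printf("ran %d rounds (long=%v), sum=%d\n", n, *longRun, sum)
}
```

The default run keeps the same code path covered with a small loop count, while the full stress workload stays available on demand.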

Local timing artifacts (for reference)

Nextgen baseline: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_nextgen_20260114_122930/parsed/slow_testcases.tsv
Nextgen current: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_nextgen_20260116_111713/parsed/slow_testcases.tsv
Legacy baseline: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_legacy_20260115_165149/parsed/slow_testcases.tsv
Legacy current: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_legacy_20260116_103514/parsed/slow_testcases.tsv

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix `GET_LOCK()` to return 0 on lock wait timeout and align with MySQL behavior. Also make `tidb_slow_log_max_per_sec` return 0 when unlimited.

codecov · 2026-01-16T05:16:39Z

Codecov Report

❌ Patch coverage is 65.80087% with 79 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.4026%. Comparing base (f8627fd) to head (abe1ade).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #65601        +/-   ##
================================================
+ Coverage   77.8383%   79.4026%   +1.5642%     
================================================
  Files          1978       1908        -70     
  Lines        542180     531165     -11015     
================================================
- Hits         422024     421759       -265     
+ Misses       118496     107950     -10546     
+ Partials       1660       1456       -204

Flag	Coverage Δ
integration	`47.6671% <7.3913%> (-0.5223%)`	⬇️
unit	`76.6824% <65.8008%> (+0.2159%)`	⬆️


Components	Coverage Δ
dumpling	`56.7974% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`66.3545% <76.3157%> (+5.2406%)`	⬆️


wjhuang2016 · 2026-01-16T06:11:35Z

/test tidb_parser_test

ti-chi-bot · 2026-01-16T06:11:39Z

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build

/test check-dev

/test check-dev2

/test mysql-test

/test pull-br-integration-test

/test pull-build-next-gen

/test pull-integration-ddl-test

/test pull-integration-e2e-test

/test pull-integration-realcluster-test-next-gen

/test pull-lightning-integration-test

/test pull-mysql-client-test

/test pull-mysql-client-test-next-gen

/test pull-unit-test-ddlv1

/test pull-unit-test-next-gen

/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test

/test pull-br-integration-test-next-gen

/test pull-check-deps

/test pull-common-test

/test pull-e2e-test

/test pull-error-log-review

/test pull-integration-common-test

/test pull-integration-copr-test

/test pull-integration-ddl-test-next-gen

/test pull-integration-e2e-test-next-gen

/test pull-integration-jdbc-test

/test pull-integration-mysql-test

/test pull-integration-nodejs-test

/test pull-integration-python-orm-test

/test pull-mysql-test-next-gen

/test pull-sqllogic-test

/test pull-tiflash-integration-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build

pingcap/tidb/ghpr_check

pingcap/tidb/ghpr_check2

pingcap/tidb/ghpr_mysql_test

pingcap/tidb/ghpr_unit_test

pingcap/tidb/pull_br_integration_test

pingcap/tidb/pull_build_next_gen

pingcap/tidb/pull_integration_ddl_test

pingcap/tidb/pull_integration_e2e_test

pingcap/tidb/pull_integration_realcluster_test_next_gen

pingcap/tidb/pull_lightning_integration_test

pingcap/tidb/pull_mysql_client_test

pingcap/tidb/pull_mysql_client_test_next_gen

pingcap/tidb/pull_unit_test_ddlv1

pingcap/tidb/pull_unit_test_next_gen

pull-check-deps

pull-error-log-review

Details

In response to this:

/test tidb_parser_test


ti-chi-bot · 2026-01-16T10:42:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft, gmhdbjd, leavrth, terry1purcell, xuhuaiyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-01-16T11:32:38Z

@wjhuang2016: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test-ddlv1	`abe1ade`	link	true	`/test pull-unit-test-ddlv1`


wjhuang2016 added 5 commits January 16, 2026 12:26

bazel: update BUILD files for analyze memory assert

9315cbd

build: sync bazel shard_count for memorycontrol test

6f8d585

*: speed up unit tests and fix flakes

17b48bd

docs: move UT summary to PR description

7e4cd10

docs: move UT timing details to PR description

7cfe45e

ti-chi-bot bot added the release-note, do-not-merge/needs-triage-completed, size/XXL, component/statistics, and sig/planner labels Jan 16, 2026

build: fix bazel deps for lockstore test

d402d29

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Jan 16, 2026

wjhuang2016 added 2 commits January 16, 2026 13:25

build: sync bazel deps and gofmt

eb4f21a

test: gofmt updated test files

8881688

wjhuang2016 added 3 commits January 16, 2026 16:16

make: run Bazel UT without remote cache

8c63ae1

make: make Bazel no-remote-cache flag opt-in

e10d621

test: fix flaky unit tests

abe1ade
