@wjhuang2016 wjhuang2016 commented Jan 16, 2026

What problem does this PR solve?

Issue Number: close #65600

Problem Summary:

  • TiDB unit tests (UT) in CI can take hours; local profiling shows that a small number of very slow testcases (plus a few flake amplifiers) dominate the wall time.
  • Baseline (local full run, tools/bin/ut JUnit timing):
    • Nextgen (NEXT_GEN=1): 51 testcases >10s, 3 testcases hit 10m timeout; run stage ~16m06s.
    • Legacy: 21 testcases >10s; run stage ~7m10s.
  • After this PR (local full run):
    • Nextgen: 0 testcases >10s (and 0 timeouts); run stage 5m27s; longest testcase 9.392s.
    • Legacy: 0 testcases >10s; run stage 7m03s; longest testcase 9.610s.
  • Optimized slow tests: 61 unique >10s testcases (union of nextgen+legacy baselines) eliminated (now 0).
  • Flake amplifiers fixed: 16 high-impact flaky tests stabilized (timeouts, background leaks, races).
  • Bug fixes included: GET_LOCK() timeout behavior (MySQL compatibility) and @@global.tidb_slow_log_max_per_sec read stability.
  • Latest CI timing (Jenkins, 2026-01-16):
    • pull_unit_test_next_gen#8431: total 31m26s; Test stage 29m57s; Bazel 28m55s (coverage merge 10m52s, non-coverage 18m03s); executed 187/507 tests.
    • ghpr_unit_test#48886: total 44m21s; Test stage 42m51s; Bazel 41m51s (coverage merge 10m41s, non-coverage 31m10s); executed 195/507 tests.
    • Note: Bazel coverage merge is a fixed ~10-11m cost (5796 tracefiles) even when most tests are remote-cached; older “~3h” next-gen runs were dominated by Kubernetes pod pending rather than test execution.

What changed and how does it work?

Scope / methodology

  • Profile with local full runs via tools/bin/ut (JUnit XML) and rank by per-testcase time.
  • Remove “single very slow testcase” patterns: oversized loops/data, exponential enumeration, fixed sleeps.
  • Keep determinism and stability: replace sleeps with bounded waits (require.Eventually), cap concurrency in tests, gate stress loops behind testflag.Long().
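The "replace sleeps with bounded waits" item above can be sketched as follows. This is a minimal, stdlib-only illustration and not the code in this PR: `eventually` and `waitForFlag` are hypothetical helpers that mimic the polling behavior of testify's `require.Eventually`, which is what the real tests use.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or waitFor elapses.
// Hypothetical stand-in for the bounded-wait pattern; real tests would call
// require.Eventually from testify instead of hand-rolling this loop.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// waitForFlag starts a background job that sets a flag after delayMs, then
// waits for the flag for at most boundMs. A fixed sleep would always pay the
// full bound; the polling wait returns as soon as the flag is observed.
func waitForFlag(delayMs, boundMs int) bool {
	var done atomic.Bool
	go func() {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		done.Store(true)
	}()
	return eventually(done.Load, time.Duration(boundMs)*time.Millisecond, time.Millisecond)
}

func main() {
	fmt.Println("fast job within bound:", waitForFlag(5, 2000))
	fmt.Println("slow job within bound:", waitForFlag(500, 20))
}
```

The bound only limits the worst case: a test that passes returns after roughly the job's real latency, while a hung job fails after the bound instead of stalling the run.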

Key >10s offenders eliminated (local)

  • pkg/planner/core/tests/extractor: avoid exponential predicate cartesian-product enumeration (CI ~190s → local ~2–3s).
  • pkg/ttl/ttlworker.TestCancelWhileScan: batch insert instead of 10k single-row inserts (CI ~151s → local ~2s).
  • pkg/executor/test/admintest check-table locate tests: reduce rows/iterations and batch KV ops (CI ~130s → local ~2s).
  • Remaining >10s testcases in phase 2 were further trimmed to <10s:
    • pkg/server/tests/commontest.TestTopSQLCPUProfile: remove redundant cases and skip normalized-plan decoding unless needed.
    • pkg/ddl/ingest.TestAddGlobalIndexInIngest: shrink partitions/rows and cap DML injections.
    • pkg/dxf/importinto.TestSubmitTaskNextgen: reduce etcd integration cluster size.
    • pkg/executor/test/executor.TestQueryWithKill: bound runtime in normal UT.
    • pkg/table/tables.TestCacheTableBasicReadAndWrite: make cache-table lease test deterministic and remove fixed sleeps.
    • pkg/session/test/bootstraptest upgrade tests: remove redundant re-bootstrap matrices.
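To illustrate why the extractor change helps, here is a rough sketch of the case-count difference between a full cartesian product and the "single-column full coverage + pairwise combinations" strategy described under pkg/planner. The column and predicate lists are made up (`sampleColumns` is hypothetical), and the real test enumerates its own predicates, so treat the numbers as an illustration only.

```go
package main

import "fmt"

// cartesianCount returns how many cases a full cross product would generate
// when one predicate is picked from every column.
func cartesianCount(cols [][]string) int {
	n := 1
	for _, c := range cols {
		n *= len(c)
	}
	return n
}

// reducedCases enumerates every single-column predicate, plus every
// combination of predicates taken from two distinct columns.
func reducedCases(cols [][]string) [][]string {
	var cases [][]string
	for _, c := range cols {
		for _, p := range c {
			cases = append(cases, []string{p})
		}
	}
	for i := 0; i < len(cols); i++ {
		for j := i + 1; j < len(cols); j++ {
			for _, a := range cols[i] {
				for _, b := range cols[j] {
					cases = append(cases, []string{a, b})
				}
			}
		}
	}
	return cases
}

// sampleColumns builds hypothetical predicate lists: nCols columns with
// nPreds equality predicates each.
func sampleColumns(nCols, nPreds int) [][]string {
	cols := make([][]string, nCols)
	for i := range cols {
		for j := 0; j < nPreds; j++ {
			cols[i] = append(cols[i], fmt.Sprintf("c%d = %d", i, j))
		}
	}
	return cols
}

func main() {
	cols := sampleColumns(8, 4)
	fmt.Println("cartesian cases:", cartesianCount(cols))
	fmt.Println("reduced cases:  ", len(reducedCases(cols)))
}
```

With 8 columns of 4 predicates each, the cross product grows as 4^8 while the reduced set grows only quadratically in the number of columns, which is where the wall-time saving comes from.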

Flake amplifiers fixed (16)

  • pkg/executor TestFlashbackClusterWithManyDBs
  • pkg/planner/core/casetest/instanceplancache TestInstancePlanCacheConcurrencySysbench
  • pkg/ddl/tests/partition TestMultiSchemaPartitionByGlobalIndex
  • pkg/objstore/objectio TestNewCompressReader
  • pkg/statistics/handle/cache/internal/lfu TestMemoryControlWithUpdate
  • pkg/store/gcworker TestLeaderTick
  • pkg/ddl TestBackfillingSchedulerGlobalSortMode
  • pkg/server/handler/optimizor TestPlanReplayerWithMultiForeignKey
  • pkg/executor/test/analyzetest TestKillAutoAnalyze
  • pkg/util/memory TestRelease
  • pkg/util/gctuner TestIssue48741
  • pkg/store/gcworker TestGCWithPendingTxn
  • pkg/expression/integration_test TestGetLock
  • pkg/store TestStoreSwitchPeer
  • pkg/util TestVerifyCommonNameAndRotate
  • br/pkg/utils TestTaskRegisterFailedGrant (and TestTaskRegisterFailedReput)

Bug fixes (non-test behavior)

  • pkg/expression/builtin_miscellaneous.go: GET_LOCK() returns 0 (not error) on tikverr.ErrLockWaitTimeout, aligning with MySQL behavior.
  • pkg/sessionctx/variable/sysvar.go: @@global.tidb_slow_log_max_per_sec returns 0 when the limiter is unlimited (rate.Inf).
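The two behavior changes above can be sketched like this. It is a simplified illustration under stated assumptions, not the actual code in builtin_miscellaneous.go or sysvar.go: `errLockWaitTimeout` stands in for tikverr.ErrLockWaitTimeout, `unlimited` stands in for rate.Inf, and `getLockResult` and `slowLogMaxPerSec` are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errLockWaitTimeout is a hypothetical stand-in for tikverr.ErrLockWaitTimeout.
var errLockWaitTimeout = errors.New("lock wait timeout exceeded")

// errConnLost is a hypothetical example of an unrelated failure.
var errConnLost = errors.New("connection lost")

// getLockResult maps the outcome of a lock attempt to a GET_LOCK()-style
// result: 1 when the lock was obtained, 0 (and no error) when the wait timed
// out, and the original error for any other failure.
func getLockResult(lockErr error) (int64, error) {
	switch {
	case lockErr == nil:
		return 1, nil
	case errors.Is(lockErr, errLockWaitTimeout):
		return 0, nil
	default:
		return 0, lockErr
	}
}

// unlimited is an assumed stand-in for the limiter's "no limit" value.
const unlimited = math.MaxFloat64

// slowLogMaxPerSec reports 0 when the limiter is unlimited, and the integer
// limit otherwise.
func slowLogMaxPerSec(limit float64) int {
	if limit == unlimited {
		return 0
	}
	return int(limit)
}

func main() {
	wrapped := fmt.Errorf("acquire lock: %w", errLockWaitTimeout)
	fmt.Println(getLockResult(nil))
	fmt.Println(getLockResult(wrapped))
	fmt.Println(getLockResult(errConnLost))
	fmt.Println(slowLogMaxPerSec(unlimited), slowLogMaxPerSec(100))
}
```

Using `errors.Is` rather than a direct comparison means a timeout that has been wrapped with extra context is still treated as a timeout.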

Change details (grouped by directory)

  • tools/check:

    • tools/check/ut.go: add ut run --timeout <duration> to override per-test -test.timeout; improve cover profile merge error handling and always clean up temp dirs.
  • pkg/bindinfo:

    • pkg/bindinfo/tests/bind_usage_info_test.go: reduce fixed sleeps and restore global knobs/sysvars after the test.
  • pkg/ddl:

    • pkg/ddl/ingest/integration_test.go: shrink ingest DDL scenarios; cap concurrent DML injections; gate heavier loops behind testflag.Long().
    • pkg/ddl/backfilling.go, pkg/ddl/job_worker.go: make delay-related failpoints accept configurable values to avoid hard-coded sleeps.
    • pkg/ddl/tests/*: reduce polling rounds in non-long UT; add bounded waits for async delete-range completion; tighten teardown order.
  • pkg/domain:

    • pkg/domain/crossks/cross_ks_test.go: reduce etcd integration cluster size and avoid expensive shared-client patterns.
  • pkg/dxf:

    • pkg/dxf/importinto/job_testkit_test.go: reduce etcd integration cluster size for nextgen keyspace tests.
    • pkg/dxf/framework/integrationtests/*: reduce randomized case counts / tighten waits.
  • pkg/executor:

    • Analyze/checksum stability: finalize pending jobs on error, wait for sys process IDs before killing, bound waits.
    • Heavy-loop tests: reduce data/iterations; gate stress loops behind testflag.Long().
    • Kill/cancel tests: bound runtime (e.g. TestQueryWithKill).
    • Spilling/interrupt/failpoint tests: reduce input size and repetition; replace fixed sleeps with bounded waits.
  • pkg/expression:

    • GET_LOCK() timeout behavior fix (MySQL compatibility).
  • pkg/infoschema:

    • Cluster table privilege tests use EXPLAIN to avoid executing heavy cluster queries.
  • pkg/objstore:

    • pkg/objstore/gcs.go: allow unauthenticated GCS client for localhost endpoints in intest (avoid default-cred dependency in tests).
    • pkg/objstore/gcs_test.go: implement expected token endpoints in httptest server to keep the test bounded.
    • pkg/objstore/azblob_test.go: close writer and reuse checksum to avoid resource leak and reduce repeated work.
    • pkg/objstore/objectio/writer_test.go: close resources and remove fragile goroutine-count assertions.
  • pkg/planner:

    • Extractor tests: replace exponential enumeration with “single-column full coverage + pairwise combinations”.
    • MPP/hint/plan-cache concurrency cases: cap case count and make concurrency tests retryable / bounded.
  • pkg/server:

    • Reduce sleeps/data in max-exec-time tests.
    • Make audit-plugin retry test bounded (cap concurrency & DB pool).
    • TopSQL tests: reduce redundant cases; shorten reporting interval; use random ports and fail fast in proxy-protocol tests.
    • Stabilize binding reload checks via eventual waits.
  • pkg/session:

    • Bootstrap/upgrade suites: remove redundant re-bootstrap matrices; batch dist-task state coverage.
  • pkg/sessiontxn:

    • Simplify fragile timing assertions in RC/TSO optimize tests.
  • pkg/statistics:

    • Stabilize async cache writes and stats GC/ddl-event waits under -cover.
  • pkg/store:

    • Reduce coprocessor backoff in tests via configurable failpoints; add bounded retries for occasional TiFlash slow paths.
    • GC worker tests: stop TTL job manager in bootstrap and bound GC completion waits.
  • pkg/table:

    • Cache table tests: add failpoint-controlled write-lease duration and replace fixed sleeps with require.Eventually.
  • br/pkg/utils:

    • Make task register tests deterministic and fast: avoid side effects under failpoints, replace fixed sleeps with require.Eventually, and add failpoint-controlled retry interval.
  • pkg/util/gctuner:

    • Stabilize memory limit tests by waiting for the fallback memory-limit update to take effect before asserting.
  • pkg/timer, pkg/ttl, pkg/util:

    • Replace fixed sleeps with eventual waits; reduce loop counts; remove exact assertions that are fragile across allocator/ABI differences; shorten integration-style UT workers.
    • pkg/util/security_test.go: stabilize TestVerifyCommonNameAndRotate by validating the server-side handshake error log instead of brittle client-side error strings.
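Several items above cap concurrency in tests (for example the DML injections in pkg/ddl and the audit-plugin retry test in pkg/server). A minimal sketch of one common way to do that, using a buffered channel as a semaphore, is shown below. `runCapped` and `sumSquares` are hypothetical helpers written for this illustration and are not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runCapped runs n jobs with at most limit of them in flight at once and
// returns the peak number of jobs observed running concurrently.
func runCapped(n, limit int, job func(i int)) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit jobs are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the job ends
			cur := inFlight.Add(1)
			// Record the highest in-flight count seen so far.
			for {
				old := peak.Load()
				if cur <= old || peak.CompareAndSwap(old, cur) {
					break
				}
			}
			job(i)
			inFlight.Add(-1)
		}(i)
	}
	wg.Wait()
	return peak.Load()
}

// sumSquares is a toy workload: it adds i*i for i in [0, n) under the cap
// and returns the total together with the observed peak concurrency.
func sumSquares(n, limit int) (int64, int64) {
	var total atomic.Int64
	peak := runCapped(n, limit, func(i int) {
		total.Add(int64(i * i))
	})
	return total.Load(), peak
}

func main() {
	total, peak := sumSquares(100, 4)
	fmt.Println("total:", total, "peak in-flight:", peak)
}
```

Capping in-flight work keeps a test's resource use predictable on shared CI machines, at the cost of less aggressive interleaving than an unbounded stress loop.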

Local timing artifacts (for reference)

  • Nextgen baseline: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_nextgen_20260114_122930/parsed/slow_testcases.tsv
  • Nextgen current: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_nextgen_20260116_111713/parsed/slow_testcases.tsv
  • Legacy baseline: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_legacy_20260115_165149/parsed/slow_testcases.tsv
  • Legacy current: ~/.codex-context/tidb--906fbbe4/artifacts/ut_full_legacy_20260116_103514/parsed/slow_testcases.tsv

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix `GET_LOCK()` to return 0 on lock wait timeout and align with MySQL behavior. Also make `tidb_slow_log_max_per_sec` return 0 when unlimited.

@ti-chi-bot ti-chi-bot bot added the release-note, do-not-merge/needs-triage-completed, size/XXL, component/statistics, and sig/planner labels Jan 16, 2026

codecov bot commented Jan 16, 2026

Codecov Report

❌ Patch coverage is 65.80087% with 79 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.4026%. Comparing base (f8627fd) to head (abe1ade).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65601        +/-   ##
================================================
+ Coverage   77.8383%   79.4026%   +1.5642%     
================================================
  Files          1978       1908        -70     
  Lines        542180     531165     -11015     
================================================
- Hits         422024     421759       -265     
+ Misses       118496     107950     -10546     
+ Partials       1660       1456       -204     
Flag          Coverage Δ
integration   47.6671% <7.3913%> (-0.5223%) ⬇️
unit          76.6824% <65.8008%> (+0.2159%) ⬆️


Components   Coverage Δ
dumpling     56.7974% <ø> (ø)
parser       ∅ <ø> (∅)
br           66.3545% <76.3157%> (+5.2406%) ⬆️

@wjhuang2016
Member Author

/test tidb_parser_test


ti-chi-bot bot commented Jan 16, 2026

@wjhuang2016: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build
/test check-dev
/test check-dev2
/test mysql-test
/test pull-br-integration-test
/test pull-build-next-gen
/test pull-integration-ddl-test
/test pull-integration-e2e-test
/test pull-integration-realcluster-test-next-gen
/test pull-lightning-integration-test
/test pull-mysql-client-test
/test pull-mysql-client-test-next-gen
/test pull-unit-test-ddlv1
/test pull-unit-test-next-gen
/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test
/test pull-br-integration-test-next-gen
/test pull-check-deps
/test pull-common-test
/test pull-e2e-test
/test pull-error-log-review
/test pull-integration-common-test
/test pull-integration-copr-test
/test pull-integration-ddl-test-next-gen
/test pull-integration-e2e-test-next-gen
/test pull-integration-jdbc-test
/test pull-integration-mysql-test
/test pull-integration-nodejs-test
/test pull-integration-python-orm-test
/test pull-mysql-test-next-gen
/test pull-sqllogic-test
/test pull-tiflash-integration-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build
pingcap/tidb/ghpr_check
pingcap/tidb/ghpr_check2
pingcap/tidb/ghpr_mysql_test
pingcap/tidb/ghpr_unit_test
pingcap/tidb/pull_br_integration_test
pingcap/tidb/pull_build_next_gen
pingcap/tidb/pull_integration_ddl_test
pingcap/tidb/pull_integration_e2e_test
pingcap/tidb/pull_integration_realcluster_test_next_gen
pingcap/tidb/pull_lightning_integration_test
pingcap/tidb/pull_mysql_client_test
pingcap/tidb/pull_mysql_client_test_next_gen
pingcap/tidb/pull_unit_test_ddlv1
pingcap/tidb/pull_unit_test_next_gen
pull-check-deps
pull-error-log-review

In response to this:

/test tidb_parser_test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


ti-chi-bot bot commented Jan 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft, gmhdbjd, leavrth, terry1purcell, xuhuaiyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ti-chi-bot bot commented Jan 16, 2026

@wjhuang2016: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name              Commit    Details   Required   Rerun command
pull-unit-test-ddlv1   abe1ade   link      true       /test pull-unit-test-ddlv1

Full PR test history. Your PR dashboard.



Development

Successfully merging this pull request may close these issues.

CI unit tests can take hours due to slow/flaky testcases
