
[BP-2.3][FLINK-37892][tests] Fix spurious TimeoutException in TestUtils#waitUntil for non-monotonic conditions #28401

Open
lihaosky wants to merge 1 commit into apache:release-2.3 from confluentinc:FLINK-37892-fix-waituntil-race-2.3

@lihaosky

Contributor

What is the purpose of the change

Backport of #28400 to release-2.3 (clean cherry-pick of 61723a0).

Fixes the failure mode behind FLINK-37892 (SplitFetcherManagerTest#testCloseBlockingWaitingForFetcherShutdown flaking on CI).

TestUtils#waitUntil evaluates the condition twice — once in the polling loop and again to decide whether to throw. A non-monotonic condition that is momentarily true and then false again can exit the loop on the first evaluation and fail the second, producing a TimeoutException within milliseconds despite a 30s budget. SplitFetcherManagerTest passes exactly such a condition (findThread(THREAD_NAME_PREFIX).size() == 2 — a live thread count that can flicker as fetcher threads start/exit).
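The race can be reproduced deterministically without threads. The sketch below is a hypothetical reconstruction of the double-evaluation pattern described above, not the actual Flink source: the class name, method name, and the 1 ms sleep are assumptions made for illustration. A counter-based condition stands in for the flickering thread count.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical reconstruction of the double-evaluation pattern; not the Flink source. */
class DoubleEvaluationSketch {

    static void waitUntilEvaluatingTwice(
            Supplier<Boolean> condition, Duration timeout, String errorMsg)
            throws TimeoutException, InterruptedException {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        // First evaluation: leave the loop as soon as the condition is seen as true.
        while (!condition.get() && System.nanoTime() - deadlineNanos < 0) {
            Thread.sleep(1);
        }
        // Second evaluation: a condition that flipped back to false is reported
        // as a timeout here, even though almost none of the budget has elapsed.
        if (!condition.get()) {
            throw new TimeoutException(errorMsg);
        }
    }

    public static void main(String[] args) throws Exception {
        // True on the third call only, standing in for a momentary thread count match.
        AtomicInteger calls = new AtomicInteger();
        Supplier<Boolean> flicker = () -> calls.incrementAndGet() == 3;
        try {
            waitUntilEvaluatingTwice(flicker, Duration.ofSeconds(30), "condition not met");
            System.out.println("completed");
        } catch (TimeoutException e) {
            // The loop exits on call 3; the re-check on call 4 sees false.
            System.out.println("spurious timeout after " + calls.get() + " evaluations");
        }
    }
}
```

With this stand-in condition the wait fails after four evaluations and a few milliseconds, which mirrors the sub-second "timeouts" seen on CI.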

Brief change log

  • Restructure TestUtils#waitUntil(Supplier, Duration, String) to evaluate the condition once per iteration and latch the result: a condition observed as true always completes the wait; TimeoutException is thrown only when the condition was actually false at the deadline.
  • Add TestUtilsTest with a regression test whose condition is satisfied exactly once and false afterwards.
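A minimal sketch of the single-evaluation structure described in the change log, assuming a simple sleep-based polling loop. The class name and the 1 ms polling interval are hypothetical, and the real TestUtils#waitUntil may differ in detail.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical sketch of a wait loop that evaluates the condition once per iteration. */
class LatchedWaitSketch {

    static void waitUntil(Supplier<Boolean> condition, Duration timeout, String errorMsg)
            throws TimeoutException, InterruptedException {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            // The only evaluation in this iteration; a true result ends the wait
            // immediately and is never re-checked.
            if (condition.get()) {
                return;
            }
            // Reached only when the condition was just observed as false.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException(errorMsg);
            }
            Thread.sleep(1); // assumed polling interval
        }
    }
}
```

Because the return happens on the same evaluation that observed true, a condition that later flips back to false cannot turn a successful wait into a timeout.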

Verifying this change

This change added tests and can be verified as follows:

  • TestUtilsTest#testWaitUntilSucceedsForConditionSatisfiedOnlyOnce fails against the old implementation with the same spurious TimeoutException (in ~0.04s) and passes with the fix.
  • SplitFetcherManagerTest passes against the fixed utility.
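The idea behind the regression test can be illustrated with a condition that is true exactly once. The sketch below is not the actual TestUtilsTest: the WaitFn interface and both stand-in waiters are hypothetical and deliberately degenerate (no polling loop), so that only the number of evaluations differs between them.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical harness showing how a one-shot condition separates the two behaviours. */
class OneShotConditionSketch {

    /** Hypothetical abstraction over a wait utility under test. */
    interface WaitFn {
        void await(Supplier<Boolean> condition, Duration timeout) throws Exception;
    }

    /** Stand-in that evaluates the condition once and trusts the result. */
    static final WaitFn SINGLE_EVALUATION = (condition, timeout) -> {
        if (!condition.get()) {
            throw new TimeoutException("never true");
        }
    };

    /** Stand-in that evaluates the condition a second time before deciding. */
    static final WaitFn RECHECKING = (condition, timeout) -> {
        condition.get(); // first observation, result discarded
        if (!condition.get()) {
            throw new TimeoutException("re-check failed");
        }
    };

    /** Returns true if the waiter completes for a condition that is true exactly once. */
    static boolean completesForOneShotCondition(WaitFn wait) {
        AtomicBoolean consumed = new AtomicBoolean(false);
        // compareAndSet succeeds on the first call only, so later calls return false.
        Supplier<Boolean> trueExactlyOnce = () -> consumed.compareAndSet(false, true);
        try {
            wait.await(trueExactlyOnce, Duration.ofSeconds(30));
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "single evaluation completes: true"
        System.out.println("single evaluation completes: "
                + completesForOneShotCondition(SINGLE_EVALUATION));
        // prints "re-check completes: false"
        System.out.println("re-check completes: "
                + completesForOneShotCondition(RECHECKING));
    }
}
```

A one-shot condition is a convenient worst case: any implementation that consults the condition again after seeing it true is guaranteed to fail, with no dependence on thread timing.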

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

Yes

…ntil for non-monotonic conditions (apache#28400)

TestUtils#waitUntil evaluated the condition twice: once in the polling
loop and again to decide whether to throw. A condition that is
momentarily true and then false again (e.g. a live thread count matched
with ==) could exit the loop on the first evaluation and then fail the
second one, producing a TimeoutException within milliseconds despite a
30s budget. This is the failure mode behind
SplitFetcherManagerTest#testCloseBlockingWaitingForFetcherShutdown
flaking on CI with sub-second 'timeouts'.

Evaluate the condition once per iteration and latch the result, so a
condition observed as true always completes the wait. Add a regression
test with a condition that is satisfied exactly once.

(cherry picked from commit 61723a0)
@flinkbot

flinkbot commented Jun 12, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build
