[FLINK-39902][tests] Fix race in RescaleTimelineITCase.testRescaleTerminatedByJobFinished #28378

Open
MartijnVisser wants to merge 1 commit into apache:master from MartijnVisser:flink-39902-rescale-jobfinished

@MartijnVisser

Contributor

What is the purpose of the change

RescaleTimelineITCase.testRescaleTerminatedByJobFinished is flaky on slow CI. The test requests an upscale to a parallelism that exceeds the available slots, so the rescale never changes the running parallelism and is observable only as a recorded history entry. The test then unblocks the no-op task immediately, which races with the scheduler recording that second rescale. On a slow machine the job finishes before the rescale is recorded, so the history stays at size 1 and the wait for size 2 times out.

Brief change log

  • Wait until the second rescale has been recorded (history size 2) before unblocking the task, so the in-progress rescale resolves to JOB_FINISHED once the job finishes.
  • Move the assumeThat(enabledRescaleHistory(...)) skip ahead of the requirement update so the disabled-history variant skips cleanly.
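
The ordering described in the change log can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions, not the code in this PR: the class `RescaleOrderingSketch`, the helpers `waitUntil` and `runScenario`, and the placeholder states `FIRST_RESCALE` and `IN_PROGRESS` are all hypothetical and are not Flink APIs. Only `JOB_FINISHED` comes from the description above. The early return for the disabled-history case stands in for the `assumeThat` skip, which in a real test would abort the test rather than return a value.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical simulation of the test ordering; none of these names are Flink APIs. */
final class RescaleOrderingSketch {

    /** Polls until the condition holds, or fails once the timeout has elapsed. */
    static void waitUntil(BooleanSupplier condition, Duration timeout)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline > 0) {
                throw new TimeoutException("condition not met within " + timeout);
            }
            Thread.sleep(10);
        }
    }

    /** Runs the simulated scenario and returns the recorded rescale history. */
    static List<String> runScenario(boolean historyEnabled, long recordDelayMillis)
            throws Exception {
        // Stand-in for the assumeThat(...) skip: bail out before any update or
        // wait, because with history disabled the size-2 wait could never succeed.
        if (!historyEnabled) {
            return List.of();
        }

        List<String> history = new CopyOnWriteArrayList<>();
        history.add("FIRST_RESCALE"); // placeholder for the already recorded entry
        CountDownLatch taskBlocked = new CountDownLatch(1);

        // Simulated scheduler: records the second rescale after a delay (slow CI).
        Thread scheduler = new Thread(() -> {
            try {
                Thread.sleep(recordDelayMillis);
                history.add("IN_PROGRESS");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Simulated job: finishes once unblocked and resolves an in-progress rescale.
        Thread job = new Thread(() -> {
            try {
                taskBlocked.await();
                int last = history.size() - 1;
                if ("IN_PROGRESS".equals(history.get(last))) {
                    history.set(last, "JOB_FINISHED");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        scheduler.start();
        job.start();

        // The ordering fix: wait for the second entry BEFORE unblocking the task.
        waitUntil(() -> history.size() == 2, Duration.ofSeconds(10));
        taskBlocked.countDown();

        job.join();
        scheduler.join();
        return history;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runScenario(true, 200)); // [FIRST_RESCALE, JOB_FINISHED]
    }
}
```

In this sketch, moving `taskBlocked.countDown()` above the `waitUntil` call reintroduces the race: the simulated job can finish while the history still has one entry, so nothing is resolved to `JOB_FINISHED`.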

Verifying this change

Existing assertions are unchanged; the fix only enforces the ordering the sibling tests already get. Verified by running testRescaleTerminatedByJobFinished repeatedly in a loop locally without failure.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.8 (1M context))

Generated-by: Claude Opus 4.8 (1M context)

…minatedByJobFinished

Unblocking the task raced with the scheduler recording the second rescale; on
a slow machine the job finished first, leaving the history at size 1 and
timing out the wait. Wait for the rescale to be recorded before unblocking.

The assumeThat(enabledRescaleHistory) check had to move before the update RPC
because the new size-2 history wait is only meaningful when rescale history is
enabled; for the disabled parameter the history never grows and the wait would
hang.

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot

flinkbot commented Jun 10, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build


2 participants