[FLINK-38408][checkpoint] Complete the checkpoint CompletableFuture after updating statistics to ensures semantic correctness and prevent test failure #27050

1996fanrui · 2025-09-26T15:27:00Z

What is the purpose of the change

MapStateNullValueCheckpointingITCase failed with "No checkpoint was created yet".

Root Cause Analysis

Problem Location

Log analysis revealed that the checkpoint had actually completed successfully:

07:19:37,522 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job b809cf46d67c23697786fd514565c737 (4464 bytes, checkpointDuration=45 ms, finalizationTime=4 ms)

However, the test code could not find the completed checkpoint when calling CommonTestUtils.getLatestCompletedCheckpointPath().

Root Cause

The problem occurs in the execution order of the CheckpointCoordinator.completePendingCheckpoint() method:

flink/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

Line 1389 in 39a4628

reportCompletedCheckpoint(completedCheckpoint);

pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);
reportCompletedCheckpoint(completedCheckpoint);

Checkpoint Coordinator mechanism:

A: pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint) completes the completion future first.
B: reportCompletedCheckpoint(completedCheckpoint) updates checkpoint statistics.

Test code timeline:

C: Detect future completion
D: Call getLatestCompletedCheckpointPath() immediately

Usually the execution sequence is A -> B -> C -> D, and everything works.

The bug occurs when the execution sequence is A -> C -> D -> B.

Reproduction Method

In the completePendingCheckpoint() method, inserting Thread.sleep(100) between complete() and reportCompletedCheckpoint() reproduces the issue every time.
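The ordering problem can be illustrated outside Flink with a minimal, single-threaded sketch. Everything in it is hypothetical: CompletionOrderSketch, its fields, and the path string are stand-ins invented for illustration, not Flink classes. A callback registered with thenAccept plays the role of the test code (C and D); because it runs inside complete(), it makes the A -> C -> D -> B interleaving deterministic.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the coordinator; this is not Flink code. */
final class CompletionOrderSketch {

    /** Plays the role of the statistics that reportCompletedCheckpoint() updates. */
    private final AtomicReference<String> latestCompletedPath = new AtomicReference<>();

    private final CompletableFuture<Long> completionFuture = new CompletableFuture<>();

    /** Old order: A (complete the future) runs before B (update statistics). */
    void completeThenReport(long checkpointId, String path) {
        completionFuture.complete(checkpointId); // A
        latestCompletedPath.set(path);           // B
    }

    /** New order: B (update statistics) runs before A (complete the future). */
    void reportThenComplete(long checkpointId, String path) {
        latestCompletedPath.set(path);           // B
        completionFuture.complete(checkpointId); // A
    }

    /**
     * Returns what a waiter observes when it reads the statistics as soon as
     * the future completes (steps C and D), or "<none>" if nothing was published yet.
     */
    static String observedByWaiter(boolean reportFirst) {
        CompletionOrderSketch coordinator = new CompletionOrderSketch();
        AtomicReference<String> observed = new AtomicReference<>("<not called>");

        // Registered before completion, so in this single-threaded sketch the
        // callback runs inside complete(), i.e. before any statement after it.
        coordinator.completionFuture.thenAccept(id -> {
            String path = coordinator.latestCompletedPath.get();
            observed.set(path == null ? "<none>" : path);
        });

        if (reportFirst) {
            coordinator.reportThenComplete(1L, "/checkpoints/chk-1");
        } else {
            coordinator.completeThenReport(1L, "/checkpoints/chk-1");
        }
        return observed.get();
    }
}
```

Calling observedByWaiter(false) models the old order and yields "<none>", which corresponds to the "No checkpoint was created yet" symptom; observedByWaiter(true) models the new order and yields the published path.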

Brief change log: Adjust the execution order in CheckpointCoordinator

[FLINK-38408][checkpoint] Complete the checkpoint CompletableFuture after updating statistics to ensures semantic correctness and prevent test failure

Changes:

// Update statistics first 
reportCompletedCheckpoint(completedCheckpoint);
// Complete the future later
pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);

Benefits:

Eliminates the race between future completion and the statistics update at its root
Ensures semantic correctness: Waiting parties are notified only when the checkpoint is fully processed

Verifying this change

Added testCompletionFutureCompletesAfterReporting

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

flinkbot · 2025-09-26T15:32:30Z

CI report:

2036713 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

Izeren

Thank you for the change @1996fanrui. Overall LGTM; my main concern is potential test flakiness. PTAL.

Izeren · 2025-10-05T18:14:27Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

                lastSubsumed = null;
            }

-            pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);


I am concerned that a change like this could have side effects such as:

Deadlock / race condition if reportCompletedCheckpoint would trigger any handler that also waits on the checkpoint future before its completion (in general, unlikely situation, and should be caught by existing test)

Checkpoint completion will be slightly delayed, but reporting is a quick operation, so doesn't seem to be critical

If reporting throws exception it will result in checkpoint being completed exceptionally. Could we confirm that this behaviour matches the previous one?

Deadlock / race condition if reportCompletedCheckpoint would trigger any handler that also waits on the checkpoint future before its completion (in general, unlikely situation, and should be caught by existing test)

Yes, it is an unlikely situation. reportCompletedCheckpoint takes a single parameter, completedCheckpoint, so it has no access to pendingCheckpoint.getCompletionFuture().

Checkpoint completion will be slightly delayed, but reporting is a quick operation, so doesn't seem to be critical

Generally, both operations are quick. Completing a CompletableFuture is, of course, extremely quick.

If reporting throws exception it will result in checkpoint being completed exceptionally. Could we confirm that this behaviour matches the previous one?

Judging from the code, the behavior does change in this case, but I think the new behavior makes more sense.

Before this PR, the CompletableFuture was completed even if the checkpoint had not been reported yet or the report failed. That is semantically wrong: a client is told that checkpoint 10 (or X) is completed, then gets nothing when it fetches more metadata for checkpoint 10. (That is why MapStateNullValueCheckpointingITCase fails occasionally.)

After this PR, a client can fetch the correct result as soon as it receives the completion signal.

The semantics are clearer.
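The behavior change for a failing report step can be sketched as follows. This is an illustration under stated assumptions, not the coordinator's actual error handling: ReportFailureSketch, the Reporter interface, and the catch block that calls completeExceptionally are all hypothetical. The sketch only shows how the position of complete() relative to a throwing step determines what waiters observe.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of the two orderings; this is not Flink code. */
final class ReportFailureSketch {

    /** Hypothetical reporting step that may fail. */
    interface Reporter {
        void report(long checkpointId) throws Exception;
    }

    /** Old order: waiters already see success before reporting can fail. */
    static CompletableFuture<Long> completeThenReport(long checkpointId, Reporter reporter) {
        CompletableFuture<Long> future = new CompletableFuture<>();
        try {
            future.complete(checkpointId);
            reporter.report(checkpointId);
        } catch (Exception e) {
            // Assumed handling. It has no effect here: the future is already completed.
            future.completeExceptionally(e);
        }
        return future;
    }

    /** New order: a failing report prevents the successful completion. */
    static CompletableFuture<Long> reportThenComplete(long checkpointId, Reporter reporter) {
        CompletableFuture<Long> future = new CompletableFuture<>();
        try {
            reporter.report(checkpointId);
            future.complete(checkpointId);
        } catch (Exception e) {
            // Assumed handling: surface the reporting failure to waiters.
            future.completeExceptionally(e);
        }
        return future;
    }
}
```

With a reporter that always throws, completeThenReport returns a normally completed future, while reportThenComplete returns an exceptionally completed one, which matches the difference discussed in this thread.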

Izeren · 2025-10-05T18:16:31Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+                            }
+                        });
+
+        assertThat(tracker.getReportStartedFuture().get(20, TimeUnit.SECONDS))


This is likely to end up as a flaky test. Tests in CI can freeze for 15 minutes or more, so a 20-second timeout may not be sufficient in general.
I suggest using an effectively unbounded timeout of at least a few hours.

I would like to distinguish two types of CI timeouts: one is the total timeout of the CI run, and the other is the timeout of a single unit test or a logical timeout within a unit test.

For the former, I think 15 minutes or more is reasonable.

For the latter, if each single test may take up to 15 minutes, the total CI duration will be terrible. Flink may have more than 10k tests.

Tests are sometimes unstable, but not that bad. The default policy of Flink CI is to fail if there is no progress for 15 consecutive minutes. However, such stalls are generally caused by bugs, for example a deadlock. It is rare to see a CI process stuck for 15 minutes due to a lack of resources.

In this case, tracker.getReportStartedFuture() completes very quickly on my local machine, always in less than 100 ms, which is why I think 20 seconds is safe here.

As for other examples, I checked some callers [1][2] in the Flink code: some use 10 seconds and some use 60 seconds. I could raise the timeout from 20 seconds to 60 seconds if you think 20 seconds is not safe enough. TBH, 15 minutes is a little long; it will delay the exception or the CI result if there is a bug.

[1]

flink/flink-clients/src/test/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrapTest.java

Line 79 in cc55c56

private static final int TIMEOUT_SECONDS = 10;

[2]

flink/flink-tests/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTriggerSavepointITCase.java

Line 245 in cc55c56

clusterClient.getJobStatus(jobGraph.getJobID()).get(60, TimeUnit.SECONDS);

The reason I brought it up is that I got similar feedback from @dmvk in the past. He pointed out that a CI VM can freeze for 15 minutes even if the test is quick, because multiple tests are running and there is no guarantee that your particular test will always be executed quickly.

My overall view is the following: if an otherwise quick test takes longer than 15 minutes, then either it hit something like a deadlock, or the overall CI run was impacted by a "bad change" or "external factors". Unless you have a deadlock in your own test, the whole CI run is more likely to time out than not, so a long timeout doesn't make things worse. When you do have a deadlock, a per-test timeout lets you verify more tests in a failed run, which is beneficial, but the benefit is limited to the non-parallel suite that fails.

To sum up, I don't see a big difference between 15 minutes and 1 hour, but 20 seconds is very likely not enough.

I understand your concern about CI stability. While I still think 15 minutes is quite conservative for this specific case, I'm willing to use it for now to unblock the fix. We can always revisit this timeout if we observe issues in practice. I'll update it to 15 minutes. Thanks for the discussion.

Izeren · 2025-10-05T18:26:07Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+                .as("reportCompletedCheckpoint should be started soon when checkpoint is acked.")
+                .isNull();
+
+        for (int i = 0; i < 30; i++) {


Similarly to the above, I am not sure you can tell whether the expected change did not occur because it was blocked or because the corresponding thread was inactive. It would be better to wait indefinitely here.

The sleep here covers the opposite case, where we expect the CompletableFuture not to complete. The sleep is therefore used intentionally to verify that the future is not done.

If the sleep were set to 10 minutes, the test would take 10 minutes. Here it is 3 seconds, which is generally enough to check that the CompletableFuture is not done.

In this case, how would your test distinguish between: Future wasn't complete because of "happens before" condition vs Future wasn't complete because VM froze and responsible thread was not making progress for more than 3 seconds.

I am less concerned about this one as it shouldn't introduce flakiness, but testing it this way you have weaker guarantees of "happens before" condition being actually tested.

The purpose of this test is that if isDone occurs here, there must be a bug. It makes developers aware of the bug.

how would your test distinguish between: Future wasn't complete because of "happens before" condition vs Future wasn't complete because VM froze and responsible thread was not making progress for more than 3 seconds.

With the current test, it is not possible to distinguish them. Here we are testing that an unexpected case did not occur. If the VM freezes, neither the expected nor the unexpected case is executed, so I really do not know how to distinguish them.

but testing it this way you have weaker guarantees of "happens before" condition being actually tested.

I would also like to avoid sleep in tests as much as possible. However, I haven't figured out how to replace it with a CompletableFuture or CountDownLatch.

Do you have any suggestions on this? I'd really appreciate a better alternative.

If it is non-trivial to rewrite this test without a busy wait, I am happy to accept the test implementation as is. Most of the time this test will not be subject to a VM freeze. Even if some bad change is lucky enough to get a green CI run and be merged, we will see the test go red in other runs shortly after, so we should still be able to pinpoint the issue promptly. I don't have objections to keeping it as is.
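For readers unfamiliar with the pattern discussed in this thread, the sketch below shows one possible shape of a "report started" / "report blocking" tracker. It is a simplified model, not the test from this PR: BlockingReportSketch and BlockingTracker are hypothetical, only the accessor names getReportStartedFuture and getReportBlockingFuture are taken from the diff, and the single worker thread stands in for the coordinator. The 15-minute timeout follows the value agreed on in the discussion above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the test setup; this is not the Flink test. */
final class BlockingReportSketch {

    /** Signals when reporting starts, then blocks until the test releases it. */
    static final class BlockingTracker {
        private final CompletableFuture<Void> reportStarted = new CompletableFuture<>();
        private final CompletableFuture<Void> reportBlocking = new CompletableFuture<>();

        CompletableFuture<Void> getReportStartedFuture() { return reportStarted; }

        CompletableFuture<Void> getReportBlockingFuture() { return reportBlocking; }

        void reportCompletedCheckpoint(long checkpointId) {
            reportStarted.complete(null);
            reportBlocking.join(); // held here until the test completes reportBlocking
        }
    }

    /** Returns true if the future was incomplete while reporting was blocked and completed afterwards. */
    static boolean futureCompletesOnlyAfterReport() {
        BlockingTracker tracker = new BlockingTracker();
        CompletableFuture<Long> checkpointFuture = new CompletableFuture<>();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Worker models the new order: report first, complete the future later.
            executor.execute(() -> {
                tracker.reportCompletedCheckpoint(1L);
                checkpointFuture.complete(1L);
            });

            // Wait until reporting has started; the worker is now held in the report step.
            tracker.getReportStartedFuture().get(15, TimeUnit.MINUTES);
            boolean incompleteWhileBlocked = !checkpointFuture.isDone();

            // Release the report step and wait for the checkpoint future.
            tracker.getReportBlockingFuture().complete(null);
            long id = checkpointFuture.get(15, TimeUnit.MINUTES);
            return incompleteWhileBlocked && id == 1L;
        } catch (Exception e) {
            throw new IllegalStateException("sketch failed unexpectedly", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In this simplified model the worker runs both steps sequentially on one thread, so a single isDone() check taken after the started signal is enough to tell the two orders apart. Whether the same holds for the real coordinator test depends on details that are not shown in this conversation.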

Izeren · 2025-10-05T18:26:57Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+
+        tracker.getReportBlockingFuture().complete(null);
+
+        CompletedCheckpoint result = checkpointFuture.get(5, TimeUnit.SECONDS);


Izeren · 2025-10-05T18:27:02Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+                .as("Checkpoint future should complete after reportCompletedCheckpoint finishes")
+                .isNotNull();
+
+        ackTask.get(5, TimeUnit.SECONDS);


Izeren

LGTM, thank you @1996fanrui for the change!

1996fanrui · 2025-10-16T08:38:44Z

Thanks @Izeren for the review, merging

1996fanrui force-pushed the 38408/no-checkpoint branch from dc2613d to 9afe37c Compare September 29, 2025 16:05

1996fanrui marked this pull request as ready for review September 29, 2025 18:54

1996fanrui marked this pull request as draft September 30, 2025 09:07

1996fanrui force-pushed the 38408/no-checkpoint branch 2 times, most recently from 736fe94 to b0e8240 Compare October 2, 2025 09:06

1996fanrui marked this pull request as ready for review October 2, 2025 09:13

Izeren reviewed Oct 5, 2025

github-actions bot added the community-reviewed PR has been reviewed by the community. label Oct 7, 2025

1996fanrui force-pushed the 38408/no-checkpoint branch from b0e8240 to 551ed1f Compare October 14, 2025 12:53

Izeren approved these changes Oct 15, 2025

1996fanrui force-pushed the 38408/no-checkpoint branch 2 times, most recently from 9443a22 to e41d3df Compare October 15, 2025 16:28

[FLINK-38408][checkpoint] Complete the checkpoint CompletableFuture a…

2036713

…fter updating statistics to ensures semantic correctness and prevent test failure

1996fanrui force-pushed the 38408/no-checkpoint branch from e41d3df to 2036713 Compare October 15, 2025 22:46

1996fanrui merged commit e44d638 into apache:master Oct 16, 2025

