[FLINK-38483][checkpoint] Fix the bug that Job cannot be recovered from unaligned checkpoint due to `Cannot get old subtasks from a descriptor that represents no state.` exception #27084
base: master
@flinkbot run azure
I'm just prepping 2.1.1 release and seems like it would be good to punt in before cut. I'm going to take a look tomorrow morning...
public int[] getOldSubtaskInstances() {
-    throw new UnsupportedOperationException(
-            "Cannot get old subtasks from a descriptor that represents no state.");
+    return new int[0];
Maybe we can eliminate some GC:
private static final int[] EMPTY_INT_ARRAY = new int[0];
@Override
public int[] getOldSubtaskInstances() {
return EMPTY_INT_ARRAY;
}
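A minimal sketch of why the suggestion above is safe, using a hypothetical stand-in class (`SharedEmptyArraySketch` is not a Flink class): a zero-length array has no elements that a caller could mutate, so one shared instance can be handed out on every call.

```java
/** Hypothetical stand-in for the NO_STATE descriptor; not the actual Flink class. */
final class SharedEmptyArraySketch {
    // A zero-length array has no mutable elements, so sharing one instance is safe.
    private static final int[] EMPTY_INT_ARRAY = new int[0];

    /** Returns the same empty array on every call instead of allocating a new one. */
    static int[] getOldSubtaskInstances() {
        return EMPTY_INT_ARRAY;
    }

    public static void main(String[] args) {
        int[] first = getOldSubtaskInstances();
        int[] second = getOldSubtaskInstances();
        System.out.println(first.length);    // prints 0
        System.out.println(first == second); // prints true: same instance, no new allocation
    }
}
```

Whether the saved allocation matters in practice depends on how often the method is called; the sketch only shows that the change does not alter behavior for callers.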
…om unaligned checkpoint due to `Cannot get old subtasks from a descriptor that represents no state.` exception
a7e62db
to
5adba03
I'm just prepping 2.1.1 release and seems like it would be good to punt in before cut. I'm going to take a look tomorrow morning...
Thanks for driving the release; it makes sense to include this.
I have addressed your comment.
LGTM.
Hmmm, still some unrelated issues in azure pipe
@flinkbot run azure
WordCount is definitely not using unaligned checkpoint, so most probably we just need to kick the pipe through the flaky tests
2 issues about CI:
After green CI good to go🚢
Hey @1996fanrui @gaborgsomogyi, do you know the latest on the CI failures? I’m trying to understand how to avoid failures that don’t seem related to the PR, and how to tell if they’re caused by flaky tests. In my case, the CI failed but I don’t see any test failures: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70091&view=ms.vss-test-web.build-test-results-tab
It may be caused by the test environment. I checked the CI list [1], no CI is green since I saw lot of
[1] https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2&_a=summary
Thanks for the info @1996fanrui!
approve assuming the CI errors are unrelated.
FYI, I've made an attempt to fix it on master: #27095
What is the purpose of the change
A job cannot be recovered from an unaligned checkpoint and fails with the exception: `Cannot get old subtasks from a descriptor that represents no state.`
[FLINK-38483][checkpoint] Fix the bug that Job cannot be recovered from unaligned checkpoint due to `Cannot get old subtasks from a descriptor that represents no state.` exception
Root Cause Analysis
Technical Background
The issue stems from the `NO_STATE` descriptor implementation in `InflightDataRescalingDescriptor`. When processing unaligned checkpoints during rescaling, the system encounters gates or partitions that have no inflight data. The `NO_STATE` descriptor was designed to represent these empty states.
The Problem
The original `NO_STATE` descriptor implementation threw `UnsupportedOperationException` for both the `getOldSubtaskInstances()` and `getRescaleMappings()` methods.
This design choice was problematic because the rescaling logic expects values. For example, a job may have mixed hash exchanges where some partitions are empty (due to filtering, low data volume, or scaling effects) while others contain data.
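The failure mode described above can be sketched as follows. This is an illustrative model only: `GateDescriptor`, `THROWING_NO_STATE`, and `countOldSubtasks` are hypothetical names, not Flink's API; only the exception message is taken from the PR.

```java
import java.util.List;

final class NoStateProblemSketch {
    /** Hypothetical, simplified view of a per-gate rescaling descriptor. */
    interface GateDescriptor {
        int[] getOldSubtaskInstances();
    }

    /** Models the pre-fix behavior: asking the empty descriptor for old subtasks throws. */
    static final GateDescriptor THROWING_NO_STATE = () -> {
        throw new UnsupportedOperationException(
                "Cannot get old subtasks from a descriptor that represents no state.");
    };

    /** Hypothetical caller that visits every gate, including gates without inflight data. */
    static int countOldSubtasks(List<GateDescriptor> gates) {
        int total = 0;
        for (GateDescriptor gate : gates) {
            total += gate.getOldSubtaskInstances().length;
        }
        return total;
    }

    public static void main(String[] args) {
        GateDescriptor withData = () -> new int[] {0, 1};
        try {
            // One gate carries data, the other is empty: the mixed-exchange situation.
            countOldSubtasks(List.of(withData, THROWING_NO_STATE));
        } catch (UnsupportedOperationException e) {
            System.out.println("Recovery failed: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that a caller iterating over all gates cannot tell in advance which descriptor is the empty one, so a throwing accessor turns an ordinary "nothing to do" case into a failure.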
Reproduction Case
The issue can be consistently reproduced using `org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleWithMixedExchangesITCase.createPartEmptyHashExchangeDAG`.
This test creates a DAG where the downstream MapAfterKeyBy task receives input from two hash exchanges: one with actual data and one that is empty due to filtering.
Solution Implementation
The Fix
The solution replaces the exception-throwing behavior with safe default values:
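A minimal sketch of the idea, under stated assumptions: the `Mapping` enum, the `GateDescriptor` interface, and the helper methods are hypothetical stand-ins, and only the names `NO_STATE`, `getOldSubtaskInstances()`, `getRescaleMappings()`, and `SYMMETRIC_IDENTITY` come from this PR. The actual Flink types differ.

```java
import java.util.List;

final class NoStateFixSketch {
    /** Hypothetical placeholder for the rescale mapping type. */
    enum Mapping { SYMMETRIC_IDENTITY }

    /** Hypothetical, simplified view of a per-gate rescaling descriptor. */
    interface GateDescriptor {
        int[] getOldSubtaskInstances();
        Mapping getRescaleMappings();
    }

    /** NO_STATE after the fix: harmless defaults instead of exceptions. */
    static final GateDescriptor NO_STATE = new GateDescriptor() {
        private final int[] noSubtasks = new int[0];

        @Override
        public int[] getOldSubtaskInstances() {
            return noSubtasks; // "no old subtasks"
        }

        @Override
        public Mapping getRescaleMappings() {
            return Mapping.SYMMETRIC_IDENTITY; // "no rescaling needed"
        }
    };

    /** Hypothetical descriptor for a gate that does have old subtasks. */
    static GateDescriptor withSubtasks(int... oldSubtasks) {
        return new GateDescriptor() {
            @Override
            public int[] getOldSubtaskInstances() {
                return oldSubtasks;
            }

            @Override
            public Mapping getRescaleMappings() {
                return Mapping.SYMMETRIC_IDENTITY;
            }
        };
    }

    /** Hypothetical caller that visits every gate, including empty ones. */
    static int countOldSubtasks(List<GateDescriptor> gates) {
        int total = 0;
        for (GateDescriptor gate : gates) {
            total += gate.getOldSubtaskInstances().length;
        }
        return total;
    }

    public static void main(String[] args) {
        // The empty gate now contributes nothing instead of aborting the loop.
        System.out.println(countOldSubtasks(List.of(withSubtasks(0, 1), NO_STATE))); // prints 2
    }
}
```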
Why This Solution Is Risk-Free
- Semantic Correctness: An empty result from `getOldSubtaskInstances()` correctly represents "no old subtasks", and the `SYMMETRIC_IDENTITY` mapping correctly represents "no rescaling needed".
- No State Guarantee: The `NO_STATE` descriptor is only used when there is genuinely no inflight data to process.
- Existing Pattern: This approach maintains consistency with other parts of the codebase.
Consistency with Existing Code
The solution aligns with the existing `NoRescalingDescriptor` implementation, which already uses the same pattern.
Verifying this change
`UnalignedCheckpointRescaleWithMixedExchangesITCase.createPartEmptyHashExchangeDAG`.
Does this pull request potentially affect one of the following parts:
`@Public(Evolving)`: no
Documentation